>> Michael Auli: So today we have Liang Huang here as a speaker, and Liang is currently an assistant professor at the City University of New York. Before, he was at ISI, and I think before that he was at UPenn. >> Liang Huang: I was at Google. >> Michael Auli: At Google, as well, very briefly. Yes. And Liang is very well known for his work on large-scale discriminative training, which he will be talking about today, but also on parsing and on efficient algorithms for machine translation. So without further ado, please join me in giving Liang a warm welcome. >> Liang Huang: Thank you. Thank you, Michael. Thanks, everybody, for coming to my talk on a Friday morning. I was giving another talk at UW yesterday on parsing and machine learning, and this talk today will be on the application of those algorithms to translation. And the title sounds very technical, like max-violation perceptron, blah, blah, blah for scalable blah, blah, blah, but the real, more intuitive version of the title says large-scale lexicalized discriminative training for machine translation is finally made successful for the very first time. That last part, I think, is the take-home version of the talk. And before I talk about anything technical, I always have a lot of jokes for translation, and this time I will show these jokes in a way that you can actually tell what kind of technology is behind each one and what kind of error is behind it. So this first one is clearly an ATM machine in China, but the sign reads Help Oneself Terminating Machine. But if you look at the gloss, it's actually not that bad. If you read Chinese, so [indiscernible] came in, it's not that bad. So it's a self-help terminal device. But it's not something that you can use to help yourself terminate yourself, so this means that translation has to be done in context, and ideally, it should be done with understanding of the source language. So this is basically a word-based translation, and I think from a very old online website or whatever. That's even before phrase-based translation. Maybe it's rule based. I don't know. But the next few examples I think are phrase-based examples from either Bing Translator or Google Translate, so seating reserved for consumption of McDonald's guest only. This is a typical PP-attachment problem on the source side. This is clearly not human translation, right? Nobody is going to be that creative, and please check out the cashier. That's apparently a phrase-based translation, with very typical phrase-boundary errors, and tons of stuff like xiaoxin, be careful. In China, people always ask you to do these dangerous activities carefully, in case you have to do them, right? So slip carefully, fall into water carefully, blah, blah, blah. And if you try them on Google or Bing, you get roughly speaking the same. You get carefully slip for this one. You get fall into water carefully, something like that. So my rule of survival is that if you don't read Chinese and you go to China, if you see something like X carefully, where X is a verb phrase, just don't do it. You'll be fine. And this problem is more interesting. So why is it click here to visit? Do you guys know why? Because it's trained on web text, and you have tons of click this button, click this link, to enter something. And this is in a museum, in a Chinese museum, so it should be like go here or follow this direction, but it's a domain adaptation problem. So that's very typical. 
And you can actually see from these jokes, you can actually debug what's going on behind the scenes. This one, actually, I couldn't figure out what kind of problem it is, but it's very funny, explosive dog. I couldn't figure out what's the technology behind it. But my all-time favorite must be this one, translate server error, which I guess most people are already very familiar with. That's a cafeteria in China, and unfortunately maybe the Bing Translator server was down that day and the owner didn't know that, and he just put it up. It's becoming the most famous cafeteria in China, called Translate Server Error cafeteria. But I like these examples not just because they are funny, but also because they are the best evidence that MT technology is used in people's daily lives, because look at this and look at these. Look at these examples and these, and this one and that one, there's no way that a human being can translate these things, regardless of how bad his English is. There's no way that he could be that creative. So this is clearly machine translation, and you can see a lot of the problems with machine translation that we work on, like domain adaptation, like the language model, like the syntax and semantics of the source side and stuff like that. So, really, machine translation is becoming more and more useful and involved in people's daily lives, but its quality is not good enough, so what can we do? Sure. >>: I think there's a counterexample. Are you familiar with a book called English as She is Spoke? It's a guide on how to translate from Portuguese into English, written by a guy who only spoke French and Portuguese, so humans can make really, really crazy translations if they don't really speak the language. >> Liang Huang: Okay, maybe there are more levels. >>: You should check it. There are some great translation examples from it. >> Liang Huang: Sure, sure, sure. It may be very funny. Maybe next time I will have those examples. But if you look at these examples, they all involve the Chinese word xiaoxin, which means either be aware of, be cautious or be careful. But, really, it has to be done in context, so that you know the syntactic category of the phrase after xiaoxin. If it's a noun phrase, it's like be aware of dog, but if it's a verb phrase, it should be be careful not to do something. So really, you should know the syntactic category of this word, of this phrase. So for translation, you need context for rule selection. So how do you encode this knowledge in our translation systems, like, say, phrase-based translation? We often use some features of the context, like, say, if the next phrase is a noun phrase, then this xiaoxin should be be aware of. Otherwise, it should be be careful not to. This knowledge can be encoded as context-sensitive features to guide our rule selection, but how do you train a system with so many features? Because you can imagine you have a lot more features. You have very rich features like what if the next noun phrase has the as its first word and has dog as the head word and blah, blah, blah. You can have millions of features like this, right? So we have to do this discriminative training with so many features, and discriminative training has been a difficult task, a central problem in machine translation. It started with MERT more than 10 years ago, and then Percy Liang used the standard perceptron to train on the training set, which is much larger than a dev set, which I think is a good direction, but it failed miserably. 
It didn't work out. But then people completely abandoned this line of work and switched back to the dev set, and you have MIRA, you have PRO, you have HOLS, you have many others. You have Michel Galley's work on regularized MERT and other variations of MERT, which work better than MERT, for sure, but they're all trained on the dev set. And if you just train it on the dev set, you cannot afford to have many features, because the dev set is really too small, like 1,000 sentences. If you see a feature combination only once on these 1,000 sentences, it's very unlikely that you'll see it again on the test set. So you've got a data sparseness problem here, so really we should get back to that direction, to train it on the training set, so that you can have millions of features and so on and so forth. But it's been so hard that nobody was following up on that line. So finally, after seven years, we did it successfully, using a different version, a kind of specialized perceptron, which is designed for problems with heavy search errors, because MT is all about search errors. In search, like phrase-based translation and syntax-based decoding, the search space is just humongous. And you have to use beam search, you have to use pruning, you have to use a lot of approximate search methods to make it tractable, but those methods unfortunately introduce a lot of search errors. And our learning algorithms, like the perceptron, blah, blah, blah, don't deal with search errors that well. They are not designed to handle search errors. So the reason my work succeeded is because we are the first ones to learn to accommodate search errors. In a sense, we want to live with search errors. We cannot get rid of search errors, because we use the same search, the same decoding algorithm, in both training and testing, and the same beam size, for example. So the search quality is fixed. So search for us is fixed. You cannot even improve it, because unless you increase the beam size or use a much better search algorithm, the search quality stays the same. The only thing you can do is change the learning to accommodate the search, to be robust to the search errors. So that's our contribution: we changed the learning algorithm to adapt to search errors so that we can train our stuff on something really fast but with really bad search, like phrase-based translation. Really bad search. It's almost like greedy search. Okay, that's kind of the general idea. And why does the standard perceptron not work out well? It's because, as I said, the theory is based on exact search. It assumes that your search is kind of perfect, but MT has such a huge search space, and as I said, full updates like perceptron, MIRA, PRO, they all do full updates in the sense that they always update on the full sequence. That doesn't deal with search errors. So what we should do is have some kind of mechanism to address the search errors in the middle. Question? >>: What if people had -- I mean, so you're assuming that the problem really is a huge amount of search error. There are people who have claimed that if you run with very wide beams, etc., etc., you see that in fact -- >> Liang Huang: You get better search quality. >>: But search error is not necessarily such a huge issue. >> Liang Huang: I will convince you at the end of this talk. We have statistics, we have plots to convince you that even if your beam size increases by a lot, it doesn't help. >>: What about the noisy training data? 
So one of the reasons why discriminative training might have a problem is you're trying to fit this training data, and the training data can be ridiculously bad. >> Liang Huang: Yes, so I have another kind of small method to address that problem, which is forced decoding. I will talk about that in a minute, but that's kind of a byproduct. The main idea is to address search errors, and the argument is that the original complexity of phrase-based translation, for example, is exponential, 2 to the N times N-squared, something like that. And you shrink it into linear-time beam search. You pay a huge cost. You gain the speed, but you sacrifice the search quality by introducing a huge amount of search errors. Now, doubling the beam size doesn't help that much. You make the beam size 10 times bigger; it doesn't help that much. The complexity is really 2 to the N. You can't fix it unless your beam size grows exponentially. Otherwise, if it's constant, it just doesn't help that much. That's the difference from a very easy search problem, like part-of-speech tagging. There, you increase the beam size, and you get it almost perfectly correct. Machine translation, it's just impossible. Even if you do syntax-based, where you have cubic time and you shrink it into linear time, it just doesn't help you that much. So my argument is just that you can't fix that many search errors, even if you have a very large beam, at least for phrase-based. Okay. Okay, then our whole point is we want to use some partial updates or prefix updates up to the point of search errors. Not all the way to the end, because if you just do full updates, it just doesn't address the problem in the middle, so really you should focus on somewhere in the middle, where the search is so bad. That's our intuition. And then we use forced decoding as a guidance to update towards. That we will talk about in detail in a minute. And the end result is that we scale to a very large portion of the training data, and we can use more than 20 million sparse features. I think that's the largest size in the literature in the online learning fashion, and we get more than two points of BLEU over MERT and PRO, so that's the final result. So let's see, how can I deliver that story? I'll first discuss structured classification with latent variables as a model to train MT, because in MT we have input and output in the parallel text, but we don't have the derivations annotated. How do you get from the input to the output? It's completely hidden, and that's the latent variable, and we use forced decoding to address that, so we will use phrase-based translation as an example and use forced decoding to compute latent variables. And a central piece of this talk is about how you learn to accommodate a huge amount of search errors. That's the new learning algorithm, and we use some new update strategies like early update and max-violation update, and we designed some rich feature sets to kind of learn from the data to be context sensitive, and we have experiments to come. Okay, so the whole story of structured learning, I would start with the structured perceptron, because that's by far the simplest algorithm for structured prediction. It's much simpler than CRF or structured SVM and stuff like that. So if you understand the structured perceptron, it's enough. This is extended from binary classification and the binary perceptron, so structured classification is just like you have input, and the output could be millions of classes. 
You can imagine it's still classification, but the output is so many classes. And it looks like the exact same architecture as the ordinary perceptron, except that this box is currently much harder than that box, because it used to be just two classes or 10 classes, and it's trivial. Now, you have exponentially many classes for each input, so we often use dynamic programming, CKY or whatever; like phrase-based, you have dynamic programming, but it is still too slow. It is still too slow. So what we often do is -- because this is exponentially large, we often have an inexact or approximate inference box, a bad search box, like greedy search or beam search, to replace the exact inference, which we cannot afford. And that will have a detrimental effect on the learning part, because the learning is really not designed to handle inexact search. Okay, so there are two challenges here to apply this story to MT. One is that the inference box is too slow, so we have to do approximate inference, but another problem is that the correct derivation is also hidden, which is the latent-variable part, which I'll talk about in the next slide. Okay, so how you get from the input to the output is hidden. Okay, so we have to extend the perceptron a little bit to introduce latent variables, to handle latent variables, and that's actually found in previous work by Percy Liang and other people. So let's say we have this training example in our training data: [indiscernible] is the Chinese input, and this is the man bit the dog. And at training time, during online learning, the perceptron, you try to decode or translate this input using your current model. And in the full search space, you find the highest-scoring derivation according to your current model, which will lead to a translation, the dog bit the man, which is different from the reference translation, and you realize that, oh, I made a mistake. I should update. And how do you update? Well, normally speaking, you have something to update towards, the positive signal here, but currently you don't have a positive signal, because you don't know the derivation. Now, what do you do? Because there are millions of ways that you can translate this input to this output. Which one should you prefer? Well, the simplest thing to do is to prefer the one that is scored highest by the current model. So you do a forced decoding, and that space is a much smaller subset of the original full space, and every single derivation in the forced-decoding subspace produces exactly the same reference translation, and you just search for the one whose score is highest according to the current model, so that's our positive signal. So we just do an update to reward that derivation, the positive derivation, and penalize the wrong derivation. So that's just reward correct and penalize wrong. That's just like the normal perceptron, except that for this part, you have to do another decoding, which is called forced decoding, which is on a much smaller space, and the other decoding is the original decoding or unconstrained decoding, the real translation decoding. Okay, so that's the main idea. The problem, though, is that we cannot afford to do a full search. We can only do a beam search here, which is very narrow, and very likely the correct translation, the highest-scoring correct derivation, falls off the beam very early on, very easily. In that case, if we just update that way, it just doesn't work. 
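To make the update just described concrete, here is a minimal sketch in Python of one latent-variable perceptron update with forced decoding. The decode, force_decode, and features hooks are assumed interfaces for illustration only, not the actual decoder code from the talk.

```python
def latent_perceptron_update(w, x, y_ref, decode, force_decode, features):
    """One online update of the latent-variable perceptron (full-update variant).

    w            -- dict of feature weights (the current model)
    x, y_ref     -- source sentence and reference translation
    decode       -- decoder over the full space; returns the model's
                    highest-scoring derivation (assumed interface)
    force_decode -- decoder constrained to derivations yielding exactly y_ref,
                    or None if the sentence is not reachable (assumed interface)
    features     -- maps a derivation to a dict of feature counts
    """
    d_hat = decode(x, w)                      # model's best guess
    if d_hat.translation == y_ref:
        return w                              # already correct: no update
    d_plus = force_decode(x, y_ref, w)        # best correct derivation = latent variable
    if d_plus is None:
        return w                              # unreachable sentence: skip it
    for f, v in features(d_plus).items():     # reward the correct derivation
        w[f] = w.get(f, 0.0) + v
    for f, v in features(d_hat).items():      # penalize the wrong one
        w[f] = w.get(f, 0.0) - v
    return w
```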
So because there are search errors here, you should really address the search errors here, so that the next time the model would guide the search so that the blue guy, the blue dotted derivation, would not fall off that early, would actually survive the search much longer. So, likely, if the blue derivation survives the whole search, then you would be able to produce the correct translation. So that's the whole idea, to address the problem of the beam search and the errors introduced by beam search, and how would you do it? You cannot do a full update. You cannot do the standard perceptron, so we have to do something new. But before I talk about that learning part, I'll first give you a kind of brief intro to phrase-based translation. Sure. >>: Do we have such errors in forced decoding, as well? >> Liang Huang: You could, you could, but right now we do an exact search for forced decoding, because the search space is much smaller, because it's constrained to produce the exact output. So for now, I think we don't need much pruning here. Maybe a little bit, a tiny little bit, but over here, in the regular decoding, you need tons of pruning. >>: Do you leave one out in your forced decoding space? >> Liang Huang: You could. You could just use the rules extracted from other sentences. We do that for small data sets. For large data sets, it's not that important, so we just leave out the one-count rules. Otherwise, it would memorize the sentence too easily. But we could have some search errors here. I think that's okay. That's okay, as well. Very good question, so any other questions? Good. Now, I will do a brief intro to the search in phrase-based decoding, which for this audience, I don't really need to do these slides, but I just quickly want to go through it. So in phrase-based decoding, you have states like this, which just say no words are covered, and then you can cover the first word, and then you can cover these last three words and jump around, and that's why it's 2 to the N in exact search. At least the state space is 2 to the N, right? It's just like the traveling salesman problem. You have to cover each word once and only once, and so on and so forth. And that's one derivation, and you can have other derivations, and you have a graph, and there are many paths from the beginning to the end. That's why it's exponential. Now, that's not the full story. The full story has the language model in it, so you have to split each state in the original space by adding the last word being translated, if you have a bigram model, so Bush -- and these three states used to be the same state. Now, they are three different states, because they have different last words being translated, so that you can add the language model cost when they extend to the full translation and so on and so forth. So you have a lot more states than before, after you introduce the language model. The space gets even bigger, but either way, it's at least exponential. So to make it tractable, we use beam search to make it linear time, and that's why it has a huge amount of search errors. So at each step, you allow, say, five guys. The beam size may be five or 10, and all of these guys are covering one word, all of these guys are covering two words and so on and so forth. That's what we're doing in practice. Now, from decoding to forced decoding, you are basically trying to say, what if I have a much smaller space, constrained by the requirement that you have to produce the exact output? 
Now, it's much easier to search, because you can suddenly prune away all these guys that violate this constraint, so you can only have talks here. If you have meeting or talk, which is not found in the reference, you just completely delete them. Then you only have one derivation in this space, but you could have millions of derivations in the forced-decoding space, so actually we store a small lattice. So the full space is a big lattice. Now, the constrained version, forced decoding, is a much smaller lattice, but it's still a lattice, so you still have exponentially many paths, like Bush held talks, or Bush held talks in one rule and so on and so forth. You still have millions of paths in the forced-decoding lattice. Okay, this is assuming that you can produce the exact output. What if you cannot? What if you don't have even one derivation that can produce the output? That actually happens a lot, so I give you this example. Phrase-based translation has a distortion limit, which says you cannot jump too far. You cannot jump more than four steps in one jump, so that's to make it tractable. So this sentence pair is perfect. It's like United Nations sent 50 observers to monitor, but then there is a big jump -- to monitor what? To monitor the first election, which is mentioned last on the Chinese side, and then you jump back and stuff like that, but this jump is too far. It's five steps, five words, so it's disallowed by our distortion limit of four, for example, and this whole sentence is not reachable, is not reproducible, in the sense that we cannot have even one derivation that can produce the output. What can we do? But this sentence is really a perfect, perfect translation. It's not a bad translation or whatever. This is a very literal translation, so it's a shame that we cannot use the whole pair. For now, we have a hack where we just use the prefix, which is perfectly fine, a prefix pair, United Nations sent 50 observers to monitor, but not the full sentence. That helps a little bit. That can recover some of the data. Right. Okay, so here are the statistics for how many sentences are reachable, or in other words have at least one derivation that is correct. It turns out that it is not that many. For the majority of the Chinese-English data set, most sentences cannot be reproduced, are not reachable. It depends, of course, on the distortion limit, so if the distortion limit is zero, which means that you have to do monotone translation, then you get only very short sentences covered. If you increase it to six, which is what we use in our experiments, you have about 20% maybe, and this ratio drops very quickly as sentences get longer and longer. Chris? >>: And the reason that you can't reach sentences even with unlimited distortion is because you also have a limit on your phrase length? >> Liang Huang: That's right, that's right. >>: You can also do this plot with -- >> Liang Huang: Phrase-length limits. That's a very good point, yes. Because if your alignment is wrong and you have garbage-collection kind of behavior, then you have to extract huge phrase pairs, which is disallowed by the phrase-length limit. And so a lot of sentences are actually not reachable, even if you have unlimited distortion. So this curve, we cannot afford to run it for any longer sentences, because it's too slow, and then we would have to use, like you said, beam search, even for forced decoding. 
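The two constraints just discussed can be written down very compactly. The sketch below assumes the usual phrase-based setup in which the English output is built strictly left to right, so forced decoding only needs to check that each new phrase keeps the output a prefix of the reference, plus a Moses-style distortion check; the helper names are illustrative, not the real decoder.

```python
def extends_reference(partial_output, phrase_target, reference):
    """Forced-decoding constraint: the English words produced so far, plus this
    candidate phrase's target side, must still be a prefix of the reference."""
    candidate = partial_output + phrase_target          # both are lists of words
    return reference[:len(candidate)] == candidate

def within_distortion_limit(prev_phrase_end, next_phrase_start, d_limit):
    """Distortion constraint: the jump from the end of the last translated
    source phrase to the start of the next one is bounded by d_limit."""
    return abs(next_phrase_start - prev_phrase_end - 1) <= d_limit

# With d_limit = 4, a five-word jump like the one in the "United Nations sent
# 50 observers to monitor ..." example is rejected, which is exactly why that
# perfectly literal sentence pair is unreachable for the phrase-based system.
```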
For now, because our distortion limit is a small constant, it's pretty fast, so we don't need to bother. But it's a shame that for longer sentences, longer than 30 words, the vast majority of sentences are not reachable. It's mostly the short ones. But we also argue that forced decoding has a byproduct, that we can use it as a kind of data selection module, in the sense that we prune away those noisy, non-literal translations -- those translations that have part of the English side not mentioned on the Chinese or part of the Chinese not mentioned on the English, or just wrong translations, or just kind of noisy translations -- they would just be gone by this kind of pruning, filtering. So those that remain, that survived the test, are often ones you can argue are easy to translate, easier to translate. But they are also more faithful, in a sense. Yes, question. >>: So how many of them have survived? >> Liang Huang: Yes, how many have survived? So let's see, how many have survived? If you have a small data set, then full-sentence reachability is only 20%. With a larger data set, it's about one-third. >>: Is this after significant pruning? You mentioned you do significant pruning. >> Liang Huang: Yes, it's after the significant pruning, I think. But the reason why it's much worse on the small data set than on the larger data set is because the word alignments are also trained on the small and large data, respectively. The word alignment quality is much worse on the small data set, and when you get larger data, word alignment is improved. But, anyway, only about one-third of the sentences are fully reachable, and they are short sentences. And because they are relatively short sentences, the number of words is actually much smaller: although it's one-third of the sentences, they only represent 13% of the words. So we added back the prefixes, the prefix pairs like this. There are some prefix pairs for those unreachable but partially reachable sentences. Then we can recover a lot more, so we have about one-third of the words used in the training. So, finally, we use this part of the training data, only about one-third of the words in the training data. Yes? >>: So can this filtering also severely distort the reordering model? >> Liang Huang: Yes, exactly. That's a very good point. It would most often just favor short distortions, more like monotone translation, because those sentences cannot have anything very long-distance, right? So that's the bigger point. Most of the translations we select will be very easy to translate in the sense that they are more like monotone. So for a perfect translation example like this one, which is a really, really good translation pair, we just cannot afford it. That's a shame. But if we use Hiero or other syntax-based methods, it's a perfect example. It's a textbook example for Hiero and this kind of behavior, a really textbook example. And I think maybe Michel's other work on phrase-based translation with syntactic distortion or something like that, with those shift-reduce-style jumps, can handle this sentence, but I'm not sure. But I think it's better than just a distortion limit. The distortion limit is just too crude. Long-distance reorderings are so common between English and Chinese, but not between Spanish and English. So if you look at this curve for Spanish-English, it's very different. It's very interesting. 
So they are not sensitive to the distortion limit, so even if the distortion limit is zero, it's still very good, not too much different from a distortion limit of six, because these two languages really have almost the same word order, except for local swapping, and local swapping is handled by the phrases themselves. So you can have reorderings within phrases, but you don't really need long-distance reorderings between large phrases. And at the 20 Years of MT workshop, I think Peter Brown said the reason why IBM's models succeeded was because French and English are basically the same language. And that is true, I think, for Spanish and English. So, really, you don't see too much interesting stuff going on, unlike Chinese, and that's why Chinese is a much more interesting language to work on, and we even tried it on Japanese. And could you guys guess how much the reachability is for Japanese? It's so low. It's worse than 10% or 5%. We could not even use it, so we ended up not reporting those results, but that's interesting. For Spanish, it's like this. Anyway, it's more than 50% covered for Spanish. So this is how many derivations there are on average for each sentence, if they are reachable, depending on the distortion limit, but either way it's exponential, so it's not just a few; actually there are huge numbers of derivations, of correct derivations. These are the latent variables, but they are packed in the lattice, basically. So if you just use an N-best list, I guess it doesn't work that well, because you really have just too many possible derivations. Okay, any other questions before we move on to the learning part? So here's the central part of this talk: how we can not fix search errors, but accommodate search errors, because you cannot really improve the search quality, in my assumptions. You can only learn to live with search errors, to kind of compensate for the search errors or reduce the bad effects of search errors. Okay, so let's look back at this picture. So that's how we do the updates. The problem is the correct translation, the correct derivation, falls off the beam very early on, so if you just do a full-sequence update, it doesn't work very well. That's well known. That's why Percy Liang's work didn't work out well. That's the main reason, I think, and we have data to support that. And now, the search errors could be like the gold derivations falling off the beam. For example, this is the gold derivation lattice, and you can imagine that somewhere here, the gold state, the correct state like this state, falls off the beam. They didn't make it into the top four, because their model score is not good enough. In a sense, the model has a problem here: it should have had this guy survive in the search. It should have scored this guy higher up. And another possibility is that this guy, this state, is merged with an equivalent state, which has exactly the same signature but a different derivation. But that is the wrong derivation, and our correct derivation is merged away, so that's another case of search error, and so on and so forth. So in a sense we should address the search errors in the middle of the search by some other update method, by some prefix update method, rather than wait until the very end. If you wait until the very end, you don't see the signal of where the problem is. Okay, so for fixing search errors, we have two methods. 
One is a relatively old method called early update. It works, but it doesn't work that well. So the idea is very intuitive. You have a beam, and let's forget about latent variables for now. Let's assume there is only one correct derivation. Let's say we are doing part-of-speech tagging or parsing, where there is a unique answer, a unique derivation. Then, what if the correct derivation falls off the beam at, say, step seven? Now what do you do? You lost the positive signal, so what previous people, Collins and Roark, said is that you stop and update right here and forget about the rest of the sentence. Just update on the prefix. So that's called an early update, and why does early update work? So actually, early update does work for incremental parsing and other beam-search tasks. Most of the incremental parsing papers following that paper use early update, and it was found to be much better than the standard update, but why? Do you guys know why it works? Actually, nobody knew why it works, and I proved why it works in one of my earlier papers. Two years ago, I found a notion called violation. Then I can prove that early update guarantees that each update is a violation. A violation basically means that the correct prefix scores lower than an incorrect prefix, which should not happen; in a perfect model, the correct prefix should score better than anything incorrect, so that's our separability assumption, and anything violating that is a violation. Early update makes sure that each update is on a violation, whereas the standard update does not guarantee that, because it's very likely that, at the end of the search, the correct derivation actually scores higher than anything in the beam, although in the middle of the search it doesn't survive. In that sense, the model as a whole does prefer the correct one, so if you had exact search, the model would return the correct derivation. It just doesn't survive the beam if your beam is too small. So who is to blame, search or model? Is it model error or search error? In a sense, the model is correct if you have perfect search, but our notion of model error is dependent on search. It's search-specific model error, because you have to live with this particular search quality in both training and testing, so if you make a mistake, it's really still kind of being misled by the model. The model should lead the search so that the correct guy doesn't fall off and survives all the way to the end. So in a sense, it's still the model's problem, and you should fix the model to guide the search. Although I know the search is really bad, and it will stay as bad as it is, still, your model should guide it toward something as good as you can get, so that's our intuition. So I proved that, as long as each update is on a violation, then you still have convergence. We have the same theorem, the same guarantee, the same generalization bounds as the perceptron, like bounds on the number of updates and stuff like that. This is intuitive, because the model score points up, which in this case means this is the best one in the beam at each step, and if you fall off the beam, it's because you scored lower. And each update should point downwards, because you made a mistake and you should have the negative feedback to pull you back to fix the problem. So early updates are correct; the full update can be wrong. It's not always wrong, but it is likely to be wrong if this guy goes up. 
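For the simple, fully supervised case just described (a unique gold derivation, as in tagging or parsing), early update can be sketched as below; the latent-variable extension discussed next just replaces the single gold prefix with the best correct prefix surviving in the forced-decoding lattice. The init, expand, features, and score hooks are assumed for illustration.

```python
def early_update(w, gold_prefixes, init, expand, features, score, beam_size):
    """One training example with early update (Collins and Roark, 2004):
    run beam search in lockstep with the gold derivation, and as soon as the
    gold prefix falls off the beam, update on the prefix pair and stop.

    gold_prefixes -- the gold derivation cut into prefixes, one per decoding step
    init, expand  -- assumed hooks building the start hypothesis / its extensions
    features      -- maps a (partial) derivation to a dict of feature counts
    score         -- model score of a hypothesis under weights w
    """
    beam = [init()]
    for gold_prefix in gold_prefixes:
        candidates = [h for hyp in beam for h in expand(hyp)]
        beam = sorted(candidates, key=lambda h: score(w, h), reverse=True)[:beam_size]
        if gold_prefix not in beam:                      # gold fell off: a search error
            perceptron_update(w, plus=features(gold_prefix), minus=features(beam[0]))
            return w                                     # forget the rest of the sentence
    if beam[0] != gold_prefixes[-1]:                     # survived, but not ranked first
        perceptron_update(w, plus=features(gold_prefixes[-1]), minus=features(beam[0]))
    return w

def perceptron_update(w, plus, minus):
    """Reward the correct features, penalize the incorrect ones."""
    for f, v in plus.items():
        w[f] = w.get(f, 0.0) + v
    for f, v in minus.items():
        w[f] = w.get(f, 0.0) - v
```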
And our statistics will show that this actually happens a lot of the time. Nobody actually took the pain to really see how many times you have this situation, but for us, it was true that more than half the time you have this situation. That's why if you just do the normal perceptron, the standard perceptron, it just doesn't work out of the box. You have to treat the search as a white box instead of as a black box. So that's my point. Search and learning should be mingled together. Chris? >>: I'm curious. At some point, are you going to describe how this relates to the Searn work? >> Liang Huang: Yes, Searn and LaSO. >>: Because it seems like the intuitions are really similar, although the details of the algorithm may be substantially different. >> Liang Huang: Yes, LaSO I'll show as a special instance of this framework. Searn, I'm still not quite sure about. LaSO, I'm pretty sure. LaSO is the precursor of Searn. So this is good, but how does it extend to latent variables? That's our first question. Because MT has latent variables, you don't just have a unique correct derivation; you have many correct derivations. Now you can do something like this: if you have a lattice, imagine you have many, many correct derivations. I just draw two as an example. Then one of them falls off the beam very early on, or one stays in the beam, but at some step, say step 10, everybody falls off the beam. So at this point, you can be sure that there's no way to recover a correct translation. You can be sure that it's already impossible to reach the reference. At this point, just say stop and make an update, because that's where my hope drops, and so that's the new, extended definition of early update. You can still prove each update is guaranteed to be a violation, and so on and so forth, and you stop decoding and forget the rest of the sentence. Right, okay, so early update works okay. It works okay. It's much better than the standard update. It had just not been applied to translation; it has been applied to parsing, mostly. But it has a big problem: it learns very slowly. It converges much more slowly than the standard update. It converges higher, but it converges much more slowly, and it is intuitive why that is the case, because it only updates on a prefix, a very small prefix. You get the first word wrong, you stop. You get the third action wrong, you stop, in a sense. You just skip too much. You do not take advantage of the rest of the sentence. So the updates are relatively short, and that's why you need more iterations to learn the same amount of stuff. So I proposed, in a previous work of mine, another update method called max-violation, which is also very intuitive, and the idea is to update at the place where the mistake is the biggest. And the mistake is defined as the amount of violation. Basically, at each step, say five or 10, take the difference between the best correct derivation and the best incorrect derivation in the beam; if that difference is the maximum across all steps, then that place is the maximum-violation place to update at. It is intuitively the largest amount of violation or the worst mistake. If you have to fix one mistake, fix the worst mistake or the biggest mistake, so that's called max-violation. And it is always to the right of early update, often very much to the right; roughly speaking, in practice, I found it at mostly 70% of the sentence, and then gradually it would increase to 80%, 90%, but not 100%. 
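A minimal sketch of the max-violation update in the latent-variable setting: forced decoding and regular beam search are run side by side, and the update is made at the single step where the score gap between the best prefix in the beam and the best correct prefix is largest. The gold_beams and search_beams inputs, and the features and score hooks, are assumed for illustration.

```python
def max_violation_update(w, gold_beams, search_beams, features, score):
    """gold_beams[i]   -- correct prefix derivations surviving forced decoding at step i
       search_beams[i] -- prefix derivations kept in the regular beam at step i
    Update once, at the step where the violation (best-in-beam score minus
    best-correct score) is maximal."""
    best_gap, best_pair = 0.0, None
    for gold_beam, beam in zip(gold_beams, search_beams):
        if not gold_beam:
            break                                        # no correct prefix survives past here
        d_plus = max(gold_beam, key=lambda d: score(w, d))   # best correct prefix
        d_minus = max(beam, key=lambda d: score(w, d))       # best prefix in the beam
        gap = score(w, d_minus) - score(w, d_plus)           # amount of violation
        if gap > best_gap:
            best_gap, best_pair = gap, (d_plus, d_minus)
    if best_pair is not None:                            # only update on a real violation
        d_plus, d_minus = best_pair
        for f, v in features(d_plus).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in features(d_minus).items():
            w[f] = w.get(f, 0.0) - v
    return w
```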
So it updates on a much longer prefix, and also because the amount of violation is bigger, you can show -- not prove, but you can see in the perceptron proof -- that mathematically it makes a lot of sense that convergence should be faster, and it is. In all of our experiments on parsing and tagging, all kinds of parsing, we found that max-violation is always more than three times faster than early update to reach the same level of accuracy. And also, if you let it run for longer, it always converges a little bit higher than early update, so it's better and faster. Mostly, the point is that it's faster, more than three times faster than early update, because for translation, it's just way too slow to train on the training set. The data is too big. If you use early update, it will cost you weeks, but if you use max-violation, you can train much faster. Okay. So, anyway, all of these are special instances of my framework called violation-fixing perceptron, which is a framework designed to handle search errors. And for these, you can prove they all converge in this framework, because they all point downwards. The updates are in the reverse direction of the current model, so the dot product is negative. But the standard update is wrong, because it can point up, in the sense that it reinforces the error instead of fixing the error. So as long as your update points down, it's going to converge. And LaSO, I can show, is a very simple special case of this framework. So you can prove a lot of theorems in this framework. You can propose a lot of other update methods, and I proposed many others, but it turns out that max-violation always works the best, consistently, over all methods. It's also very easy to define. Actually, when you have latent variables, it's a lot easier to define max-violation. Early update is actually harder to implement. Okay, anyway, that's the comparison between the non-latent-variable and latent-variable versions. Okay, but it's the same idea, just extended. So here's the roadmap of techniques. It started off from the structured perceptron. Then, on one side, people extended it to handle latent variables. On the other side, people extended it to handle inexact search. Part of that is my work, and then we just combine them in this work, the latent-variable perceptron with inexact search. That's what we do. And we tried it for phrase-based translation, and we are trying it for many other tasks, like parsing, semantics, transliteration, all kinds of stuff. And I argue that it can largely replace part of EM, because part of EM is about dealing with partially observed data, like weakly supervised learning. You have input and output but not the derivation. And this framework can largely replace that kind of application of EM. It cannot replace all applications of EM, when even the output is hidden. But if the output is known and just the derivation is hidden, then this framework has a lot of applications, because you can define all kinds of features and EM cannot. Okay, so let's get to the experiments. The features -- we have tons of rich features, but that's a relatively boring part. We have dense features, we have rule ID features, and -- the most important thing is the WordEdges features. That's basically a kind of lexicalized translation window, so let's say we are translating this rule, r2, right now, and we have already covered r1, Bush. 
So the first and last word of the Chinese side of the rule, the first and last word of the English side, and the words just outside the boundaries, like the neighboring words on the Chinese side -- that is the static information. You can use as much of it as you want, and you use all kinds of combinations of this information, all kinds of combos. Nonlocal features are more interesting. You can have rule bigrams that capture the interplay between rules, and that would be more important if you have minimal translation rules, like what you guys did -- Chris, you did, those minimal translation units. And also the current rule and the last two words generated, and all kinds of combinations of these. So actually, we used only a very small amount of nonlocal features. It's less than 0.3% of our feature instances, but it helps a lot. It helps almost a full BLEU point. The majority of our features are WordEdges, the static, local WordEdges features; 99% of our features are WordEdges features. They help a lot, but the nonlocal features also help a lot, even though they are a very tiny fraction of the features. Okay, our experiments. We have three data sets, Chinese-English small, Chinese-English large and Spanish-English large, but they are not really large. They should be called medium, because they're [indiscernible] kind of scale. We cannot afford to train on even bigger data sets, because it just takes us too long, because this is one of the first works to train really on the training set, and these are the reachability curves I already showed. On Chinese, it's not that great. On Spanish, it's a lot. Even at the sentence level, it's more than half. That's why we don't even bother to add prefixes, because it's good enough. So even though the reachability ratio is not that good, it's still a lot bigger than the dev set: for the small data, it's 10 times bigger; for the data set that we care about, it's more than 100 times bigger; for Spanish, three times bigger. So yes, we will report results on these three data sets. Sure, sure. >>: In the first case, your 30K is the original training data, and then what's reachable is a fifth of that, so that's only about 6K sentences, then? >> Liang Huang: You mean this number? >>: Yes, after 20%, so you're only left with -- oh, but you add the prefix back. 
So you can show that it's still converged, because the update still points down in the framework, but the amount of violations is too small, so that it doesn't work our way, but it's actually much better than standard update, which is consistent with what Percy Liang found. This local update should be better than standard update, but it's not too much. It learns too slowly. Our interesting curves are max-violation and early update. So max-violation is really fast and good, and this is after overfitting on the dev. This kind of BLEU curve only held out on the dev set. So early update, if you let it run for much longer time, maybe after three to five times more, it will reach somewhere close to but not as high as max-violation's highest point, in our experience, but we just could not afford to run it for any longer, because it's just way too slow, so we have to stop here. But either way, max-violation is a lot better, about two points better than MERT baseline here. >>: So I'm curious on the huge difference in the starting point on the left. Is that after one iteration? >> Liang Huang: Yes, this is first iteration's data. First iteration. It's possible that standard update peaks at something like half of an interaction. If I just took half of the data, it actually already is overfitting and it drops, so it's possible there is some peak somewhere in the half, less than one. Yes. But this kind of behavior is well justified in parsing, as well. We see this kind of behavior in parsing, as well. It's kind of shocking. If you first look at it, first of all, why perceptron update is so much worse just on first iteration, and secondly, why it drops down? It's because of this. You have tons of search errors, and most of the updates are invalid, so this curve says how many updates are indeed invalid updates, invalid updates meaning if you go up and you're reinforcing the error? How many times in standard perceptron do you make those bad updates and you're not even aware of that? Well, actually, if your beam is one, which is greedy search, then most of the time. If your beam is 30, which is basically these experiments, you still have about 60% of updates being wrong, and if people just blindly tried perceptron, they are not even aware that most of the updates are wrong, and they're reinforcing the error. You would rather just even skip them, or even that is better, but people never look at the statistics, how many updates are actually wrong. I showed you that it's more than half. And the beam size doesn't help that much, as I said. If you doubled the beam size, it's just going to be a little bit lower, just because the search base is exponential. Really, the full search base is exponential, and if you just increase the beam size, it doesn't help that much. It doesn't help that much. So there is no way to fix search errors that much, to improve search quality that much. This is all due to search errors, right? Otherwise, there was no such behavior of invalid updates. If there is exact search, there is 0%, whereas in tagging you can see this curve goes down to zero very quickly. With beam level five, you see no search errors in tagging, because the search is so simple. In parsing, it's almost like this. It's slightly better than this. In translation, the search is really hard. That's the whole point, the take-home message. 
Okay, and then we have to scale it up, so we use a parallelized perceptron, which is another paper of mine, by my student Kai, who has an offer from Carnegie to do an internship this summer. It's much faster than Ryan McDonald's iterative parameter mixing. McDonald's work doesn't have much speedup, actually, maybe two times, three times, but we have about a seven-times speedup, a sevenfold speedup, if you use 24 CPUs on one machine. And if you just use six CPUs, you have about a four-times speedup. That's a lot. That makes our work finally kind of tractable on a large data set. Otherwise, it's not even runnable. Okay, so then we compare the feature contributions. >>: How long? >> Liang Huang: How long? The final one, on the full data set, took about 30 hours, using 24 CPUs. >>: And that's on the biggest data set, like the 150? >> Liang Huang: Yes, it's like FBIS scale. It's not that big. Otherwise -- right now, we have more machines, so we can probably run more experiments, but at that time we didn't have a very big machine. So with dense features, only 11 features, you're about two points worse than MERT. Now, you add rule IDs and improve one BLEU point with just rule ID features. The most improvement comes from WordEdges, which is where most of the features actually are. You get more than two points of improvement, and then you finally beat MERT, and then the final icing on the cake is the nonlocal features, only 0.3% of the features, but they give you almost one point. >>: So why is it that you can't actually reach MERT with the dense features? >> Liang Huang: Yes, there are two reasons. That was the same question? Great. So Michael asked me the same question, actually, before this talk, and I think there are at least two reasons. First of all, the perceptron is not designed to handle those features with different scales. So often people use AdaGrad or something like a second-order perceptron, which is better at handling features with different scales. The perceptron is well known to be best with sparse features, just plain sparse features. We didn't do anything special here. We just used the perceptron. We didn't even bother using AdaGrad. Actually, after that paper, some of my students used AdaGrad. Sometimes it helps a little bit, sometimes not. It's not that consistent for us, but it often helps a little bit, especially if you have a mix of dense features and some sparse features. But it's not going to help that much. And the second reason is that we trained on the training set. This is reporting the dev set BLEU, but by training on the training set, so it's not that comparable to MERT, which is trained on the dev set. >>: Don't you think that there's potentially some difference because you've introduced a sort of loss? You're trying to optimize this single derivation. >> Liang Huang: Zero-one loss. >>: As often as possible. >> Liang Huang: Not a single derivation, but a single reference. >>: That's right, that's right. Any derivation in the class that leads to that reference. >> Liang Huang: Yes, but zero-one loss on that -- >>: But you might be better off getting this one word wrong so that you can get everything right on a subsequent sentence. >> Liang Huang: That's right. That's exactly right. >>: You've lost this original BLEU by picking the perceptron instead. Do you think there's some difference because of that loss? >> Liang Huang: Totally, totally. For simplicity, we just do zero-one loss, because you have to get it exactly right -- you cannot even be one word wrong. 
But otherwise, you have to have some sentence-level BLEU, and we don't have anything like that. We're just very clean. Right now, my postdoc is trying to work on something closer to that direction, trying to say: what if the sentence is not reachable in the first place? Because here, you assume every sentence is reachable, so you are restricted to the small subset, and what if a lot of them are not reachable? Then you just try to be as close as possible to the reference, but you don't require it to be exactly correct. >>: Going back to [Chan's] question at the beginning, by picking sentences that are reachable, have you selected a biased subset? >> Liang Huang: I think we do. >>: It would be interesting to see what your BLEU score is on the reachable dev sentences versus the full set, because maybe you're doing really, really well on those, but you're having issues with the ones that are less -- >> Liang Huang: Yes. My hypothesis is that for short sentences you actually get even more improvement, but for long sentences it doesn't help as much, because it doesn't have much signal from longer sentences. We're highly biased towards easy ones, but that's unfortunate. Another student of mine is trying it on Hiero, where the reachability is very high, because with Hiero you can do all kinds of long-distance reorderings and reachability can be more than 80% of the training data. Okay, anyway, so these are the individual contributions -- or cumulative contributions -- of features, and these are comparisons with MERT and PRO, so this is our max-violation, and MERT is not very stable. It jumps. And PRO is more stable, and it gets a little bit better with medium-scale features. And the final result on big data -- it took 47 hours? Wait, I think 47 hours is like 15 iterations, but the peak arrives around 35 hours, I think. But, anyway, it finished within two days, so it's not that bad, but we cannot afford to run it for a week, I think, with 23 million features. So if you just have MERT -- so we have two systems. Cubit, my own system in Python, is very similar to Moses. With 11 features, we get roughly the same on dev and test. With PRO, we get slightly better. PRO is slightly better, but with PRO on more features, like medium-scale features, we get a lot better. But those 3,000 features are very hard to engineer, so you have to be very careful not to be too specific, not to be too sparse, and I think to hand-engineer that feature set is extremely hard and not a general approach, but we don't engineer features at all. We just throw in all kinds of features. We don't even do any feature selection or whatever. But PRO doesn't do well with a larger number of features. That's kind of well known. >>: So why not run PRO on the entire training set? >> Liang Huang: That's too slow. Nobody has actually reported that, running PRO. >>: They have, but it took quite a lot of time. >> Liang Huang: Okay, does it work well? >>: There wasn't a lot of improvement, less than one BLEU point, as I recall. One of Stefan Riezler's students and Chris [indiscernible]. >> Liang Huang: Okay, so they did do something like an online style of PRO, a perceptron style of PRO. It's not real PRO, but perceptron-ized PRO or something like that. I think I know the paper that you mentioned. >>: But again, it wasn't my intent [indiscernible]. >> Liang Huang: Right. But this is all trained on dev. On dev, it would have quickly overfit. You would imagine that. 
And max-violation on the training set, with 23 million features, gets 2.3 points of improvement on dev and two points of improvement on test over MERT. That's considered a lot. Okay, then, sorry, questions? If you move on to Spanish -- for Spanish, you only have one reference in the standard data sets, but our improvements are, roughly speaking, one point, 1.3 and 1.1, and as a kind of common wisdom in our field, one BLEU point of improvement with one-reference BLEU is roughly equivalent to two points of improvement with four-reference BLEU. It's not exact, but roughly speaking. So our results are consistent with the Chinese improvement of two BLEU points, and the reachability ratio is a lot higher for Spanish, so we can use a much larger percentage of the data for Spanish. Okay, so to conclude, I presented a very simple, very clean method. It doesn't use any sentence-level BLEU, like hope or fear or loss-augmented decoding, anything like that. There's no loss. It's just zero-one loss, as Chris said. Very simple, and it scales to a large portion of the training set and is able to incorporate millions of features, with no need to define anything else, no learning rate or parameters to tune. It's just the perceptron with a constant learning rate of one. And no initialization of parameters -- it always starts from zero -- and a lot of improvement in BLEU over MERT and PRO. And the three most important ingredients that made it work are, first of all, most importantly, learning to accommodate search errors; the violation-fixing perceptron is designed to do that, and max-violation works the best. And then to handle the latent variable, I used the forced decoding lattice, which previous people argued is not a good idea, because it's too rigid, and some forced-decoding derivations use bad rules and just get lucky -- when you're lucky, you get the exact reference translation, but you used the wrong rules -- and that was the argument for why that earlier work didn't work out. But personally, I think that's hard to argue, because you cannot really tell which derivations are right and which derivations are wrong. As long as they produce the reference, I think they are okay. So you can only use the model to choose which one to update towards. And the reason why his work didn't work out is all because of search errors, so that's it. These two are his curves. The most important thing is, if you just use the standard perceptron, then because it doesn't deal with search errors, you just get very bad performance, because most of your updates are wrong, actually, without you even noticing that. Okay, that's the most important take-home message. And our learning framework works the best when your search has a lot of search errors, so if your search is mostly correct, like in tagging, you don't need to bother using our method. You can just use the perceptron, the standard perceptron. But if your search is so hard that it makes tons of search errors, then you have to use our method. Otherwise, it's just too bad. Okay, then also we have the parallelized perceptron to scale it up to big data. And the roadmap again: the latent-variable perceptron with inexact search, and we hope it becomes a very general technique, largely replacing EM. And questions. >>: So I thought I mentioned this earlier, but your last slide maybe had it [indiscernible]. Within the lattice, within the forced decoding lattice, you have a lot of different derivations. >> Liang Huang: That's right. 
>>: Which one are you updating towards? >> Liang Huang: If you're updating at step five, you choose the highest-scoring one at step five. If you update at step three, you update towards the highest-scoring one at step three. So it could be different ones; this one is different from this one. >>: So towards the best derivation? >> Liang Huang: Up to that point. Up to that point. >>: The highest-scoring derivation up to that point. >> Liang Huang: The highest-scoring prefix derivation up to that point. >>: Okay, and so you're not worried about this bad-phrase thing? >> Liang Huang: Yes, I just don't bother, because I cannot tell. I cannot just say, hey, this part of the derivation, don't use that. There's no way that you can select. >>: So my other question has to do with -- so you made a strong point about search error and -- >> Liang Huang: Yes. >>: So then I'm a little surprised about the fact that it's doing just as well in Spanish as compared to Chinese, because some of the other data you showed about completion statistics, etc., would seem to indicate that search error is much less of a problem in Spanish. >> Liang Huang: No, Spanish has the same thing. >>: Well, if it's mostly monotone, then if you set a reasonably medium-sized distortion limit of four or five, your search space comes down very drastically, without much impact on reachability at all. So wouldn't that imply that the search space is much smaller? The useful -- >> Liang Huang: It is useful. The interesting search space is much smaller. I think that's correct. >>: So search error is not as much of an issue. >> Liang Huang: Maybe, maybe. We didn't do an analysis on Spanish. We didn't draw these curves on Spanish, actually, but I think we tried the standard perceptron, and it just doesn't work well, either, though maybe the difference is not that big and doesn't go down that much. But you are probably right that the interesting part of the search space is smaller on the Spanish side. So if your language is mostly monotone -- if the language is totally monotone, then the complexity is actually linear time. >>: And I think people who have previously asserted that search error is not a big deal have been working with things like English and French and things like that. >> Liang Huang: Yes, that's right. Probably the case. >>: So I have one more question, about the data bias issue. Have you, or anybody, tried using MERT, but only on a data set with only reachable sentences? >> Liang Huang: Yes, that's a good point. That's a good point. So MERT is usually trained on a dev set that has four references, because on test sets you also have four references. I don't know if people have trained MERT on a training set, or on part of the training set. I don't know. I think one of my students may have tried training MERT on the reachable subset, just to do a fair comparison. But it doesn't -- because the domains of dev and test are usually very similar, but the domain of the training set is often pretty far away from dev and test. We do have a disadvantage by training on the training set. We don't use the dev set at all, except for telling us when to stop, just for preventing [indiscernible]. Most people use the dev set in a much more interesting way, but I think, from a machine learning point of view, the dev set is supposed to just be held out. You shouldn't tune your parameters on the dev set. For some reason, most of MT research has been training on the dev set -- for scalability reasons, I think.
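To make the exchange above concrete, here is a rough Python sketch of a max-violation update with a latent gold derivation: at each prefix length, compare the model's highest-scoring prefix in the beam against the highest-scoring reference-producing prefix in the forced decoding lattice, pick the step where the model's prefix wins by the largest margin, and update towards that gold prefix. This is an illustrative reconstruction under stated assumptions, not the actual Cubit code; `beam`, `gold_lattice`, and `phi` are hypothetical stand-ins for the decoder's per-step search beams, the forced decoding lattice, and feature extraction.

```python
def max_violation_update(weights, beam, gold_lattice, phi, learning_rate=1.0):
    """One max-violation perceptron update (sketch, not the actual system).
       weights         : dict mapping feature name -> weight
       beam[i]         : non-empty list of the model's prefix derivations after i steps
       gold_lattice[i] : non-empty list of reference-producing prefixes after i steps
       phi(d)          : sparse feature vector (dict) of a prefix derivation d"""
    def score(d):
        return sum(weights.get(f, 0.0) * v for f, v in phi(d).items())

    best = None                                      # (violation, gold_prefix, model_prefix)
    for i in range(1, min(len(beam), len(gold_lattice))):
        model_best = max(beam[i], key=score)         # best prefix anywhere in the beam
        gold_best = max(gold_lattice[i], key=score)  # best reference-producing prefix
        violation = score(model_best) - score(gold_best)
        if violation > 0 and (best is None or violation > best[0]):
            best = (violation, gold_best, model_best)

    if best is None:
        return False                                 # no violation at any step: no update

    _, gold_best, model_best = best
    for f, v in phi(gold_best).items():              # move towards the best gold prefix
        weights[f] = weights.get(f, 0.0) + learning_rate * v
    for f, v in phi(model_best).items():             # and away from the model's best prefix
        weights[f] = weights.get(f, 0.0) - learning_rate * v
    return True
```

Since the gold side is whichever reference-producing prefix currently scores highest under the model, the update target can differ from step to step and from iteration to iteration, which is exactly the point made above. The parallelization mentioned in the conclusion would be layered on top of such updates, for example by running them on shards and periodically combining the weight vectors (one common scheme, not necessarily the one used here).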
>>: This other potential baseline: in some systems, we actually ship a discriminative model that's trained on millions of features, but it's trained in a less sophisticated way, right? Like, what we do is we look at each component of a rule and try to optimize the likelihood of the correct translation from the training data -- build a large-scale discriminative model, just optimizing log likelihood -- and then throw that in as an additional feature. >> Liang Huang: A lot of people have tried that, yes. >>: I mean, it helps, right? >> Liang Huang: Yes, it does help. >>: And of course it's much less satisfying, but in Spanish, do you think most of your gain is coming from just learning these simple lexical features? >> Liang Huang: It might be. I should do more, like drawing these curves on Spanish, but I guess the shapes will be similar; just the differences will be kind of shrunk down on Spanish, I guess, but I don't know. But my guess is that if you just use the standard perceptron, it's not going to give that much improvement, for sure. Yes, it would be interesting to draw this curve, especially the number of invalid updates, on Spanish. It might be very different. It's a very good question. I should look into that. >>: You said you tried it on Japanese? >> Liang Huang: The reachability ratio is too low, like less than 10%, so I just cannot use much data. That's unfortunate, because the distortion is just huge in Japanese. Okay, thank you very much. >> Michael Auli: Let's just thank our speaker. >> Liang Huang: Thank you.