>> Bill Dolan: Hi, so welcome. And our speaker today is Colin Cherry, who really, really needs no introduction. I think everybody here knows him incredibly well from his years here at Microsoft. Colin got his Ph.D. from the University of Alberta in Edmonton, working with Dekang Lin, and later joined us here as a researcher for a couple years. How long were you here? >> Colin Cherry: Two. >> Bill Dolan: Two years. We all miss him desperately and miss his brilliance and are happy to have him back here briefly to talk about -- very briefly, given his chaotic trip, to talk about the work he's been doing on parsing lately. With no further ado, Colin. >> Colin Cherry: Okay. Thank you very much. I'm very pleased to be here. So I'm going to talk to you about applying some filtering techniques to dependency parsing. This is joint work with Shane Bergsma, another student of Dekang Lin's, who's now at Hopkins doing a post-doc there. So I think most people in the room remember that there was another talk that I came and gave at MSR once upon a time, for my job interview. And in that case, I had planned again to fly in and have a nice, relaxing day beforehand and then drive in in the morning to do my talk. But my flight was cancelled due to Seattle weather, and I came and kind of poked fun at you guys, because at the same time we had had much more atrocious weather in Edmonton and the airport was still running. So that was my winter outfit, which I had taken a picture of the night when I missed -- when I was supposed to be flying. So this time around, due mostly to the fact that U.S. customs hates me, I wound up staying the night in Calgary with my brother, and I was like, I need to take a funny picture -- I've got this thing now for when I miss my flight -- and so we decided, since I was in Canada's cow town, I would get in my cowboy get-up and kind of look a little perturbed there. So then I was supposed to come in, and I'd say the moral of this story, the travel lesson you should take away, is book afternoon talks, because you'll always wind up flying in in the morning. My morning flight was cancelled due to weather and I ended up flying in and missing one of -- well, missing most of the day yesterday. So the real moral of the story is, never travel. So I'm going to talk to you today about two projects involving dependency parsing, where I'm basically going to look first of all at just kind of working as hard as we can at speeding it up, by taking this idea of filtering out the head-modifier pairs that could exist in a sentence before you start parsing it. And then, for the second part of the talk, I'm going to look at it from a bit more of a machine learning perspective, and I'm going to kind of ask: if we do have a bunch of filters that are passing over the same sentence and kind of overlapping with each other, is there a way that we can get them to train jointly in order to improve their performance? So I'm going to start with the first topic. So talking about just filtering dependency parsing, basically our goal here is to speed up graph-based dependency parsing by removing implausible head-modifier pairs before parsing begins. And this was motivated by a number of things.
I've had a few people tell me, like, oh, the reason we don't use a graph-based parser is because feature extraction kills us. And we were actually trying to get a bunch of interesting semantic features for a graph-based parser, and again we went to our lexical semantics guys and said, can you get us this for all these word pairs, and they said, no, we're not going to do that, sorry. Cut it down by some significant amount, and maybe we'll go and get you those word pairs, get you some semantic information for them. So we never went back to them, because we had so much fun filtering out the word pairs after that. But we will soon. So the method is, there's going to be a preprocessing classifier that's going to actually try to make pointwise decisions about a tree in a high-precision manner, so we're not going to actually harm parsing accuracy. The result is you can remove 78% of the arcs you would otherwise have to consider in a tree before parsing even begins, and you only lose less than 1% of the ones that you would have liked to recover. So we're going to show speed-up results on two dependency parsers. So dependency parsing is probably familiar to most people here in the room. But you get an input sentence, and traditionally its part-of-speech tag sequence, and then you wind up with a tree structure here, where it's going to be individual connections between words that are going to kind of indicate dependencies or head-modifier relationships. So ate is described by the fact that it's Bob that's doing the eating. He's eating pizza. And he's doing it with his fork. And so it's kind of important to have the tree, because that way you get to know he's eating the pizza with a fork, like he's eating with a fork, as opposed to he's having a pizza with fork, like you have a pizza with pepperoni. So this is Shane's slide. My slide might have had more about the semantic features that we were going to try to get later on, but basically, our motivation was there were just a lot of these word pairs to consider, and also it's motivated by the fact that of the two competing formalisms in dependency parsing, graph-based and transition-based, graph-based does tend to be the slower option. And they do tend to make orthogonal errors. So transition-based is very fast. But if you can get graph-based up to the same speed without sacrificing any of its accuracy, then you could imagine doing some sort of combination. And furthermore, you could imagine doing a lot more interesting things if the whole process was faster and you didn't have to consider such a big problem when parsing. So to kind of further motivate this, we'll just talk about how graph-based dependency parsing works for a second. So the paper was written with arcs. There's a Coling paper on this topic, and we said arcs every time we made a connection between a head and a modifier. I hate saying that word. And I hate typing it, it turns out, too. So I'm going to, like, interchangeably say arcs, links, edges. There's only one thing we have to worry about here. It's a line from a word to a word. It's a directed link, so I really apologize. It actually switches back and forth a few times during the talk. And I'll probably not say the word that's actually on the slide most of the time.
But you can see here that the score of a tree -- so you're going to take the tree that maximizes some scoring function over all possible trees -- the score of the tree is going to be the sum of the scores of all of the edges or links or arcs in the tree, where that link-wise score is calculated as a dot product of the weight vector with some features extracted from the head and the modifier in context. So the S stands for the sentence. And then that argmax can be computed efficiently with, say, the minimum spanning tree algorithm, which is very fast, or a projective dependency parser, which is also very fast. And the kind of hidden expense here is that all versions need to compute these inner scores here for every single edge or arc or whatever in the tree that you're going to be evaluating. So every possible directed connection between words needs to be scored before parsing begins. And then the parser actually just flies over it after that feature extraction step is done. So even though I've written, like, the scoring process is technically only N squared, this F factor is actually very large. And, in fact, it's normally larger than N. There are normally more than N features. And this extraction step is a little slow, and then the dot product is fast, but it adds up. So just to kind of drive that home, say we're considering the link between ate and with. In that setting, we are going to consider all sorts of things, like just the word ate on its own, the word with on its own, all sorts of features of those, plus we're going to look at the two words together, plus we're going to look at everything that happened in between them. So I've written down 20 features here. On average, we use 60 in a high-accuracy parser. If you really want to push it to the state of the art in accuracy, you're going to get up around 120, maybe even 200, using some cluster-based features that have been advocated by Terry Koo, for example, and then you can join them with direction and distance. So it really does add up to a lot of things, just to know how likely it is to kind of connect up ate and with. And furthermore, we're going to ask it to build this same feature vector for, say, the to his. So maybe the and his are in some sort of syntactic relationship with each other. Well, there's a lot of reasons why you wouldn't want to bother with this. The isn't usually the head of anything. And this part of speech usually has its head, if it's going to have one -- well, it always has a head, but it's going to be on its right, usually, not on its left. So we propose three stages of filtering, where every filter is going to feed into the next one. So each one is going to be progressively slower, but it's going to have to do less and less work, because we're going to be knocking out links left, right and center. Each stage is a supervised SVM classifier. If you're working with Shane Bergsma, which I recommend, he's a great guy to work with, you want to work with SVMs. You can kind of just do magic with them. He's very effective at getting them to do what he wants. And then we extract our training data from decisions in the treebank. So we can kind of look at the treebank to get all of these training pairs for the SVM. The important thing that we do here is the SVM is biased at every stage to be incredibly high precision.
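To recap the setup before getting into the filters, here is the arc-factored scoring just described, written out as a sketch in standard graph-based notation (this is generic notation, not the exact formulation from the slides): S is the sentence, t ranges over candidate trees for S, (h, m) over head-modifier arcs, w is the weight vector and f the feature extractor.

```latex
\hat{t} \;=\; \operatorname*{argmax}_{t \in T(S)} \; \sum_{(h,m) \in t} \mathbf{w} \cdot \mathbf{f}(h, m, S)
```

The argmax itself is cheap (a minimum spanning tree or a projective chart parser), but every one of the roughly N^2 directed pairs needs its dot product computed first, so the practical cost is on the order of N^2 times the feature count -- and that is the part the filters attack.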
We optimize this, what we call a J-parameter, which is a per-class cost factor, in order to make sure we're making almost no mistakes that would eliminate a true link from the tree. The very first classifier that we built was -- we just kind of thought of a bunch of easy decisions we could make about a tree quickly, and then we said every decision is going to have exactly one feature, or maybe two, and those are going to be the part-of-speech tags of the two words involved, or maybe only the one word involved. The result is we get this table of rules, because the minute the SVM latches on to a part-of-speech tag at all and gives it any weight, we can just say: the minute we see that tag, we're just going to knock out the links involved. So we wind up with a rule that just says if it's the quotes, the comma, the brackets, it's just not a head. Just don't worry about it. Furthermore, because we did this with the SVM and with this high-precision bias, it's only going to pull out rules for us -- we didn't make this table ourselves -- it's only going to pull out rules that are structural zeroes. It's going to pull out things that it has seen frequently enough that it would have expected these events to happen statistically, but they didn't. So we can now say the head is never to the left for these symbols, and the head is never to the right for these ones. So you can see some of the rules are somewhat useless. Head is never to the right for the period. I mean, yes, you're always right, but there are no words to the right anyway. But not a root. You can just kind of knock out a bunch of decisions right here, and you can just fly over these, because you're going to have to eventually consider these pairs anyway. So it's still the N-squared step, but just looking at the two part-of-speech tags is super fast. So it doesn't take any time at all. >>: Assuming that the part-of-speech tagging is correct. >> Colin Cherry: Yes. Well, no, but we give it -- you can train it with noisy part-of-speech tags, so the tags are as noisy as what you're going to see at test time. >>: [inaudible]. >> Colin Cherry: Actually, this table wasn't done with that, but it gets much better if you do it with that. We've done it since then. It gets a little bigger too, actually. But that's kind of the least interesting stage of things. The most interesting stage happens in the middle. This is the linear classifier, and this is going to work on one token at a time. So we're going to make as many restrictions on the tree as we can, and we're allowed to use rich, interesting feature vectors, but we only get to look at a single token at a time. That way, we can still fly through the sentence. If we looked at two tokens at a time, we'd be pretty much back in the same situation we were in when we started. We can make a decision again that a word is never a head, that it's a leaf in the tree. This knocks out N possible links the minute you make that decision, because anything coming out of that word gets knocked out. So it's N minus one, but it's still good. Head is on the left or right. Then we kind of went crazy with the left-or-right idea. Kind of said left or right within five, because that's going to be true a lot of the time, because the links tend to be short, and immediately left or right -- that's not true a lot, but man, if we can get it, then again it knocks out a bunch of links. Same thing.
The root is notoriously hard to pick out in dependency parsing, but there are some cases, about ten percent, where it's obvious and you're safely able to say this has to be the root of the sentence. And if you do, you can rely on projectivity to set up a barrier in the sentence. You say, wherever that root is, any arc that's going to cross over it, we can cut that out. If you could do all of these decisions perfectly -- like if you got to see a decision function that said, I'm only going to make this decision if I don't hurt the tree in any way, shape or form -- you could filter 90% of the links that you would normally be evaluating before parsing begins. We're not going to get to 90, but we'll get -- anyway, you'll see the results. And then the idea that makes this whole thing pop, because feature extraction is expensive, is that we use the same feature vector for each of the eight classifiers. So not-a-root has to use the same features as not-a-head, has to use the same as head-to-the-left. You build it once, you multiply it with eight weight vectors, because dot products are fast, and you're in good shape. And then the features are kind of boring. It's just look at the tag, look at the word, look at the tags and words nearby. Look at, you know, are you near the end of the sentence, the beginning of the sentence, things like that. You can get even better results: now we have eight different things all trained independently for high precision. If you take your top three or four candidates for each one of those filters -- they're all high-precision filters, but they all kind of have different precision-recall trade-offs -- and then blow out all 6,000 or 60,000 combinations on your development set -- that's candidate parameter values to the power of the eight filters, so four parameter values each, for example, would be about 60,000, and three each is about 6,000 -- if you do that, then you'll get better results still, because you'll find out that, oh, I thought this filter was making lots of mistakes, but it turns out they were mistakes the other ones were making anyway, or something like that. So you can kind of trade them off against each other. So this is actually going to be the slide where the next portion of the talk is going to begin. We're going to take that idea even further in about ten minutes. Then finally, the last stage of filtering is this quadratic stage of filtering. We've now had rule-based, where the rules were selected by an SVM, then linear, where we fly over the tokens, then quadratic, which is size of F times N squared in its complexity. It's the whole step we were trying to avoid in the first place. So why bother? Well, we had this discussion back and forth a few times. In the end, empirically, it won out that it's still worth doing. Because it can be a light preprocessing step. You can use fewer features than you would have used. And in particular, there are these troublesome between features, which are somewhat powerful, but if you include all of them all the time, then it actually gets quite expensive. So you can include not all of those between tags at the filtering stage and still make your high-precision decisions, and pull out the heavy feature set for the cream of the crop, whatever remains to be evaluated at actual parse time.
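To make the cascade concrete, here is a minimal sketch of how the first two stages might be applied before any arc is scored. The rule table, the per-token weight vectors and the feature extractor are hypothetical stand-ins for the trained components described above, and only a few of the eight token-level decisions are shown; this is an illustration of the idea, not the released code.

```python
import numpy as np


def filter_arcs(words, tags, rules, token_weights, token_features):
    """Sketch of the first two filter stages: return the directed (head, mod)
    pairs that survive. Position 0 is an artificial root; tokens are 1-based."""
    n = len(words)
    candidates = {(h, m) for m in range(1, n + 1)
                         for h in range(0, n + 1) if h != m}

    # Stage 1: rules keyed on single part-of-speech tags (mined by the SVM),
    # e.g. "this tag is never a head", "its head is never to the left/right".
    for m in range(1, n + 1):
        if tags[m - 1] in rules["head_never_left"]:
            candidates -= {(h, m) for h in range(0, m)}
        if tags[m - 1] in rules["head_never_right"]:
            candidates -= {(h, m) for h in range(m + 1, n + 1)}
    for h in range(1, n + 1):
        if tags[h - 1] in rules["not_a_head"]:
            candidates -= {(h, m) for m in range(1, n + 1) if m != h}

    # Stage 2: one shared feature vector per token, then a handful of cheap
    # dot products against the independently trained filter weight vectors.
    for i in range(1, n + 1):
        phi = token_features(words, tags, i)            # built once per token
        fires = {name: float(np.dot(w, phi)) > 0.0
                 for name, w in token_weights.items()}
        allowed = set(range(0, n + 1)) - {i}            # candidate heads for i
        if fires.get("head_on_left", False):
            allowed &= set(range(0, i))
        if fires.get("head_on_right", False):
            allowed &= set(range(i + 1, n + 1))
        if fires.get("head_on_left_within_5", False):
            allowed &= set(range(max(0, i - 5), i))
        candidates = {(h, m) for (h, m) in candidates if m != i or h in allowed}
        if fires.get("not_a_head", False):              # token i heads nothing
            candidates = {(h, m) for (h, m) in candidates if h != i}
    return candidates
```

Whatever survives a pass like this would then go on to the quadratic filter and, finally, to full feature extraction and parsing.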
And furthermore, if you wanted to do this correctly, probably all of the features you pull out for the filter should just be cached to be used later on during the parsing process. We didn't do that here. >>: So, just curious. Did you consider treating this as a coarse-to-fine process? Like you could use the coarse scores in a coarse-to-fine search. >> Colin Cherry: No, I didn't think of that. That would have been smart. So we kind of avoided throughout this work stuff that was in any way algorithm-specific for the parsing. We kind of wanted everything to just plug right in, regardless of what your inference engine of choice was for parsing. Certainly, coarse-to-fine is conceivable with the projective algorithms, but I would have a hard time figuring it out for MST, for example. It's a good excuse. I can pull that one out a lot, for a lot of suggestions. I'd be like, oh, but does it work for minimum spanning tree? Oh, sorry. No, but it's a good idea too; the coarse-to-fine idea is good. And that was actually up here in the slide once upon a time as related technologies. So talking about filtering dependency parsing, the first thing that comes to mind for me at least is vine parsing. This is work by Jason Eisner and Noah Smith, where you have this hard cap on arc length. You just say most dependencies are short, we're going to punt on the long ones, and we're going to fly over the ones that we can get. Turns out that if you condition on the tags being linked and their direction, you actually wind up with something that looks a lot like our rules, but with distances appended to them. And that's fast and fairly effective. And there have been extensions that look to varying degrees like what we did, but nothing has ever been tested on an actual state-of-the-art dependency parser up until this point. Obviously, our linear work, I think, has been heavily inspired by this CFG cell classification work by Roark and Hollingshead, which is this way to speed up your constituency parsers. And another competing way to do this would be coarse-to-fine, of course, but it can get inference-specific. So for our experiments, we took the standard splits of the English treebank. We tagged them with the Stanford part-of-speech tagger. The results I'm going to show here are all trained on gold tags and then tested on noisy tags. Turns out that everything improves a little if you jackknife and train on noisy tags the whole way through. For just evaluating the filters, we're going to present coverage versus reduction. Coverage is like your recall of true links: how many true links were you able to come back with and have available for the parser at parse time. Reduction is how much effort you've saved it in scoring things. And then at the very end, we'll evaluate, of course, the accuracy of the parser. So just looking at the filters, you can see tag-vine actually does pretty well, if you pick a good cut-off. They have a very simple algorithm where you just kind of slowly reduce the distance cap for every possible tag pair; you always make the step that kills your recall the least, and then you cut it off at whatever your desired level of recall is. And if you do that, you can actually get a 44% reduction while maintaining 99.6 coverage. If you had to implement this in an afternoon and not, like, over the course of a research project, you could do a lot worse than this vine parser proposed by Eisner and Smith. Of course, you can do better, which is why I'm here talking to you.
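For reference, coverage and reduction as used in these comparisons can be computed directly from the arc sets; the function and variable names below are just for illustration.

```python
def coverage_and_reduction(gold_arcs, candidate_arcs, surviving_arcs):
    """gold_arcs: the true (head, mod) pairs from the treebank trees.
    candidate_arcs: every directed pair the parser would otherwise score.
    surviving_arcs: the pairs left after filtering."""
    coverage = len(gold_arcs & surviving_arcs) / len(gold_arcs)     # recall of true links
    reduction = 1.0 - len(surviving_arcs) / len(candidate_arcs)     # scoring work saved
    return coverage, reduction
```

In these terms, the tag-vine point just mentioned is roughly coverage 0.996 at reduction 0.44.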
So with the rule-based system here, the rules from the SVM, knocking out 1 percent of true links, you can reduce to 75% of the work you had before -- you can knock out 26% of the links. Then, sacrificing a little more recall with the linear stage, you can get to 50% of the effort. And then finally, you can knock out 78% with this quadratic filter. So it's actually doing a lot of the work of parsing at this stage. These are cumulative times here, so each one of these runs on top of the other. So you can see it's one second for the rules, then it's another seven added in, and then another 16 added in here. So this is winding up to be somewhat expensive. It depends on your parser whether or not this last step is really worth it. So we test this on Ryan McDonald's MST parser, which can be downloaded from the web and modified fairly easily. So we modified it to admit filters. And then there's also the, I guess we can call it the NRC parser, but for the purposes of the paper, it was called DepPerceptron. It's always made me laugh. It's like depth perception. I don't know. And that's trained with an averaged perceptron, and it just kind of uses a greatest-hits list of the features of all the people who have kind of come before. So I really tried all sorts of visualizations of time and accuracy simultaneously, and I couldn't come up with one. So here's a big slide of numbers. Here are the only numbers I want you to look at. MST-2 is probably what people are going to use; maybe they'll use MST-1. So this is second- or first-order parsing. MST-2 is the one where you kind of take a big speed hit for a little more accuracy, so I thought that would be an interesting case to visit here. And you can see here, unfortunately, with the times you kind of have to do this weird thing where you collect up all the times here and add them here. So I'll just do it for you and tell you that MST-2 goes from 12 sentences a second to 23 under these filters. And furthermore, my system was kind of developed with the filters in tandem, so I had no idea it was so slow. But we took them out and then it was crazy expensive, it turned out. But you can see here that you have this bigger impact if you're kind of really pushing on the second-order features a lot. So my system is actually missing a few key decompositions where you only use m-m pairs as opposed to h-m-m triples. So head and two children is what second order allows you to look at, instead of the head and one child. I was always looking at all three of them at once. Turns out MST, when it can, only looks at two children at a time. That's why you only see a small improvement here but a large one for here. But if you had a system that was working on those triples all the time, these filters would be golden. And certainly, Terry Koo's cluster-based features do fit that description. They have a lot of stuff that works on the triples. And finally the vine -- it's good, but we do see this slight accuracy hit happening a little early here. So it's kind of telling you that even though it's filtering less, it does matter what the character of your errors is, at what point you're kind of going to introduce that next stage of error. But I'm sure all of these errors are well within statistical significance of each other. I'm sure it's all virtually the same. So in conclusion, we presented linear- and quadratic-time filtering techniques, which can lead to speed-ups even in carefully optimized dependency parsers.
There's a negligible loss in accuracy that you pay for the speed, and the code for doing the filtering -- not the parsing -- is available on the web, and it's all trained up on the treebank already. All right. Take a little breather, a little drink. So, next topic. If we find ourselves in a situation where we have a number of filters that are overlapping with each other, can we train them together to improve them? So the goal is to improve performance with filters when they overlap. The method is going to be an old friend: the latent SVM is going to rear its head here again. It's work that Chris and I looked at, again for a kind of parsing-related project, while I was here. This is -- anyway, it will be interesting how it comes up. I think it's a better fit for this problem. And then the result is going to be, first and foremost, a principled method to optimize filter combinations. Rather than just blowing out a bunch of good-looking candidate sets on a development set and kind of doing all 60,000 combinations and seeing which one works well, this is going to be an actual learning method that's actually going to be doing something regularized and reasonable with this joint setting of a bunch of filters interacting with each other. And I am going to show some improvements in weighted F-measure for filter quality. So there are going to be a few motivating observations here. The very first one is about what we are doing with the linear filters -- the setting for the rest of this talk is that I'm just going to zoom right in on the linear filters from the previous part. What we are doing with the linear filters is we are building a token-wise classifier to speed up parsing. And so we walked over all the tokens and we made this call about whether or not each one was a head or not a head. So Bob, not a head. Ate is a head, so it's false for not-a-head. Not a head, head, head, not a head, and so on. So there's already this interesting train/test criterion mismatch, because we're training on tokens, but at the end of the day, we're going to evaluate on links. We're going to be counting how many correct links survived and how many wrong ones got filtered. So there's this mismatch happening here. So that's a little worrisome. Kind of the opportunity light is firing to some extent. There's a chance to make a development here. Furthermore, we've got this overlapping-decisions thing happening. So here we have the, and we want to get rid of the link from the to his. And there are four ways to do it. If we correctly classify this as not a head, the link is gone. We never have to worry about it again. Furthermore, any of the head-to-the-right decisions, any of the flavors of that, are going to eliminate this link, because it's going to mean the link is not on the left. And it happens to be that all three of them are true in this case. So is there any way we can leverage this redundancy and this setup in order to improve accuracy? So the evidence for number two is: earlier in this talk, I gave this slide where I kind of talked about this process where you blow everything out on the development set and try a bunch of different combinations and you get an improvement. >>: Is this some kind of meta technique where you cite evidence for your thesis by citing one of your earlier slides? Daring technique. >> Colin Cherry: Well, I mean, if you bought the earlier talk, then you kind of have to buy into the second talk.
So, I mean, you're already sitting here. [laughter]. >> Colin Cherry: Honestly, I just wanted to see one of my slides in a nice little frame. That's all it was, really. >>: [inaudible]. >> Colin Cherry: Yeah, that would be great. No, unfortunately. So the thing about this kind of joint hyperparameter optimization that we did earlier is that it does introduce link accuracy as a criterion, because that's how we evaluate the hyperparameter combinations, although it's kind of late in the game -- we've already trained a bunch of classifiers, and now we're just kind of picking between hyperparameter choices. And it does account for all the filters at once. So it kind of addresses those two problems. But we're going to try to do it better. Mostly because this process is clumsy. It's easy to say, but then when you sit down to do it, you say, oh, who is in my candidate set, and how many things should I have in the candidate set, and I guess I need to be able to do it really quickly to allow myself to check an exponential number of combinations. So it's kind of ugly. So we'll make it less ugly. The other reason I'm interested in this is it could benefit other people. In particular, other filters. So we're not the only ones doing this kind of work in this Roark and Hollingshead style of filtering. Obviously, there's Roark and Hollingshead themselves. They do constituency parsing. I can't remember what the speed-up was, but we're using it in the parser here. So it's a three-to-four-times speedup. And it's from this idea that you can tag things as either not beginning or not ending a multiword constituent. At the same time, there is this interesting paper at NAACL this year, where they do the same thing, but for multiword translation regions, and they're asking whether or not a token can begin or end a cohesive translation region. Whether or not you can apply that sort of idea depends on what sort of decoder you're using at translation time, but if you're using an ITG decoder, it kind of plugs right in right away, and they were showing a one-point BLEU increase on a strong system. So there's speed, there's accuracy. And both have two filters making overlapping decisions again. So there might be an opportunity for improvement there. Okay. So that's big enough to read. Good. The intuition of what I'm going to talk about for the rest of this talk is that we're going to try to classify links with token features. So we're going to build a classifier over links, but it still is only ever going to look at one token from that link at a time. It's never going to look at both at once. That's going to maintain, hopefully, these three desiderata. One, we're going to be able to train on link accuracy rather than token accuracy. Two, and this is kind of the big one, we need to be able to retain the computational advantages of flying over those tokens at test time. So we can take as long as we want during training. I've just declared it. You're not allowed to call me on it later. Take as long as we want during training. But at test time, we want to be able to fly again, just look at one token at a time. And finally, we want to train these filters so that they're all in there at once and they all trade off against each other. So let's look at the links from a token perspective. So here's a link we want to keep, from ate to with. That wants to survive. What could go wrong? What could kind of ruin our day here if we're doing filtering? Well, actually, a lot could. You could declare ate as not a head.
You could declare the or pizza as the root and kind of set up a barrier, or you could declare with as any of the flavors of head-to-the-right. If you do that, the link is not going to happen. So in order for this link to happen, you need this conjunction of things to happen. You need this and this and this and this and this. And this just describes the kind of comparison that each classifier is doing. It's always comparing a weight vector to a single feature vector of that token. If you look at a link that we want to eliminate -- so a link that should be filtered according to our training set -- then we get this different kind of relationship. There are a bunch of things that could go right here, rather than a bunch of things that could go wrong, and so we wind up with this or relationship. So if this guy -- if the is labeled as not a head, we're good and we're done. If this, his, is labeled as head-to-the-right in any of the flavors, then again, we're done. And so we wind up down here with an or relationship between all of these things. So any one of these can hold -- >>: [indiscernible]. >> Colin Cherry: Yeah, we're going to leave out the ones that would break other links, though, because we're kind of driving the learning process here. So there's no sense in encouraging it to -- you're right, that would get us here, but it would also get someone else that we don't want. But yes, the roots would also break this particular link. And actually, the technology kind of works if you ignore that. But for now, let's assume we're just going to give it things that are kind of safe decisions. So these are kind of our constraints on our learner if we want to handle those links correctly: we want this conjunction to hold and we want this disjunction to hold. And that kind of asymmetry is exactly what I decided I didn't like about latent SVMs. Now I love it. Because it's what I need in this situation. This sort of and relationship and or relationship here, where the and always happens to be with less-than symbols and the or always happens to be with greater-than symbols, can be summed up with a max. So if we take the max over all filters that could cut out this link, and say that the max over all the scores has to be less than zero, that's the same thing as saying that every one of them has to be less than zero. Furthermore, if we look at all the filters that can cut out a link and say the max has to be greater than zero, that's the same as saying at least one has to be greater than zero, and we don't care if there's more than one that's greater than zero. So this sort of classifier, with an inner max that's making your decision, is exactly what the latent SVM was designed to do. And so now we have a somewhat proven technology that we can just kind of apply to this problem right away. And so latent SVMs have been used originally for latent part models in image recognition, and we had latent parse trees in sentence recognition, work done with me and Chris, and latent alignments most recently with some of Dan Roth's students in paraphrase recognition, also entailment recognition and cognate recognition. So people are using this and it works. The thing that we're going to do here is that our latent structure is actually going to be kind of dull. It's just picking a filter. So it's just picking one of these filters that could hold in this or. Rather than building a whole tree to justify some decision, it's just kind of saying, I'm just going to do my max over these -- it's more of a multiclass sort of latent structure, where it's discrete over filter choice.
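Spelling out those two constraints in symbols (a sketch, using w_k for the weight vector of filter k, f_k(x) for the feature vector of the single token that filter inspects for link x, and K(x) for the filters that could safely remove x):

```latex
\text{keep } x:\quad   \max_{k \in K(x)} \mathbf{w}_k \cdot \mathbf{f}_k(x) < 0
  \;\;\Longleftrightarrow\;\; \forall\, k \in K(x):\; \mathbf{w}_k \cdot \mathbf{f}_k(x) < 0 \\
\text{filter } x:\quad \max_{k \in K(x)} \mathbf{w}_k \cdot \mathbf{f}_k(x) > 0
  \;\;\Longleftrightarrow\;\; \exists\, k \in K(x):\; \mathbf{w}_k \cdot \mathbf{f}_k(x) > 0
```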
Rather than building a whole tree to justify some decision, it's just kind of saying, I'm just going do my max over these kind of -- it's more of a multiclass sort of latent structure, where it's discrete over filter choice. So here's my math slide. I'll keep it brief. But main take-away point is it's exactly designed to work with when you're taking the sign of a max of dot products as your classifier, that is when you are -- that is what a latent SVM builds. That 15 happens to be exactly what we need here. Here's your objective. And you can see it's just a normal hinge loss with an inner max and that inner max breaks our convexity and kind of introduces the need for a more complicated algorithms. You wind up with an EM-like hill climb, which I'll talk about now. So what it winds up happening is you kind of start with filter selection, where for every link that needs to be filtered, you pick exactly one filter to handle it. That leads to this. This stands in for that choice of one filter for every link in our training set that we know needs to be filtered. Then you can strain a structural SVM on that problem, and then you get, then you get some weights. Then you can use those weights in order to pick new filters for the ones that need to be filtered. So this distinction that I'm making where I say filters for the links that need to be filtered, so the links that don't exactly exist in the tree at the end of the day, that's very important. We need to always maintain this conjunction. But fortunately, SVMs handle conjunction no problem. SVM handles these or, or these and relationships with no problem and that's exactly how a structural SVM works. If you're training a SVM, like Ben Tasker's max margin alignment system, so if you're doing something like that, then you're saying this alignment needs to score higher than this one that's wrong and this one that's wrong and this one that's wrong. And that's just a giant and. So we already know how to do the giant and. It's the or that SVMs can't handle. But if we pick one filter from the list, that eliminates the or, and we're still kind of satisfying the disjunction, because we're training for one of the things on the list. We're not just training for any of them. The problem is the one that we pick might not be optimal. That's why we iterate. We iterate to kind of try to -- that's why this loop is here, because you can smooth out some bumps in the learning surface by trying to pick the items from the disjunctions that provide less resistance toward learning a good system. If you're used to working with latent variables while learning, like say in a maximum likelihood model or something, another way to look at this is normally what you do when you're doing learning with latent variables is you have one component of your objective that looks at the completely unconstrained problem, and the other where you have to constrain it so that you know the right answer, but you still do some processing over your latent variables, given the right answer. 16 Here, we're doing the exact same thing. Just happens to be that the answers are filter or keep, you know, either filter this link or keep it. And when we know the answer is keep, we don't have to do any work. We know that the right answer is there should be no filter. Like we don't have to iterate over a bunch of choices. If the right answer is keep, then no filter is the right filter. So that's another way of looking at it. And that's actually how I wound up implementing it. 
So just an example iteration: you always have the conjunction in here, in your SVM, constraining all of these things to be true, but maybe you just pick head-to-right-one the first time. Maybe you pick it randomly from that disjunction that we had earlier. And then later on, when you have better weights, you find out that it's much safer and easier to declare ate is not a -- sorry, that the is not a head. This guy survived from copy-and-paste errors. So let's hit those desiderata again and see how we're doing. This filter selection step is doing exactly the trade-off that we wanted. A particular filter only sees a token as a training example if all of the links coming out of or going into that token are not being filtered by someone else. If it happens to be that this token is covered completely, then it's just going to kind of disappear from the objective for the other filters, for example. So if there are a bunch of strong filters that are doing a good job, then the weak filters can kind of concentrate on the holes, or vice versa. Furthermore, we incur one instance of hinge loss per link. So that means if a token is being handled incorrectly -- say it's cutting out 30 links from the tree because we're getting it wrong in the training set; that would have to be a not-a-head decision where that head had 30 children. But okay, say we're losing five links. That's more reasonable. A word can have five children. So every time you looked at each of those five different links, you would incur the same mistake. You'd say, oh, well, I'm filtering it because of that not-a-head decision. So you actually have link accuracy right in there, like a link hinge loss, right in there in your objective, and then your learned weights can still be applied to tokens at test time. It's still fast. The cons that we've picked up along the way are that the training procedure is not convex, and the training set is large. It's over links, not tokens now. So we kind of asked for that going in, but it turns out that there are a lot more links than tokens. In fact, there are 20 million links, and only 700,000 tokens. So you need a large-scale SVM in order to handle this. So we wound up actually rolling our own. Shortly before I left here, actually, I kind of discovered the primal gradient SVM and decided that was my new favorite thing, so this seemed like a good place to test it out. So it's fast, it doesn't use a lot of memory, and so we can get a decent answer, even with this 20-million problem, in about an hour. We can get a better answer in three and a half or four. So we're in pretty good shape, actually. It's a non-convex problem, so you do need to initialize to a good starting point. Fortunately, we have a whole list of starting points from the previous talk, so we just took our system -- we're not doing any sort of joint optimization here; we just kind of quickly trained each filter independently to be high precision -- and we appended all those weights together and called that the new weight vector for our big, joint system. And that became its starting point. The initial filter decisions are actually pretty good. And then it's also really important in this setting -- this is why I kept saying, oh, we have results for jackknifed part-of-speech tagging, they're just not in this paper -- because this thing falls apart if we're not doing jackknifed part-of-speech tagging.
Because all of the trade-offs between filters are being done on the training set, not on the development set. In the other setting, we were doing a bunch of trade-offs on the development set, where the part-of-speech tags were realistic. Here, we started doing trade-offs on the training set, where the part-of-speech tags are unrealistically good, and the system just came back and said, always use not-a-head. Just devote all of your resources to making not-a-head as good as possible, and don't stress over anything else. And I was like, well, but it doesn't work when you go to development. So we had to jackknife so that it knew that not-a-head was less accurate when you don't have perfect part-of-speech tags. So to jackknife, we used an in-house part-of-speech tagger and retrained it a bunch of different times on different segments. >>: [indiscernible]. >> Colin Cherry: Yeah, we retagged the training data tenfold. So each tenth is tagged with a model trained on the other nine-tenths. So finally, the thing that really slowed us down was learning with costs. In the other problem, we could have just used the cost parameter that was built into LIBLINEAR, which is our SVM package. Here, we were rolling our own SVM, so we kind of had to figure out how to do costs correctly, from the ground up. And it took a little while. Turns out that the decision to go with the primal gradient SVM was somewhat suboptimal in that setting, because it actually becomes very difficult to do class-specific cost parameters there. Fortunately, Joseph Turian has started this thing called MetaOptimize. Have you all seen this? It's a good resource where you can kind of go on and ask a question. I kind of explained my setup, my issue, and John Langford came on and was like, oh, yeah, I've got a black-box solution for costing. It doesn't matter what your learning algorithm is; you kind of re-weight your examples. Tried it out and it works, and it kind of took this thing from not working to working overnight, more or less. But that's really important, because the learner, since it's doing the trade-offs, needs to be aware of the cost of getting a true link wrong the whole way through. Because if it doesn't understand that, then it won't be making the correct trade-offs. So we test in the exact same setting as before. Again, the English treebank. Before, I kind of just showed you points in this filtering chain that were clearly better than each other. I was like, oh, I've given up a little itty bit of accuracy for a huge gain in filtering, so obviously you're going to buy in. Now, unfortunately, we're working with a machine learning method. We already had a strong baseline that we already published a paper on, so we're going to have to look at weighted F-measure to kind of know whether or not these trade-offs that we're getting are good at the end of the day. So that's what we're going to look at: in particular, a weighted F-measure where we weight the recall of true links 25 times and then 50 times more important than, you know, any gains we get by cutting out links. So for our two baselines: one, you independently tune each filter on a development set, so each of the filters gets trained and we tune their parameters to settings that look good to us on the development set. What we wind up doing is changing "good to us" to optimizing either F25 or F50 independently on the development set.
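As a reminder, the weighted F-measure here is the standard F-beta with beta set to 25 or 50. One natural reading of the setup -- my gloss, not something spelled out on the slide -- is to treat "keep" as the positive class, so R is the coverage (recall of true links) and P is the precision of the surviving arcs, which rewards filtering:

```latex
F_\beta \;=\; \frac{(1+\beta^2)\, P\, R}{\beta^2 P + R}, \qquad \beta \in \{25, 50\}
```

With beta at 25 or 50, recall is treated as beta times as important as precision, and the tables that follow report 1 minus F-beta, so lower is better.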
That makes it a little more well defined, a little more systematic. So we wind up with 16 hyperparameters, each tuned two at a time. Then, two, you can jointly tune, where you start with the four best candidates for each filter here, or the three best, and then you try all 6,000 combinations on development, again optimizing for either F25 or F50. And then we can tune our two hyperparameters. We have the same sort of things, a per-class cost parameter and a regularization parameter. So we can tune our two, optimizing on development for F25 or F50 again. And this kind of black-box costing that I alluded to earlier needs its own settings, but we didn't optimize those for F-measure. We just did ten cost-weighted samples of our training set. And that winds up being this kind of four-hour training procedure. So it's not that slow. And here are the results. Just to keep myself honest, because the table looked a lot better earlier, I decided to put in just using the rules. So this is all built on top of using that rule-based filter that we had in the earlier talk. There's no reason to throw that thing away. It's really, really precise and it cuts out 25% of the links. So we might as well start from that always. You can see that it scores very well on F25 already, and it scores very well on F50. This is one minus the scores; lower is better. This guy starts at 99.2, and when you're starting at 99.2, you need to kind of show one minus to see the differences at all. The take-away point, which you can see by glancing at this, is that it's better to do something than nothing. Immediately, doing something cuts you down. Doing linear filtering at all automatically cuts you down from three to two here, and from 0.8 to 0.6 or something here. The differences between systems are a little more difficult to spot, but this one you can see, because it's conveniently distant from the line. So you can see we are kind of getting our trajectory, and we are improving on this joint hyperparameter baseline, which is actually quite a strong baseline. I mean, it has all of our desiderata. It's just a little ugly. But we are improving on top of it. And you can kind of see here, in a little bit more fine grain, what's happening, because these F-measures don't really tell you the whole picture. Optimizing for F25 is actually a little aggressive. Turns out that that trade-off does prefer, you know, maybe a little bit more filtering and a little less recall than we might be interested in at the end of the day. But here are the trade-offs that you kind of get. And you can kind of see our winner here is 99.5 at 67, where you can see that using that quadratic filter -- which, by the way, you know, is three times more expensive in terms of time -- you do get a lot more filtering for the same coverage. But I think this is good. Like, if we were willing to accept 99.5 -- if 99.5 was good enough for getting four out of the five links out of the picture before we start, now we can get two out of three at least. Two out of three is not bad, as far as I'm concerned. And that's with the seven-second fast system, rather than the 20-second quadratic system. Over here, kind of by coincidence, optimizing for F50, we landed right near our result from the Coling paper, which is nice, because it allows you to kind of see the progression.
So these two are actually using the exact same technique, but you can just see that the benefit from jackknifing -- from getting the part-of-speech tags realistic -- is actually quite a bit larger than the benefit from using the improved learning technique. But there is still an improvement here, where we're getting higher coverage and slightly higher filtering. So I'll take it. It's also a heck of a lot less ugly. So what I'm running up against now -- I'm not considering this work finished; I'm still trying to get the numbers up a little bit higher -- is that all of the methods suffer a little bit from the fact that at extremely high levels of recall, like when you're arguing between 99.5 and 99.6 in terms of recall, you start stressing over, like, 60 links cut out of the system. And it turns out that 60 links cut out of the system actually happens to be the exact difference between the development and the test set, if you project out the prior probability of linking -- given a potential link, the probability of actually maintaining that link in the tree. In the development set, you always kind of get 60 links for free, because it just happens to have a slightly lower prior probability of linking. There are fewer links in the development set. It's mostly due to this 119-word sentence that someone put in there. So the solution to my problem may be to just knock out the 119-word sentence. But I'm trying to find ways to get sentence length in, because you can see that the prior probability of making a link changes depending on the length of the sentence. So I'm trying to find a way to get that into the model. Hopefully, just as a feature. It might have to be in there in some more creative way, so that the system is kind of aware of the fact that there's this independence assumption we're making by running it as a classifier that's not really true. I don't know if it's the independence assumption that's getting us more so than the identically distributed assumption, where basically the prior probability of linking does depend on some measurable quantity, which is sentence length. Where N is the length of your sentence, you always make one in N links. So then, before I try to publish this, I really think I should kind of hit the other big filtering problems out there and see if we get similar levels of improvement. And, in fact, the people who have done these sorts of things have never even done it with a cost factor built into their learning algorithm. They've always done post hoc thresholding of a normally learned system. Our own experiments show us that this thresholding is actually strictly worse than learning with a cost factor, at least for our setting. So we should be able to make big improvements, if not on the final speeds, at least on the filtering numbers for these two problems as well. So to sum up this portion of the talk: we proposed a principled approach to jointly optimize a number of overlapping filters. The kind of secret sauce here is that each item selects a filter as a latent variable, and we demonstrated improvements in weighted F50 measure over the system from the talk I just gave previously. So this sums up the discussion. We talked about, can we filter links? The answer is yes. And then we talked about whether or not the filters can be trained together. The answer is yes. And you do see some improvement from doing that extra work.
So a lot of kind of outside help on this, due to the kind of different networks that Shane and I bring to the table. The common link, of course, is Dekang Lin, I guess. But then again, I have to say this MetaOptimize thing was really amazing -- to have all of these people thinking about your problem for a little while -- and it saved me from having to become an expert in cost-based learning. And then the whole thing, my whole desire to visit latent SVMs at all, was due to a discussion with Ming-Wei Chang and Vivek about their own work with latent SVMs. So thank you very much. Are there any questions? Go ahead. >>: So have you tried -- I wonder what would happen if you say that all the filters need to filter something out instead of -- >> Colin Cherry: Oh, yeah, instead of picking one, just change it all to an and? Yeah. My gut reaction -- I haven't tried it, although Shane has made the same suggestion, so I should do it at this point; it's two strong data points indicating that this is something worth trying. If we change that or to a conjunction, I suspect we'll lose the trade-off. We'll wind up trading off only on regularization and not on kind of assigning work to the various filters. That may be enough to -- anyway, it's worth trying, because it's certainly a reasonable alternative. And it would work in this setting, because it's not really true that they all have to be on, but we're not upset if they're all on. And it's kind of the setting we were leaving, because when you trained them each independently, you didn't get to know, oh, maybe I don't have to handle this one, because it's only one link being missed out of a possible N that I would gain, you know, if the other N minus one are being handled by these other filters. So it loses a theoretical advantage. That doesn't mean that it's not going to work better, because it's still convex. Which is another big theoretical advantage. Go ahead. >>: Let's see. So, I mean, I have a series of questions. I'll try to restrain myself. >> Colin Cherry: No, it's fine. >>: So I guess one question is, Hollingshead and Roark -- one of the nice things about it is that, at least at a theoretical level, it drops the exponent of the procedure by one, right? Is there a strong correspondence here? I guess you've tried to be agnostic about the particular inference system used. But I don't know, can you take Eisner's algorithm and drop it to [indiscernible] on the N squared work? >> Colin Cherry: The answer is probably yes. I feel like we've done enough filtering at this stage that it's probably possible. But they had this interesting knob they could turn, which is just how much filtering they're doing in general. Which actually, I'll have to go back and read their paper and see how they turned that knob with two different classifiers, because you'd have to be adjusting two thresholds at once. Yeah, I'm not sure exactly how they wind up doing that. But they wind up accepting possible cells until they hit the point where they're at their desired level of speed. So my first thing is I'm a little more nervous about that, doing seven things instead of two things, I guess. And the other thing is I don't think anyone would notice. I think we could change the theoretical complexity of inference, like Eisner's algorithm, for example, and I don't think anyone would notice, because once you've precomputed those scores, it just flies over the sentences.
>>: Totally understood, but that 119-word sentence kind of scares me, right? Like, that's where the N to the third is really [inaudible], you know. >> Colin Cherry: Yes, that's true. >>: And I mean, it feels like there's a lot of things you could potentially do there, like try to chop the 119-word sentence into regions and minimize the arcs between those regions or something. >> Colin Cherry: It's a good point. But honestly, we did not even -- even in my own parser, that I'd written myself and understood exactly how inference was working, I did not take the time to bother saying, oh, if this link is being pruned, save that inference. We intentionally, to some extent, only hit feature extraction the whole way through. So I don't even know -- like, forget theoretical bounds, like whether or not I can take it down to N squared -- I don't even know if there's an empirical improvement, but my gut instinct is that it would be minimal. Like you say, for the longer sentences, it would matter. Go, keep going, sure. >>: So one of the things you suggested at the beginning was doing some sort of combination between the transition-based and the graph-based parsers. >> Colin Cherry: Yeah. >>: So it seems like one simple thing, and maybe this has already been done, I haven't read the paper, is to take the [indiscernible] representation of the output of the transition-based parser and re-rank it with the graph-based parser, right? >> Colin Cherry: There are a lot of papers with McDonald and [indiscernible] both as authors. I'd want to go through all of those. So this idea of the combination came up during one of Shane's trips to Google, and it was something they had been looking at. And at that point, they had kind of just written off the graph-based parser as just too much of a bottleneck to worry about. Yeah, it is interesting. I think that, anyway, there have been a lot of advances in dependency parsing in, like, the last year and a half that I'd have to go over, because I think a lot of these questions, like a packed representation of the output of a transition-based dependency parser -- like Wong's work is relevant to exactly how efficient that is -- I think it might be a lot more realistic than it was when we started. >> Colin Cherry: And then one final question, and I'll cut it off after that. >>: We can talk later -- I have to run. >> Colin Cherry: Okay. >>: So right now you're doing hard filtering. Could you do something A-star-like instead? Because you can use this as a prioritization function over links, right? Again, it means you have to dive into the inference procedure. But if you rank your links, you know. >> Colin Cherry: Yeah, no, that would make a lot of sense. So we're kind of getting into engineering concerns at that point, though, because there is a certain advantage to just flying over the ranking, the scoring, all at once, whereas there we'd be returning to scoring -- you know, we do a little inference, then a little scoring, then a little inference, then a little scoring. I wonder if we'd wipe out any improvements we saw. Is this the argument you always get when you look at A-star? I don't know. Yeah, that's my only concern there. Like, right now, it's algorithmically simple to have the scoring over here, and it just happens once, and it's just a table of numbers I look up. >>: [Inaudible]. Access a cell and if it's there, use it, and if it's not, [inaudible]. I don't know.
>> Colin Cherry: I think it's definitely valid, and at the very least, you'd hope that you'd lose -- I'm nowhere near the table, but those little drops in accuracy should go away at that stage, which would be exciting at least. So yeah. >>: You have a bunch of [indiscernible]. >> Colin Cherry: Always -- >>: So there are conflicts of [indiscernible]. Any filter is always right? So filter always wins? So it's [inaudible]. >> Colin Cherry: If any of them says filter, we filter, at any of the stages. So a link only has to be knocked out once. So if anyone -- so what are we talking about here? Are we talking about, like, the three-stage process I started with at the beginning of the talk, or are we talking about, like, the seven filters at the end? >>: The first stage. >> Colin Cherry: The first stage -- in the cascade, each filter only sees the output of the filter above it, so we're propagating errors. There's no opportunity to recover. >>: [indiscernible]. >> Colin Cherry: Right, okay, okay. So that's kind of this thing here. So I'm doing, like, ands and ors, and we could do votes. >>: [indiscernible]. >> Colin Cherry: I do not have a good answer to that question. It's kind of one of these moments -- so I think that's potentially an important insight. I'm trying to work through it here. So you have to understand how we arrived at this: we started with this and construction, you know, where any filter has to succeed in order to knock out the link. And then we kind of formalized it into looking at link-wise decisions. But we never really asked ourselves if we could change this -- I mean, the and slide literally happened two nights ago. It was like, well, I've got to present this to some people and I've got to figure out a way to characterize this relationship. Oh, it's a conjunction relationship. Okay, we're golden. Now that you see it like this, you can always ask yourself, why isn't it a plus? You know, plus would be the obvious one. Or average, or something like that. And there's no reason not to. Because at test time, we still fly over all the tokens, and then as we fly over the N squared, we still don't extract any features. We just take a quick sum of the token decisions and then make our call. It's very similar to the question that Christina asked, why don't we change the or to the and over there. Now that we see it like this, of course there's a bunch of functions we could plug in there, and I'm literally seeing it for the first time. So I guess I should try it. I guess I'd just have to figure out what to call it. >>: [inaudible]. >>: So going in the opposite direction, like if you stuck with the ands and ors there, I mean, it sort of looks to me far more like a sort of decision-tree-type problem than a linear problem. Especially in your first one, you've got this cascade of classifiers. I mean, it should be just one big decision tree, right? I mean, you should end up with basically the same result. >> Colin Cherry: Yes. Yes. >>: There may be some combinations of features you can do there, conjunctions of features that you can do there, that you can't in the linear setting. >> Colin Cherry: Yeah, the decision tree comparison has come up before.
And actually, once we start doing this latent assignment to filters, and kind of start carving up our training set to say, oh, these links are handled by this filter and these links are handled by that filter, at that point it's almost just like a decision tree, where your first decision is filter choice and your second decision is the filter result. So there's definitely a connection there. And then again, theoretically, we're about on the same legs as a decision tree, because it's still non-convex over here and whatnot. So yeah, I'd have to think a little bit more about how to set it up, but I do think there's something to be learned from that comparison. I'm just not 100% sure what it is. Thanks.