>> Bill Dolan: Hi, so welcome. And our speaker today is Colin Cherry, who really,
really needs no introduction. I think everybody here knows him incredibly well from
his years here at Microsoft. Colin got his Ph.D. from the University of Alberta in
Edmonton, where he worked with Dekang Lin, and later joined us here as a researcher for
a couple years. How long were you here?
>> Colin Cherry: Two.
>> Bill Dolan: Two years. We all miss him desperately and miss his brilliance and
are happy to have him back here briefly to talk -- very briefly, given his chaotic
trip -- about the work he's been doing on parsing lately. With no further
ado, Colin.
>> Colin Cherry: Okay. Thank you very much. I'm very pleased to be here. So I'm
going to talk to you about applying some filtering techniques to dependency parsing.
This is joint work with Shane Bergsma, another student of Dekang Lin's, who's now
at Hopkins doing a post-doc there.
So I think most people in the room remember that there was another talk that I came
and gave at MSR once upon a time for my job interview. And so in that case, I had
planned again to fly in and have a nice, relaxing day beforehand and then drive in
in the morning to do my talk. But my flight was cancelled due to Seattle weather
and I came and kind of poked fun at you guys, because at the same time we had had
much more atrocious weather in Edmonton and the airport was still running. So that
was me in my winter outfit, a picture I had taken the night when I missed -- when I
was supposed to be flying.
So this time around, due mostly to the fact that U.S. customs hates me, I wound up
staying the night in Calgary with my brother, and I was, like, I need to take a funny
picture -- I've got this thing when I miss my flight -- and so we decided, since I was
in Canada's cow town, I would get in my cowboy get-up and kind of look a little
perturbed there.
So then I was supposed to come in, and I'd say the moral of this story, the travel truism
you should take away, is book afternoon talks, because you'll always wind up flying
in in the morning. My morning flight was cancelled due to weather and I ended up
flying in and missing one of -- well, missing most of the day yesterday.
So the real moral of the story is, never travel.
So I'm going to talk to you today
about two projects involving dependency parsing, where I'm basically going to look
first of all at just kind of working as hard as we can at speeding it up by taking
this idea of kind of filtering out the head-modifier pairs that could exist in a
sentence before you start parsing it.
And then I'm going to, for a second part of the talk, I'm going to look at it from
a bit more of a machine learning perspective, and I'm going to kind of ask if we
do have a bunch of filters that are passing over the same sentence and kind of
overlapping with each other, is there a way that we can get them to train jointly
in order to improve their performance.
So I'm going to start with the first topic. So talking about just filtering
dependency parsing, basically our goal here is to speed up graph-based dependency
parsing by removing implausible head modifier pairs before parsing begins. And this
was motivated by a number of things. I've had a few people tell me, like, oh, the
reason we don't use a graph-based parser is because feature extraction kills us,
and we were actually trying to get a bunch of interesting semantic features for
a graph-based parser, and we went to our lexical semantics guys and said, can you
get us this for all these word pairs, and they said no, we're not going to do that.
Sorry. Cut it down by some significant amount, and maybe we'll go and get you
those word pairs, get you some semantic information for them.
So we never went back to them because we had so much fun filtering out the word pairs
after that. But we will soon. So the method is there's going to be a process -- a
classifier -- that's going to make point-wise decisions about a tree in a high-precision
manner, so we're not actually going to harm parsing accuracy. The result is you can
remove 78% of the arcs you would otherwise have to score before parsing even begins,
and you only lose less than 1% of the ones that you would have liked to recover.
So we're going to show speed-up results on two dependency parsers. So dependency
parsing is probably familiar to most people here in the room. But you get an input
sentence, traditionally with its part-of-speech tag sequence, and then you wind
up with a tree structure here, where it's going to be individual connections between
words that kind of indicate dependencies or head-modifier relationships.
So ate is described by the fact that it's Bob that's doing the eating. He's eating
pizza. And he's doing it with his fork. And so it's kind of important to have the
tree, because that way you get to know he's eating the pizza with a fork -- like he's
eating with a fork -- as opposed to having a pizza with a fork on it, like you have a pizza
with pepperoni.
So this is Shane's slide. My slide might have had more about the semantic features
that we were going to try to get later on, but basically, our motivation was there
were just a lot of these word pairs to consider, and also it's motivated by the fact
that of the two competing formalisms in dependency parsing, graph-based and
transition-based, graph-based does tend to be the slower option, and the two do tend
to make orthogonal errors. So transition-based is very fast, but if you can get
graph-based up to the same speed without sacrificing any of its accuracy, then you
could imagine doing some sort of combination.
And furthermore, you could imagine doing a lot more interesting things if you had
to consider -- if the whole process was faster and you didn't have to consider such
a big problem when parsing.
So to kind of further motivate this, we'll just talk about how graph-based dependency
parsing works for a second. So the paper was written with 'arcs' -- there's a Coling
paper on this topic -- and we said 'arc' every time we made a connection between a head
and a modifier. I hate saying that word, and it turns out I hate typing it too.
So I'm going to, like, interchangeably say arcs, links, edges. There's only one
thing we have to worry about here: it's a line from a word to a word, a directed
link. So I really apologize -- it actually switches back and forth a few times during
the talk, and I'll probably not say the word that's actually on the slide most of
the time.
But you can see here that the score of a tree, so you're going to take the tree that
maximizes some scoring function over all possible trees, and then the score of the
tree is going to be the sum of the scores of all of the edges or links or arcs in the
tree, where the link-wise score is calculated by this dot product of the weight vector
with some features extracted from the head and the modifier in context. So the S
stands for the sentence.
And then that argmax can be computed efficiently with, say, the minimum spanning
tree algorithm, which is very fast, or a projective dependency parser, which is also
very fast. And the kind of hidden expense here is that all versions need to kind
of compute these inner scores here for every single edge or arc or whatever in the
tree that you're going to be evaluating. So every possible connection, every
possible directed connection between words needs to be scored before parsing begins.
And then the parser actually just flies over it after that feature extraction step
is done.
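As a compact way to write the arc-factored scoring being described (the notation here is mine, not from the slides -- S is the sentence, y ranges over candidate trees, and (h, m) over head-modifier links):

```latex
\hat{y} \;=\; \operatorname*{argmax}_{y \in \mathcal{Y}(S)} \; \sum_{(h,m) \in y} \mathbf{w} \cdot \mathbf{f}(h, m, S)
```

The argmax is what MST or a projective parser computes quickly; the N-squared evaluations of w · f(h, m, S) are the hidden expense he is pointing at.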
So even though I've written, like, the scoring process is technically only N squared,
this F factor is actually very large. And, in fact, it's normally larger than N.
There's normally more than N features. And this extraction step is a little slow,
and then the dot product is fast, but it adds up.
So just to kind of drive that home, say we're considering the link between ate and
with. In that setting, then we are going to consider all sorts of things, like just
the word ate on its own, the word with on its own, all sorts of features of those,
plus we're going to look at the two words together, plus we're going to look at
everything that happened in between them.
So I've written down 20 features here. On average, we use 60 in a high accuracy
one. If you really want to push it to the state of the art in accuracy, you're going
to get up to around 120, maybe even 200, using some cluster-based features
that have been advocated by Terry Koo, for example, and then you can join them with
direction and distance. So it really does add up to a lot of things just to know
how likely it is to kind of connect up ate and with. And furthermore, we're going
to ask it to build the same feature vector for, say, 'the' to 'his' -- so, maybe 'the'
and 'his' are in some sort of syntactic relationship with each other.
Well, there's a lot of reasons why you wouldn't want to bother with this. 'The' isn't
usually the head of anything. And this possessive usually has its head, if it's going
to have one -- well, it always has a head, but it's going to be on its right, usually,
not on its left.
So we propose three stages of filtering, where every filter feeds into the next one.
Each one is going to be progressively slower, but it's going to have less and less
work to do, because we're going to be knocking out links left, right and center.
Each stage is a supervised SVM classifier. If you're working with Shane Bergsma,
which I recommend, he's a great guy to work with, you want to work with SVMs. You
can kind of just do magic with them. He's very effective at getting them to do what
he wants. And then we extract our training data from decisions in the tree bank.
So we can kind of look at the tree bank to get all of these training pairs for the
SVM.
The important thing that we do here is the SVM is biased at every stage to be
incredibly high precision. We optimize this, what we call a J-parameter, which is
a per-class cost factor, in order to make sure we're making almost no
mistakes that will eliminate a true link from the tree.
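A minimal sketch of that per-class cost idea, using scikit-learn's LinearSVC as a stand-in for the SVM they used (the candidate J values, the 0.001 threshold, and the variable names here are illustrative, not from the talk):

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_high_precision_filter(X, y, X_dev, y_dev, J_values=(2, 5, 10, 20)):
    """y = +1 means 'safe to filter', y = -1 means 'true link, must keep'.
    The J-parameter acts as a per-class cost: mistakes on true links cost J
    times more, which pushes the classifier toward very high filtering precision."""
    best = None
    for J in J_values:
        model = LinearSVC(C=1.0, class_weight={-1: float(J), 1: 1.0})
        model.fit(X, y)
        pred = model.predict(X_dev)
        true_links = (y_dev == -1)
        lost = np.mean(pred[true_links] == 1)   # fraction of true links wrongly filtered
        reduction = np.mean(pred == 1)          # overall fraction of candidates removed
        # keep the most aggressive setting whose loss of true links stays negligible
        if lost < 0.001 and (best is None or reduction > best[1]):
            best = (model, reduction, J)
    return best
```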
For the very first classifier that we built, we just kind of thought of a bunch of
easy decisions we could make about a tree quickly, and then we said every decision
is going to have exactly one feature, or maybe two, and those are going to be the
part-of-speech tags of the two words involved, or maybe only the one word involved.
The result is we get this table of rules, because the minute the SVM latches on to
a part-of-speech tag at all and gives it any weight, we can just say: the minute we
see that tag, we're going to knock out the links involved.
So we wind up with a rule that just says: the quotes, the comma, the brackets --
it's just not a head. Just don't worry about it. Furthermore, because we did this
with the SVM and with this high-precision bias, it's only going to pull out rules
for us -- we didn't make this table ourselves -- and it's only going to pull out rules
that are structural zeroes. It's going to pull out things that it's seen frequently
enough that it would have expected these events to happen statistically, but
they didn't.
So we can now say the head is never to the left for these symbols, and the head is
never to the right for these ones. You can see some of the rules are somewhat useless --
head is never to the right for the period. I mean, yes, you're always right, but
there are no words to the right anyway. But not a root. You can just kind of knock
out a bunch of decisions right here, and you can just fly over these, because you're
going to have to eventually consider these pairs anyway. So it's still the N-squared
step, but just looking at the two part-of-speech tags is super fast, so it doesn't
take any time at all.
>>: Assuming that the part of speech tagging is correct.
>> Colin Cherry: Yes. Well, no -- you can train it with noisy part-of-speech tags,
so the tags are as noisy as what you're going to see at test time.
>>: [inaudible].
>> Colin Cherry: Actually, this table wasn't done with that, but it gets much better
if you do it with that. We've done it since then. It gets a little bigger too,
actually. But that's kind of the least interesting stage of things.
The most interesting stage happens in the middle. This is the linear classifier,
and this is going to work on one token at a time. So we're going to make as many
changes to -- as many restrictions on the tree as we can, and we're going to
be allowed to use rich, interesting feature vectors, but we have to only look at a
single token at a time. That way, we can still fly through the sentence. If we
looked at two tokens at a time, we'd be pretty much back in the same situation we
were in when we started. We can again declare that a word is never a head --
that it's a leaf in the tree. This knocks out N possible links the minute you make
that decision, because anything coming out of that word gets knocked out. So it's
N minus one, but it's still good. Head is on the left or right. Then we kind of
went crazy with the left-to-right idea. We said head to the left or right within five,
because that's going to be true a lot of the time, because links tend to be short.
Immediately to the left or right -- that's not true a lot, but man, if we can get it,
then again it knocks out a bunch of links.
Same thing with the root. The root is notoriously hard to pick out in dependency
parsing, but there are some cases where it's obvious -- about ten percent -- where
you're safely able to say this has to be the root of the sentence. And if you do,
you can rely on projectivity to set up a barrier in the sentence. You say, wherever
that root is, any arc that's going to cross over it, we can cut that out.
If you could do all of these decisions perfectly -- like if you got to see an oracle
decision function that only makes a decision if it doesn't hurt the tree in any way,
shape or form -- you could filter 90% of the links that you would normally be
evaluating before parsing begins.
We're not going to get to 90, but we'll get -- anyway, you'll see the results.
And then the idea that makes this whole thing pop, because feature extraction is
expensive, is that we use the same feature vector for each of the eight classifiers.
So not-a-root has to use the same features as not-a-head, has to use the same as
head-to-the-left. You build it once, multiply it with eight weight vectors --
because dot products are fast -- and you're in good shape.
And then the features are kind of boring. It's just look at the tag, look at the
word, look at the tag and words nearby. Look at, you know, are you near the end
of the sentence, the beginning of the sentence, things like that.
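A rough sketch of that shared-feature trick (array shapes and names are mine, not from the talk): one feature vector per token, scored against all eight filter weight vectors with a single matrix product.

```python
import numpy as np

def linear_filter_pass(token_features, W, thresholds):
    """token_features: (n_tokens, n_feats) array built once per sentence.
    W: (8, n_feats) matrix, one row per filter (not-a-head, head-on-left, ...).
    thresholds: (8,) high-precision decision thresholds, one per filter.
    Returns an (n_tokens, 8) boolean matrix of fired filter decisions; each
    True knocks out a whole set of candidate links around that token."""
    scores = token_features @ W.T   # feature extraction is amortized over all 8 filters
    return scores > thresholds
```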
You can get even better results, now that we have eight different things all trained
independently for high precision. Take your top three or four candidate settings for
each one of those filters -- they're all high-precision filters, but they all have
different precision-recall trade-offs -- and then blow out all the combinations on your
development set: candidate parameter values across all eight filters, so four values
each would be about 65,000 combinations (four to the eighth), and three each would be
about 6,500 (three to the eighth).
If you do that, then you'll get better results still, because you'll find out that,
oh, I thought this filter was making lots of mistakes, but it turns out that it was
mistakes the other ones were making anyway or something like that. So you can kind
of trade them off against each other.
So this is actually going to be the slide where the next portion of the talk is going
to begin. We're going to take that idea even further in about ten minutes.
Then finally, the last stage of filtering is this quadratic stage of filtering.
We've now had rule based where the rules were selected by an SVM, then linear, where
we fly over the tokens, then quadratic, which is size-of-F times N squared in its
complexity. It's the whole step we were trying to avoid in the first place. So
why bother?
Well, we had this discussion back and forth a few times. In the end, empirically,
it won out that it's still worth doing, because it can be a light preprocessing
step. You can use fewer features than you would have used. In particular,
there are these troublesome 'between' features which are somewhat powerful, but if
you include all of them all the time, it gets quite expensive. So you can include
only some of those between tags at the filtering stage and still make your
high-precision decisions, and pull out the heavy feature set when you finally get
to the cream of the crop -- whatever remains to be evaluated at actual parse time.
And furthermore, if you wanted to do this correctly, probably all of the features
you pull out for the filter should just be cached to be used later on during the
parsing process. We didn't do that here.
>>: So just curious. Did you consider treating this as a coarse-to-fine process?
Like, you could use the coarse scores as outside scores in your search.
>> Colin Cherry: No, I didn't think of that. That would have been smart. So we
kind of avoided throughout this work stuff that was in any way algorithm specific
for the parsing. We kind of wanted everything to just plug right in, regardless
of what your inference engine of choice was for parsing. Certainly, coarse-to-fine
is conceivable with the projective algorithms, but I would have a hard time figuring
it out for MST, for example.
It's a good excuse -- I can pull that one out for a lot of suggestions. I'd
be like, oh, but does it work for minimum spanning tree? Oh, sorry. No, but it's
a good idea too; the coarse-to-fine idea is good. And that was actually up here in
the slide once upon a time as a related technology.
So talking about filtering dependency parsing, the first thing that comes to mind
for me at least is vine parsing. This is work with Jason Eisner and Noah Smith,
where you have this hard cap on arc length. You just say most dependencies are short,
we're going to punt on those and we're going to fly over the ones that we can get.
Turns out that if you condition on the tags being linked and their direction, you
actually wind up with something that looks a lot like our rules, but with distances
appended to them. And that's fast and fairly effective.
And then there have been extensions to this that look, to varying degrees, like what
we did, but nothing had ever been tested on an actual state-of-the-art dependency
parser up until this point.
Obviously, our linear work, I think, has been heavily inspired by the CFG cell
classification work by Roark and Hollingshead, which is this way to speed up your
constituency parsers. And another competing way to do this would be coarse-to-fine,
of course, but it can get inference-specific.
So for our experiments, we took the standard splits of the English treebank and
tagged them with the Stanford part-of-speech tagger. For the results I'm going to
show here, everything is trained on gold tags and then tested on noisy tags. Turns
out that everything improves a little if you jackknife and train on noisy tags the
whole way through.
For just evaluating the filters, we're going to present coverage versus reduction.
Coverage is like your recall of true links, how many true links were you able to
come back with and have available for the parser at parse time. Reduction is how
much effort you've saved it in scoring things. And then at the very end, we'll
evaluate, of course, the accuracy of the parser.
So just looking at the filters, you can see the tag-vine actually does pretty well
if you pick a good cut-off. They have a very simple algorithm where you just slowly
reduce the distance cap for every possible tag pair: you always make
the step that kills your recall the least, and then you cut it off at whatever your
desired level of recall is.
And if you do that, you can actually get a 44% reduction while maintaining 99.6%
coverage. If I had to implement this in an afternoon and not, like, over the course
of a research project, you could do a lot worse than this vine parser proposed by
Eisner and Smith.
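A sketch of that greedy cut-off procedure as I understand it from the description (the data layout and names are illustrative): repeatedly shrink the distance cap for whichever tag pair costs the least recall, and stop just before recall would fall below the target.

```python
def greedy_vine_cutoffs(link_counts, target_recall=0.996):
    """link_counts: dict mapping (head_tag, mod_tag, direction) to a list where
    index d holds the number of gold links at exactly distance d."""
    caps = {k: len(v) for k, v in link_counts.items()}   # start with no cap
    total = sum(sum(v) for v in link_counts.values())
    kept = total
    while True:
        best_key, best_loss = None, None
        for k, v in link_counts.items():                 # find the cheapest single reduction
            if caps[k] > 1:
                loss = v[caps[k] - 1]                    # gold links at the distance we'd cut
                if best_loss is None or loss < best_loss:
                    best_key, best_loss = k, loss
        if best_key is None or (kept - best_loss) / total < target_recall:
            break                                        # the next cut would hurt recall too much
        caps[best_key] -= 1
        kept -= best_loss
    return caps
```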
Of course, you can do better, which is why I'm here talking to you. With the rule-based
system here -- the rules from the SVM -- knocking out 1 percent of true links, you can
reduce to 75% of the work you had before; you knock out 26% of the links. Then,
sacrificing a little more recall with the linear stage, you can get to 50% of the
effort. And finally, you can knock out 78% with the quadratic filter, so it's
actually doing a lot of the work of parsing at this stage.
These are cumulative speeds here, so each one of these runs on top of the other.
So you can see it's one second for the rules, then it's another seven added in and
then another 16 added in here.
So this is winding up to be somewhat expensive. It depends on your parser whether
or not this last step is really worth it.
So we test this on Ryan McDonald's MST parser, which can be downloaded from the
web and modified fairly easily -- so we modified it to admit filters. And then
there's also what I guess we can call the NRC parser, but for the purposes of the
paper it was called DepPerceptron. It's always made me laugh -- it's like depth
perception. I don't know. That's trained with an averaged perceptron, and it just
kind of uses a greatest-hits list of the features of all the people who have come
before.
So I'm, I really, I tried all sorts of visualizations of time and accuracy
simultaneously, and I couldn't come up with one. So here's a big slide of numbers.
Here are the only numbers I want you to look at. MST-2 is probably what people
are going to use; maybe they'll use MST-1. So this is second- or first-order parsing.
MST-2 is the one where you kind of take a big speed hit for a little more accuracy,
so I thought that would be an interesting case to visit here. And you can see here --
unfortunately, for the times you kind of have to do this weird thing of collecting up
all the times here and adding them here, so I'll just do it for you and tell you that MST-2
goes from 12 sentences a second to 23 under these filters.
And furthermore, mine was -- my system was kind of developed with the filters in
tandem so I had no idea it was so slow. But we took them out and then it was crazy
expensive, it turned out. But you can see here that you have this bigger impact
if you -- if you're kind of really pushing on the second order features a lot. So
my system is actually missing a few key decompositions where you only use m-m pairs
as opposed to h-m-m triples. Second order allows you to look at a head and two
children instead of the head and one child; I was always looking at all three of
them at once. Turns out MST, when it can, only looks at two children at a time.
That's why you only see a small improvement here but a large one over here. But if
you had a system that was working on those triples all the time, these filters
would be golden.
And certainly, Terry Koo's cluster based features do fit that description. They
have a lot of stuff that works on the triples. And finally, the vine: it's good,
but we do see this slight accuracy hit happening a little early here. So it's
telling you that even though it's filtering less, it does matter what the character
of your errors is -- at what point you're going to introduce that next stage of error.
But I'm sure all of these differences are well within statistical significance of
each other; I'm sure it's all virtually the same.
So in conclusion, we presented linear- and quadratic-time filtering techniques, which
can lead to speed-ups even in carefully optimized dependency parsers. There's a
negligible loss in accuracy that you pay for the speed, and the code for doing the
filtering -- not the parsing -- is available on the web, and it's all trained up on
the treebank already.
All right. Take a little breather, a little drink. So next topic. If we find
ourselves in a situation where we have a number of filters that are overlapping with
each other, can we train them together to improve them?
So the goal is to improve performance with filters when they overlap. The method
is going to be an old friend, the latent SVM is going to rear its head here again.
It's work that Chris and I looked at, again for a kind of parsing related project
while I was here. This is -- anyway, it will be interesting how it comes up. I
think it's a better fit for this problem.
And then the result is going to be, first and foremost, it's going to be a principled
method to optimize filter combinations. Rather than just blowing out a bunch of
good-looking candidate sets on a development set and doing all 60,000-odd
combinations and seeing which one works well, this is going to be an
actual learning method that's actually going to be doing something regularized and
reasonable with this joint setting of a bunch of filters interacting with each other.
And I am going to show some improvements from weighted F measure for filter quality.
So there's going to be a few motivating observations here. The very first one is
that what we are doing with the linear filters -- and the setting for this part of the
talk is that I'm just going to zoom right in on the linear filters from the previous
part -- is that we are building a token-wise classifier to speed up parsing. So we
walk over all the tokens and make this call about whether or not each one is a head.
So Bob: not a head. Ate is a head, so it's false for not-a-head. Not a head, head,
head, not a head, and so on.
So there's already this interesting train/test criterion mismatch, because we're
training on tokens, but at the end of the day we're going to evaluate on links.
We're going to be counting how many correct links survived and how many wrong ones
got filtered. So there's this mismatch happening here, and that's a little
worrisome -- the opportunity light is kind of firing to some extent. There's a
chance to make an improvement here.
Furthermore, we've got this overlapping decision thing happening. So here we have
'the', and we want to get rid of the link from 'the' to 'his'. And there are four
ways to do it. If we correctly classify this as not a head, the link is gone -- we
never have to worry about it again. Furthermore, any of the head-to-the-right
decisions, any of the flavors of that, are going to eliminate this link, because
it's going to mean the head is not on the left. And it happens to be that all three
of them are true in this case.
So is there any way we can leverage this redundancy and this setup in order to improve
accuracy?
So the evidence for number two is that earlier in this talk, I gave this slide where
I talked about this process where you blow everything out on the development set
and try a bunch of different combinations and you get an improvement.
>>: Is this some kind of meta technique where you cite evidence for your thesis by
citing one of your earlier slides? Daring technique.
>> Colin Cherry: Well, I mean, if you bought the earlier talk, then you kind of
have to buy into the second talk. So, I mean, you're already sitting here.
[laughter].
>> Colin Cherry: Honestly, I just wanted to see one of my slides in a nice little
frame. That's all it was, really.
>>: [inaudible].
>> Colin Cherry: Yeah, that would be great. No, unfortunately. So the thing about
this kind of joint hyperparameter optimization that we did earlier is that it does
introduce link accuracy as a criterion, because that's how we evaluate the
hyperparameter combinations. Although it's kind of late in the game -- we've
already trained a bunch of classifiers; now we're just picking between
hyperparameter choices. And it does account for all the filters at once.
So it kind of addresses those two problems, but we're going to try to do it better --
mostly because this process is clumsy. It's easy to say, but then when you sit
down to do it, you say, oh, who is in my candidate set, and how many things should
I have in the candidate set, and I guess I need to be able to do it really quickly
to allow myself to check an exponential number of combinations. So it's kind of
ugly. So we'll make it less ugly.
The other reason I'm interested in this is that it could benefit other people -- in
particular, other filters. We're not the only ones doing this style of filtering.
Obviously, there's Roark and Hollingshead, who do constituency parsing. I can't
remember exactly what the speed-up was, but they're using it in the parser there --
it's a three-to-four-times speedup. And it's from this idea that you can tag things
as either not beginning or not ending a multiword constituent.
At the same time, there is this interesting paper at NAACL this year where they do
the same thing but for multiword translation regions: they're asking whether or not
a token can begin or end a cohesive translation region. Whether or not you can apply
that sort of idea depends on what sort of decoder you're using at translation time,
but if you're using an ITG decoder, it kind of plugs right in, and they were
showing a one-point BLEU increase on a strong system. So there's speed, there's
accuracy, and both have two filters making overlapping decisions again.
So there might be an opportunity for improvement there.
Okay. So that's big enough to read. Good. The intuition of what I'm going to talk
about for the rest of this talk is that we're going to try to classify links with
token features. So we're going to build a classifier over links, but it still is
only ever going to look at one token from that link at a time. It's never going
to look at both at once.
That's going to maintain hopefully these three desiderata. One, we're going to be
able to train on link accuracy rather than token accuracy. Two, this is kind of
the big one, we need to be able to retain the computational advantages of flying
over those tokens at test time. So we can take as long as we want during training.
I've just declared it. You're not allowed to call me on it later. Take as long
as we want during training. But at test time, we want to be able to fly again, just
look at one token at a time.
And finally, we want to train these filters so that they're all in there at once
and they all trade off against each other.
So let's look at the links from a token perspective. So here's a link we want to
keep from ate to with. That wants to survive. What could go wrong? What could
kind of ruin our day here if we're doing filtering? Well, actually, a lot could.
You could declare ate as not a head. You could declare 'the' or 'pizza' as the root
and kind of set up a barrier, or you could declare 'with' as any of the flavors of
head-to-the-right. If you do any of that, the link is not going to happen.
So in order for this link to happen, you need this conjunction of things to happen.
You need this and this and this and this and this. And this just describes the kind
of comparison that each classifier is doing. It's always comparing a weight vector
to a single feature vector of that token.
If you look at a link that we want to eliminate, so a link that should be filtered
according to our training set, then we get this different kind of relationship.
There's a bunch of things that could go right here, rather than a bunch of things
that could go wrong, and so we wind up with this 'or' relationship. So if this guy --
if 'the' is labeled as not a head, we're good and we're done.
If this, 'his', is labeled as head-to-the-right in any of the flavors, then again,
we're done. And so we wind up down here with an 'or' relationship between all of
these things. So any one of these can hold --
>>: [indiscernible].
>> Colin Cherry: Yeah, we're going to leave out the ones that will break other links,
though, because we're kind of driving the learning process here. So there's no sense
in encouraging it to -- you're right, that would get us here, but it would also get
someone else that we don't want. But yes, the roots would also break this particular
link. And actually, the technology kind of works even if you ignore that. But for
now, let's assume we're only going to give it things that are kind of safe decisions.
So these are kind of our constraints on our learner if we want to keep the right
links: we want this conjunction to hold, and we want this disjunction to hold.
And that kind of asymmetry is exactly what I decided I didn't like about latent SVMs.
Now I love it. Because it's what I need in this situation. This sort of and
relationship and or relationship here, where the and always happens to be with less
than symbols and the or always happens to be with greater than symbols, can be summed
up with a max. So if we take the max over all filters that could cut out this link,
and say that the max overall the scores has to be less than zero, that's the same
thing as saying that every one of them has to be less than zero.
Furthermore, if we look at all the filters that can cut out a link saying the max
has to be greater than zero, that's the same as saying at least one has to be greater
than zero and we don't care if there's more than one that's greater than zero.
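In symbols (this notation is mine): if F(h, m) is the set of token-level filter decisions that could remove link (h, m), each scored as a dot product of that filter's weight vector with its token's features, then the two conditions are

```latex
\text{keep } (h,m):\;\; \max_{j \in F(h,m)} \mathbf{w}_j \cdot \mathbf{f}(t_j) < 0
\qquad\qquad
\text{filter } (h,m):\;\; \max_{j \in F(h,m)} \mathbf{w}_j \cdot \mathbf{f}(t_j) > 0
```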
So this sort of classifier, with an inner max that's making your decision, is exactly
what the latent SVM was designed to do. And so now we have a somewhat proven
technology that we can just kind of apply to this problem right away. And so latent
SVMs have been used originally for latent part models in image recognition and we
had latent parse trees in sentence recognition, work done with me and Chris and latent
alignments most recently with some of Dan Roth's students in paraphrase recognition,
also entailment recognition and cognate recognition. So people are using this and
it works. The thing that we're going to do here is that our latent structure is
actually going to be kind of dull. It's just picking a filter.
So it's just picking one of these filters that could hold in this 'or'. Rather than
building a whole tree to justify some decision, it's just saying, I'm going to do my
max over these -- it's more of a multiclass sort of latent structure, where it's
discrete over filter choice.
So here's my math slide. I'll keep it brief. But the main take-away point is that
it's designed to work exactly when you're taking the sign of a max of dot products
as your classifier -- that is what a latent SVM builds, and that happens to be
exactly what we need here.
Here's the objective. You can see it's just a normal hinge loss with an inner
max, and that inner max breaks our convexity and introduces the need for a more
complicated algorithm. You wind up with an EM-like hill climb, which I'll
talk about now.
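The slide itself isn't reproduced here, but the standard latent-SVM objective he is alluding to has this shape, with y_i = +1 for links to filter, y_i = -1 for links to keep, and Z(x_i) the candidate filter decisions for link i:

```latex
\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^2 \;+\; \sum_i \max\!\Big(0,\; 1 - y_i \max_{z \in Z(x_i)} \mathbf{w}\cdot\Phi(x_i, z)\Big)
```

The inner max over z is what breaks convexity for the positive (to-be-filtered) examples.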
So what winds up happening is you start with filter selection, where for every link
that needs to be filtered, you pick exactly one filter to handle it. That leads to
this; this stands in for that choice of one filter for every link in our training
set that we know needs to be filtered.
Then you can train a structural SVM on that problem, and you get some weights. Then
you can use those weights to pick new filters for the ones that need to be filtered.
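A skeletal version of that alternation (the helper names here are placeholders, not the actual code):

```python
def train_latent_filter_svm(filtered_links, kept_links, w_init, iterations=10):
    """filtered_links: links that must be removed, each carrying several candidate
    token-level filter decisions. kept_links: gold links, for which every candidate
    decision must score below zero. Returns the learned weight vector."""
    w = w_init
    for _ in range(iterations):
        # Selection step: for each link to be filtered, commit to the single
        # candidate decision that currently scores highest under w.
        chosen = {link: max(link.candidates, key=lambda z: w.dot(z.features))
                  for link in filtered_links}
        # Training step: a structural SVM in which each chosen decision must score
        # above the margin and all decisions touching kept links must score below it.
        w = train_structural_svm(chosen, kept_links, warm_start=w)  # placeholder solver
    return w
```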
So this distinction that I'm making where I say filters for the links that need to
be filtered -- so the links that don't actually exist in the tree at the end of the
day -- that's very important. We need to always maintain this conjunction. But
fortunately, SVMs handle conjunction no problem. An SVM handles these 'and'
relationships with no problem, and that's exactly how a structural SVM works.
If you're training an SVM like Ben Taskar's max-margin alignment system -- if you're
doing something like that -- then you're saying this alignment needs to score higher
than this one that's wrong, and this one that's wrong, and this one that's wrong.
And that's just a giant 'and'. So we already know how to do the giant 'and'. It's
the 'or' that SVMs can't handle. But if we pick one filter from the list, that
eliminates the 'or', and we're still satisfying the disjunction, because we're
training for one of the things on the list. We're not just training for any of them.
The problem is the one that we pick might not be optimal. That's why we iterate.
We iterate to kind of try to -- that's why this loop is here, because you can smooth
out some bumps in the learning surface by trying to pick the items from the
disjunctions that provide less resistance toward learning a good system.
If you're used to working with latent variables while learning, like say in a maximum
likelihood model or something, another way to look at this is normally what you do
when you're doing learning with latent variables is you have one component of your
objective that looks at the completely unconstrained problem, and the other where
you have to constrain it so that you know the right answer, but you still do some
processing over your latent variables, given the right answer.
Here, we're doing the exact same thing. Just happens to be that the answers are
filter or keep, you know, either filter this link or keep it. And when we know the
answer is keep, we don't have to do any work. We know that the right answer is there
should be no filter. Like we don't have to iterate over a bunch of choices. If
the right answer is keep, then no filter is the right filter. So that's another
way of looking at it. And that's actually how I wound up implementing it.
So just an example, iteration, you always have the conjunction in here, in your SVM,
constraining all of these things to be true, but maybe you just pick head to right
one as your first time. Maybe you pick it randomly from that disjunction that we
had earlier. And then later on, when you have better weights, you find out that
it's safer and easier to declare ate is not a -- sorry, that 'the' is not a head.
This guy survived from copy-and-paste errors.
So let's hit that desiderata again and see how we're doing. This filter selection
step is doing exactly the trade-off that we wanted. A particular filter only sees
a token as a training example if all of the links coming out of or going into that
filter -- coming out of or going into that token are not being filtered by someone
else. If it happens to be that this token is covered completely, then it's just
going to kind of disappear from the objective for other filters, for example. So
if there's a bunch of strong filters that are doing a good job, then the weak filters
can kind of concentrate on the holes, or vice versa.
Furthermore, we incur one instance of hinge-loss per link. So that means if a token
is being handled incorrectly, and say it's cutting out 30 links from the tree, because
we're getting it wrong in the training set, that would be -- that would have to be
a not a head decision, where that head had 30 children.
But okay, say we're losing five links. That's more reasonable. You can have five.
A word can have five children. So every time you looked at each of those five
different links, you would incur the same mistake: you'd say, oh well, I'm filtering
it because of that not-a-head decision. So you actually, you
kind of have link accuracy right in there, like a link hinge loss, right in there
in your objective and then your learned weights can still be applied to tokens at
test time. It's still fast.
The cons that we've picked up along the way are that the training procedure is not
convex, and the training set is large -- it's over links, not tokens now. We kind
of asked for that going in, but it turns out that there are a lot more links than
tokens. In fact, there are 20 million links and only 700,000 tokens. So you need
a large-scale SVM in order to handle this.
So we wound up actually rolling our own. Shortly before I left here actually, I
kind of discovered primal gradient SVM and decided that was my new favorite thing
so this seemed like a good place to test it out. So it's fast, it doesn't use a
lot of memory and so we can get a decent answer, even with this 20 million problem,
in about an hour. We can get a better answer in three and a half or four. So we're
in pretty good shape, actually.
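For concreteness, one common form of primal (sub)gradient SVM training is the Pegasos-style update below; the talk doesn't say which exact variant they rolled, so treat this as a representative sketch rather than their implementation.

```python
import numpy as np

def pegasos_epoch(w, X, y, lam, t_start=1):
    """One stochastic pass of Pegasos-style primal subgradient descent on the
    hinge loss with L2 regularization strength lam. X: (n, d) features, y: +/-1."""
    t = t_start
    for i in np.random.permutation(len(y)):
        eta = 1.0 / (lam * t)
        if y[i] * w.dot(X[i]) < 1.0:                   # margin violated: hinge subgradient
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
        else:                                          # only the regularizer contributes
            w = (1.0 - eta * lam) * w
        t += 1
    return w, t
```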
It's a non-convex problem so you do need to initialize to a good starting point.
Fortunately, we have a whole list of starting points from the previous talk so we
just took our system. We're not doing any sort of joint optimization here. We just
kind of quickly trained each filter independently to be high precision, appended
all those weights together, and called that the new weight vector for our big,
joint system. And that became its starting point.
The initial filter decisions are actually pretty good.
And then it's also really important in this setting. This is why I kept saying,
oh, we have results for jackknife. They're just not in this paper, for jackknife
part of speech tagging, because this thing falls apart if we're not doing jackknife
part of speech tagging. Because all of the trade-offs between filters are being
done on the training set, not on the development set. In the other setting, we were
doing a bunch of trade-offs on the development set where the part of speech tags
were realistic. Here, we started doing trade-offs on the training set where the
part of speech tags are unrealistically good and the system just came back and said
always use not a head. Just devote all of your resources to making not a head as
good as possible, and don't stress over anything else, and I was like, well, but
it doesn't work when you go to development. So we had to jackknife, so that it knew
that not-a-head was less accurate when you didn't have perfect part-of-speech tags.
So to jackknife, we used an in-house part-of-speech tagger and retrained it a bunch
of different times on different segments.
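A minimal sketch of that jackknifing (the tagger interface here is a placeholder): retag each tenth of the training data with a tagger trained on the other nine-tenths, so training-time tags are as noisy as test-time tags.

```python
def jackknife_tags(sentences, train_tagger, n_folds=10):
    """sentences: list of training sentences. train_tagger: any function that
    trains a POS tagger on a list of sentences and returns an object with .tag()."""
    retagged = [None] * len(sentences)
    for k in range(n_folds):
        train = [s for i, s in enumerate(sentences) if i % n_folds != k]
        tagger = train_tagger(train)               # trained on the other nine-tenths
        for i, s in enumerate(sentences):
            if i % n_folds == k:
                retagged[i] = tagger.tag(s.words)  # tag the held-out tenth
    return retagged
```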
>>: [indiscernible].
>> Colin Cherry: Yeah, we retagged the training data tenfold, so each tenth is
tagged by a tagger trained on the other nine-tenths.
So finally, the thing that really slowed us down was learning with costs. In the
other problem, we could just use the cost parameter that was built into liblinear,
which is our SVM package. Here, we were rolling our own SVM, so we had to figure
out how to do costs correctly, kind of from the ground up. And that took a little
while. Turns out that the decision to go with primal gradient SVM was somewhat
suboptimal in that setting, because it actually becomes very difficult to do
class-specific cost parameters there.
Fortunately, Joseph Turian has started this thing called Meta Optimize. Have you
all seen this? It's a good resource where you can kind of go on and ask a question.
I explained my set-up, my issue, and John Langford came on and was like, oh yeah,
I've got a black-box solution for costing: it doesn't matter what your learning
algorithm is, you just re-weight your examples. I tried it out and it works, and it
took this thing from not working to working overnight, more or less.
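My understanding of that black-box costing trick is cost-proportionate rejection sampling: keep each training example with probability proportional to its cost, train an ordinary (unweighted) learner on each sample, and combine a few such runs. A sketch, with illustrative names:

```python
import random

def cost_proportionate_sample(examples, costs, rng=random):
    """Keep each example with probability cost / max_cost, so a standard learner
    trained on the sample behaves as if it minimized cost-weighted error."""
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

# Usage sketch: draw, say, ten samples, train an unweighted SVM on each, and
# combine the resulting weight vectors -- roughly what "ten cost-weighted samples
# of the training set" refers to later in the talk.
```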
But that's really important, because the learner, since it's doing the trade-offs,
needs to be aware of the cost of getting a true link wrong the whole way through.
If it doesn't understand that, then it won't be making the correct trade-offs.
So we test in the exact same setting as before -- again, the English treebank.
Before, I just showed you points in this filtering chain that were clearly better
than each other: I was like, oh, I've given up a little itty bit of accuracy for a
huge gain in filtering, so obviously you're going to buy in.
Now, unfortunately, we're working with a machine learning method, and we already had
a strong baseline that we'd already published a paper on, so we're going to have to
look at a weighted F measure to know whether or not the trade-offs we're getting are
good at the end of the day. So that's what we're going to look at: in particular, a
weighted F measure where we weight the recall of true links 25 times, and then 50
times, more important than, you know, any gains we get by cutting out links.
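Concretely, I take this to be the standard weighted F measure with coverage in the recall slot and reduction in the precision slot (my reading of the setup, with beta = 25 or 50):

```latex
F_\beta \;=\; \frac{(1+\beta^2)\cdot \text{reduction}\cdot \text{coverage}}{\beta^2\cdot \text{reduction} + \text{coverage}}
```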
So our two baselines: one where you independently tune each filter on a development
set, so each of the filters gets trained and we tune their parameters to settings
that look good to us on the development set. What we wind up doing is changing
'looks good to us' to optimizing either F25 or F50 independently on development.
That makes it a little more well defined, a little more systematic. So we wind up
with 16 hyperparameters, tuned two at a time.
Then you can jointly tune, where you start with the three or four best candidates for
each filter and try all of the roughly 6,000 combinations on development, again
optimizing for either F25 or F50. And then our method can tune its two
hyperparameters -- we have the same things, a per-class cost parameter and a
regularization parameter -- optimizing on development for F25 or F50 again. And this
kind of black-box costing that I alluded to earlier needs its own settings, but we
didn't optimize those for F measure; we just did ten cost-weighted samples of our
test set -- or, of our training set. And that winds up being this kind of four-hour
training procedure, so it's not that slow.
And here's the results. Just to keep myself honest, because the table looked a lot
better earlier, I decided to put in just using the rules. So this is all built on
top of using that rule-based filter that we had in the earlier talk. There's no
reason to throw that thing away. It's really, really precise and it cuts out 25%
of the links. So we might as well start from that always. You can see that it scores
very well on F25 already and it scores very well on F50. This is one minus the scores.
Lower is better. When you're starting at -- this guy starts at 99.2. When you're
starting at 99.2, you need to kind of show one minus to see the differences at all.
The take-away point, you can see by glancing at this, that it's better to do something
than nothing. Immediately, doing something cuts you down. Doing linear filtering
at all automatically cuts you down from three to two here, and from 0.8 to 0.6 or
something here.
The differences between systems are a little more difficult to spot, but this one
you can see, because it's conveniently distant from the line. So you can see we are
kind of getting our trajectory, and we are improving on this joint hyperparameter
baseline, which is actually quite a strong baseline -- I mean, it has all of our
desiderata, it's just a little ugly. But we are improving on top of it.
So -- and you can see here in a little more fine-grained detail what's happening,
because these F measures don't really tell you the whole picture. Optimizing for
F25 is actually a little aggressive; turns out that that trade-off does prefer, you
know, maybe a little bit more filtering and a little less recall than we might be
interested in at the end of the day.
But here are the trade-offs that you get. And you can see our winner here is 99.5
at 67. You can see that using that quadratic filter -- which, by the way, is three
times more expensive in terms of time -- you do get a lot more filtering at the same
coverage. But I think this is good. Like, if we were willing to accept 99.5 --
you know, 99.5 was good for getting four out of five links out of the picture before
we start -- now we can get two out of three at least. Two out of three is not bad,
as far as I'm concerned. And that's with the seven-second fast system, rather than
the 20-second quadratic system.
Over here, optimizing for F50, we actually landed by coincidence right near our
result from the Coling paper, which is nice because it allows you to see the
progression. So these two are actually using the exact same technique, but you can
see that the benefit from jackknifing -- from getting the part-of-speech tags
realistic -- is actually quite a bit larger than the benefit from using the improved
learning technique. But there is still an improvement here, where we're getting
higher coverage and slightly higher filtering. So I'll take it. It's also a heck
of a lot less ugly.
So what I'm running up against now -- I'm not considering this work finished; I'm
still trying to get the numbers up a little bit higher -- is that all of the methods
suffer a little bit from the fact that at extremely high levels of recall, like when
you're arguing between 99.5 and 99.6 in terms of recall, you start stressing over,
like, 60 links cut out of the system. And it turns out that 60 links actually
happens to be the exact difference between the development set and the test set if
you project out the prior probability of linking -- the probability, given a
potential link, of actually maintaining that link in the tree. On the development
set you always kind of get 60 links for free, because it just happens to have a
slightly lower prior probability of linking. There are fewer links in the
development set, and it's mostly due to this 119-word sentence that someone put in
there.
So the solution to my problem may be to just knock out the 119-word sentence. But
I'm trying to find ways to get sentence length in, because you can see that the prior
probability of making a link changes depending on the length of the sentence. So
I'm trying to find a way to get that into the model -- hopefully just as a feature.
It might have to be in there in some more creative way, so that the system is aware
of the fact that there's this independence assumption we're making, by running it as
a classifier, that's not really true.
I don't know if it's the independence assumption that's getting us so much as the
identically distributed assumption, where basically the prior probability of
linking does depend on some measurable quantity, which is sentence length. Where N
is the length of your sentence, you always make about one in N of the possible links.
And then before I try to publish this, I really think I should hit the other big
filtering problems out there and see if we get similar levels of improvement. And,
in fact, the people who have done these sorts of things have never even done it with
a cost factor built into their learning algorithm; they've always done post hoc
thresholding of a normally learned system.
Our own experiments show us that this thresholding is actually strictly worse than
learning with a cost factor, at least for our setting. So we should be able to make
big improvements -- if not in the final speeds, at least in the filtering numbers
for these two problems as well.
So to sum up this portion of the talk: we proposed a principled approach to jointly
optimize a number of overlapping filters. The kind of secret sauce here is that each
item selects a filter as a latent variable, and we have demonstrated improvements in
the weighted F50 measure over the system from the talk I just gave previously.
So this sums up the discussion. We talked about whether we can filter links -- the
answer is yes. And then we talked about whether or not the filters can be trained
together -- the answer is yes, and you do see some improvement from doing that extra
work. There was a lot of outside help on this, due to the kind of different networks
that Shane and I bring to the table. The common link, of course, is Dekang Lin, I guess.
But then again I have to say this meta-optimize thing was really amazing to have
all of these people thinking about your problem for a little while and saved me from
having to become an expert in cost-based learning. And then the whole thing, my
whole desire to visit latent SVMs at all was due to a discussion with Ming-wei Chang
and Vivek about their own work with latent SVMs.
So thank you very much.
Are there any questions?
Go ahead.
>>: So have you tried, I wonder what would happen if you say that all the filters
need to filter something out instead of --
>> Colin Cherry: Oh, yeah, instead of picking one, just change it all to an 'and'?
Yeah. My gut reaction -- I haven't tried it, although Shane has made the same
suggestion, so I should do it at this point; that's two strong data points indicating
that this is something worth trying. If we change that 'or' to a conjunction, I
suspect we'll lose the trade-off. We'll wind up trading off only on regularization
and not on assigning work to the various filters. That may be enough to just --
anyway, it's worth trying, because it's certainly a reasonable alternative.
And it would work in this setting, because it's kind of -- it's not really true that
they all have to be on, but we're not upset if they're all on. It's kind of the
setting we were leaving behind, because when you train them each independently, you
don't get to know, oh, maybe I don't have to handle this one, because only one link
is being missed out of a possible N that I would gain, you know, if the other N
minus one are being handled by these other filters.
So it loses a theoretical advantage. That doesn't mean that it's not going to work
better, because it's still convex. Which is another big theoretical advantage.
Go ahead.
>>: Let's see. So I mean, I have a series of questions. I'll try to restrain myself.
>> Colin Cherry: No, it's fine.
No, it's fine.
>>: So I guess one question is, with Hollingshead and Roark, one of the nice things
about it is that, at least at a theoretical level, it drops the exponent of the
procedure by one, right? Is there a strong correspondence here? I guess you've
tried to be agnostic of the particular inference system used, but I don't know, can
you take Eisner's algorithm and drop it to [indiscernible] on the N-squared work?
>> Colin Cherry: The answer is probably yes. I feel like we've done enough
filtering at this stage that it's probably possible. But they had this interesting
knob they could turn, which is just how much filtering they're doing in general.
Which actually, I'll have to go back and read their paper and see how they turned
that knob with two different classifiers, because you'd have to be adjusting two
thresholds at once. Yeah, I'm not sure exactly how they wind up doing that. But
they wind up accepting possible cells until they hit the point where they're at their
desired level of speed.
So my first thing is I'm a little more nervous about that, doing seven things instead
of two things, I guess. And the other thing is I don't think anyone would notice.
I think we could change the theoretical complexity of inference, like Eisner's
algorithm, for example, and I don't think anyone would notice, because if you -- once
you precomputed those scores, it just flies over the sentences.
>>: Totally understood, but that 119-word sentence kind of scares me, right, like
that's where the N-to-the-third is really [inaudible], you know.
>> Colin Cherry: Yes, that's true.
>>: And I mean, it feels like there's a lot of things you could potentially do there,
like try to chop the 119-word sentence into regions and minimize the arcs between
those regions or something.
>> Colin Cherry: It's a good point. But honestly, we did not even -- even in my own
parser, which I'd written myself and where I understood exactly how inference was
working, I did not take the time to bother saying, oh, if this link is being pruned,
save that inference. We intentionally, to some extent, only hit feature extraction
the whole way through. So I don't even
know, like, forget theoretical bounds, like whether or not I can take it down to
N squared. I don't even know if there's an empirical improvement, but my gut
instinct is that it would be minimal. Like you say, the longer sentences, it would
matter. Go, keep going, sure.
>>: So one of the things you suggested at the beginning was doing sort of combination
between the transition-based and the graph-based parsers.
>> Colin Cherry: Yeah.
>>: So it seems like one simple thing -- and maybe this has already been done, I
haven't read the paper -- is to take the [indiscernible] representation of the output
of the transition-based parser and re-rank it with the graph-based parser, right?
>> Colin Cherry: There's a lot of papers with McDonald and [indiscernible] both
as authors. I'd want to go through all of those. So this idea of the combination
came up during one of Shane's trips to Google, and it was something they had been
looking at. And they were kind of at this -- at that point, they kind of just had
written off the graph-based parser as just too much of a bottleneck to worry about.
Yeah, it is interesting. Anyway, there have been a lot of advances in dependency
parsing in, like, the last year and a half that I'd have to go over, because for a
lot of these questions -- like a packed representation of the output of a
transition-based dependency parser -- Wong's work is relevant to exactly how
efficient that is, and I think it might be a lot more realistic than it was when
we started.
>>: And then one final question, I'll cut it off after that.
>> Colin Cherry: We can talk later.
>>: I have to run.
>> Colin Cherry: Okay.
>>: So right now you're doing hard filtering; could you do something A-star-like
instead, because you could use this as a prioritization function over links, right?
Again, it means you have to dive into the inference procedure. But if you rank your
links, you know.
>> Colin Cherry: Yeah, no, that would make a lot of sense. We're kind of getting
into engineering concerns at that point, though, because there is a certain advantage
to just flying over the scoring all at once, whereas if we keep returning to
scoring -- you know, we do a little inference, then a little scoring, then a little
inference, then a little scoring -- I wonder if we'd wipe out any improvements we
saw. Is this the argument you always get when you look at A-star? I don't know.
Yeah, that's my only concern there: right now, it's algorithmically simple to have
the scoring over here, and it just happens once, and it's just a table of numbers I
look up.
>>: [Inaudible]. Access a cell and if it's there, use it, and if it's not
[inaudible]. I don't know.
>> Colin Cherry: I think it's definitely valid and you'd probably, at the very
least, you'd hope that you'd lose -- I'm nowhere near the table, but those little
drops in accuracy should go away at that stage, which would be exciting at least.
So yeah.
>>: You have a bunch of [indiscernible]. So there are conflicts of [indiscernible].
>> Colin Cherry: Any filter is always right, so filter always wins. So it's always --
>>: [inaudible].
>> Colin Cherry: If any of them says filter, we filter, at any of the stages. So a
link only has to be knocked out once, by any one of them. So -- what are we talking
about here? Are we talking about the three-stage process that I started with at the
beginning of the talk, or are we talking about the seven filters at the end?
>>: The first stage.
>> Colin Cherry: In the first stage -- in the cascade -- each filter only sees the
output of the filter above it, so we're propagating errors. There's no opportunity
to recover.
>>: [indiscernible].
>> Colin Cherry: Right, okay, okay. So that's, that's kind of this thing here.
So I'm doing, like, ands and ors, and we could do votes.
>>: [indiscernible].
>> Colin Cherry: I do not have a good answer to that question. It's kind of one of
those moments -- I think that's potentially an important insight; I'm trying to work
through it here. You have to understand how we arrived at this: we started with this
construction, you know, where any filter succeeding knocks out the link, and then we
kind of formalized it into looking at link-wise decisions. But we never really asked
ourselves if we could change this -- I mean, the 'and' slide literally happened two
nights ago. It was like, well, I've got to present this to some people and I've got
to figure out a way to characterize this relationship. Oh, it's a conjunction
relationship. Okay, we're golden.
Now that you see it like this, you can always ask yourself, why isn't it plus? You
know, plus would be the obvious one. Or average, or something like that. And
there's no reason not to, because at test time we still fly over all the tokens, and
then as we fly over the N squared we still don't extract any features -- we just take
a quick sum of the token decisions and then make our call. It's very similar to the
question that Christina asked: why don't we change the 'or' to the 'and' over there?
Now that we see it like this, of course there's a bunch of functions we could plug in
there, and I'm literally seeing it for the first time. So I guess I should try it.
I just wouldn't have anything
I could call it.
>>: [inaudible].
>>: So going in the opposite direction, like if you stuck with the ands and ors
there, I mean, it sort of looks to me far more like a sort of a decision tree type
problem than a linear problem. Especially in your first one, we've got this cascade
of classifiers. I mean, it should be just one big decision tree, right? I mean,
you should end up with basically the same result.
>> Colin Cherry: Yes. Yes.
>>: There may be some combinations of features -- conjunctions of features -- that
you can do there that you can't in the linear setting.
>> Colin Cherry: Yeah, the decision tree comparison has come up before. And
actually, once we start doing this latent assignment to filters, and start carving
up our training set to say, oh, these links are handled by this filter and these
links are handled by that filter, at that point it's almost just like a decision
tree, where your first decision is filter choice and your second decision is filter
result.
So there's definitely a connection there. And then again, theoretically, we're on
about the same footing as a decision tree, because it's still non-convex over here
and whatnot. So yeah, I'd have to think a little bit more about how to set it up,
but I do think there's something to be learned from that comparison. I'm just not
100% sure what it is.
Thanks.