>> John Platt: Okay. I'm very pleased to introduce Rich Caruana today. He's an
assistant professor at Cornell and a well-known machine learning expert. He got
his Ph.D. back at CMU in 1997 and has been in various places like JustSystem
Research, if you remember what that was, and has been a professor at
Cornell, and he's well known for doing interesting kind of new models of machine
learning such as clustering with hints or multi-task learning.
That was some of your original work. So he's going to talk about model
compression today.
>>Rich Caruana: Thank you very much, John. So I want to thank John and
Chris both for inviting me back for a second time and helping to arrange a visit.
I'm going to be here for the next two days. But I'll be in this building today. So if
anyone wants to try to grab me, please try.
Okay. So I'm going to talk about model compression. This is new work we've
been doing just the last couple of years. This is a work in progress. I can't give you
sort of final answers here. And it's joint work with two students at Cornell, Cristi
Bucila and Alex Niculescu-Mizil. And Alex is just finishing up right now.
And first I guess let me -- whoops. Let me tell you what I'm not going to talk
about. So I'm probably best known for my work in inductive transfer and
multi-task learning. So I'm not going to be talking about that sort of thing at all.
And as John mentioned I've done some work in some of the supervised
clustering and meta clustering. That would be a fun thing to talk about actually. I
almost decided to talk about that and John said there were some people here
who were interested in the compression stuff.
So I decided I'd talk about compression instead.
I won't talk much about learning for different performance metrics. Although,
that's going to come up in this talk. And I won't talk at all about medical
informatics. We've been doing some fun stuff in the citizen science arena,
learning from very messy citizen science bird data. And it's really fascinating.
We've come up with some cool ways of regularizing models when things are this
noisy. And I'm not going to talk at all about microprocessors. In fact, there's a
person who joined Microsoft Research a year ago, Engin Ipek, who has been
my collaborator in this work. So if you're interested in that, he's here, so go look
him up.
So those are things I won't talk about. So I'm going to talk about model
compression, but I'm going to start and spend -- it might be the first 10 or 15
minutes talking about something else to help motivate the need for model
compression, because I don't want it to look like an academic exercise. I want
you to realize why you have to have something like this.
So I'm going to sort of talk about Breiman's constant. It's almost a joke, an allusion
to Planck's constant in physics. And I'll spend most of the time talking about
ensemble selection which motivates why we need to do this compression. And
then I'll jump into compression and talk about density estimation and show you
some results and then the future work.
Okay. So let's see. Does everybody here have a machine learning background
for the most part?
>>: Mostly.
>> Rich Caruana: Great. So I can skip sort of the machine learning intro that I
had prepared just in case people didn't. And let's jump right to the fun stuff.
So let me tell you about Breiman's constant. So Leo Breiman and I were at a
conference in '96, '97, chatting during a break. And he said something like: You
know, it's weird every time we come up with a better performing model, it gets
more complex.
So when we were talking about boosting and bagging and how decision trees,
one of their beauties had been that they were intelligible, and now that we were
boosting and bagging them, you know, we had even lost this beautiful property of
decision trees, and wasn't that a shame. Sort of jokingly I said, oh, they're
like linked variables, like Planck's constant.
>>: (Inaudible).
>> Rich Caruana: Exactly. You know, you can't know both the location of an
electron and its momentum, both to infinite precision, at the same time. If you know
one well, then you inherently know the other less accurately.
I sort of joked well maybe something like this is true in machine learning. And if
the error of the model is low then the complexity must be high and vice versa.
So there's this natural trade-off. And the reason why I put this here, I mean it's
mainly a humorous way to start the talk, but we are going to be talking about this
sort of thing.
Model complexity and how big does the model have to be if it's going to be
accurate. Is it always going to be the case that the most accurate models are
humungously big and all that sort of stuff. And ultimately compression is going to
sort of try to get around this simple statement.
So okay. Here's a huge table. Don't worry, this table is taken from some other
work I've done. And I'm just going to use it to motivate the need for this model
compression stuff. But I am going to have to explain it. There's a lot of numbers
here.
Let me walk you through it and it will suddenly make sense and I promise there
will be a lot more pictures coming later in the talk that you won't have to look at
these sorts of things.
Here we've got families of models. All different kinds of boosted decision trees, a
variety of random forests with different parameters and different underlying tree
types. A variety of bagged decision trees, support vector machines with lots of
different kernels and lots of different parameter settings. Neural nets of different
architectures, trained with different learning rates and momentum, things like that,
different ways of coding the inputs. A variety of neural nets. Every
memory-based learning method we could think of, including combinations of
them.
There's actually 500 different combinations of these hidden under that. Boosted
stumps. That's a relatively small class. Vanilla decision trees. There's only a
dozen flavors of decision trees. Logistic regression, run a few different ways
with a few different parameter settings, and also naive Bayes, run a few different
ways.
So it turns out that this represents, this column, 2500 different models that we've
trained for every problem we're going to look at. So we really went crazy. We
did everything we could think of to train good models on these problems. Then
we went a little crazier. PLT stands for Platt.
So some of these models don't predict good probabilities right out of the box.
Others do. Neural nets can predict pretty good probabilities as can K nearest
neighbor. But turns out boosted trees don't tend to predict good probabilities,
neither do SVMs unless you do some work. So John Platt came up with a
method for calibrating models. You first train the model, then apply Platt
calibration as a post calibration step to improve the quality of the probabilities.
It turns out that works very well. There's also a competing method isotonic
regression, which is also good in other circumstances. It turns out, to just
summarize a piece of work we did a couple of years ago, John's method
works exceptionally well if you have limited data which you often have when you
do this calibration step. If you have lots of data, isotonic regression, because it's
ultimately a more powerful class of models, may work better if you've got lots of
data.
But if you have little data, you should stick with Platt's method. Star means you
didn't need to do any calibration.
So we've taken this 2500 models and now we've calibrated them all using Platt's
method, using isotonic regression and not using any method. So in fact this
becomes 7500 models, if you do the three different calibration methods.
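For readers following along, here is a minimal sketch of the two post-hoc calibration steps being discussed, Platt scaling and isotonic regression. It is written against scikit-learn purely as an illustration, not the speaker's code; the function and variable names are assumptions.

```python
# Illustrative sketch of post-hoc calibration on a held-aside validation set.
# Assumes `val_scores` are raw model outputs and `val_labels` are 0/1 labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

def platt_calibrate(val_scores, val_labels):
    # Platt scaling: fit a sigmoid 1 / (1 + exp(A*f + B)) to the raw scores.
    lr = LogisticRegression(C=1e6)  # effectively unregularized
    lr.fit(val_scores.reshape(-1, 1), val_labels)
    return lambda scores: lr.predict_proba(scores.reshape(-1, 1))[:, 1]

def isotonic_calibrate(val_scores, val_labels):
    # Isotonic regression: fit a monotone, piecewise-constant map to probabilities.
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(val_scores, val_labels)
    return iso.predict
```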
Okay. So we have 13 different test problems that we're working on. And what
we do is we do everything we can to train a good boosted decision tree on every
one of those test problems to get the accuracy as high as we can.
Question?
>>: What's the size of the problems?
>> Rich Caruana: So all these problems have dimensionality, about 20 to 200,
and it turns out we've artificially kept the train sets modest at about
four to 5,000 points. Even though on some of these data sets we have 50,000 or
more points.
And that's sort of to make the learning challenging, and it's also to make the
experiments computationally feasible. Do you need more information or is that
good?
>>: Is this coming -- based on the data set?
>> Rich Caruana: That's a good question. So half of these are from UCI
because you sort of have to use UCI so other people can compare and the other
half are things for which I really have collaborators who care about the answers.
Because I don't believe in sort of overfitting to UCI. And, of course, we put in
more effort into the ones we have real collaborators for. So that's a good
question.
Okay. So what have we done? I'm just summarizing other work here. This is
really still motivation. But if you don't understand what the numbers are, it won't
help motivate things.
So what have we done here? We've trained as many boosted decision trees as
we could on each of those problems and using a validation set we picked the
particular one that's best for each problem for accuracy. There's five-fold cross
validation here. So these numbers are fairly reliable.
Then we've normalized these scores so that for every performance measure, no
matter what it is, even for something like squared error, we've normalized it so that one is truly
excellent performance. Nothing really should be able to achieve one because we
had to cheat to get that high quality.
And zero would be baseline. So hopefully nothing is doing near baseline. So
basically just think of these as a number near 1 means really, really good.
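Here is a minimal sketch of that normalization, with illustrative numbers; the baseline and top anchors are per-metric, per-problem values from the experiments, not values given in the talk.

```python
def normalize_score(raw, baseline, best):
    """Map a raw metric value so baseline performance is 0 and the top anchor is 1."""
    return (raw - baseline) / (best - baseline)

# e.g. a raw score of 0.87 with baseline 0.75 and top anchor 0.95 normalizes
# to approximately 0.6 (hypothetical numbers for illustration only)
print(normalize_score(0.87, baseline=0.75, best=0.95))
```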
>>: Are these binary classifications?
>> Rich Caruana: They're all binary classification problems. In fact the few
problems that weren't binary classification problems we turned them into binary
classification problems. That's why we can do things like AUC, the area under the
ROC curve. Thank you for asking that.
This is the average performance we could achieve with a whole bunch of
boosted decision trees with and without calibration on the 13 problems, when
we're doing model selection for accuracy. And this is the average performance
we could get when we're trying to optimize for F score, which you might not be
familiar with, but don't worry it's not going to be important for the talk.
And this is the best we could do for lift. And this is the best we could do for area
under the ROC curve. Best we could do for average precision, the precision recall
break-even point, for squared error, and for our good friend, log loss.
So that's the best we could do with boosted trees. And this is the best we could
do with random forest. So remember these are averages over many
experiments. Different models are being picked for different metrics and for
different problems, but they're all boosted decision trees in the first row. They're
all random forests in the second row. Different neural net architectures in this
row. But what is best for each problem.
So this is sort of the big average picture of performance. And looking at different
metrics. And then this is the average. This last column. So that's the average
performance across all the metrics.
One advantage of normalizing scores in the way we've done it is the semantics
are pretty similar from score to score and from problem to problem so it actually
makes some sense to average across them this way.
So that's our real motivation for having done that.
>>: As far as the error wise is that just to (inaudible).
>> Rich Caruana: That's a good question. It turns out the differences in this final
column of about .01 are statistically significant. Here we're not averaging over
quite as many things. So .01 is possibly significant. So sometimes it is and
sometimes it isn't.
It's difficult, by the way, to do real significance testing. We've got five-fold cross
validation under the hood. And it turns out the performance on the different
metrics is highly correlated, if you do well on some metrics you do well on the
other metrics.
It means they're not truly independent. So it's hard to know anyway. So when I
bold things, I mean that we've sort of looked at the numbers behind the scenes,
done some sort of simple T tests and if we have bolded them that means you
should view those things as statistically identical. It's not a reliable test. So it
doesn't mean that every time things are bolded they are indistinguishable and if
they're not bolded they are distinguishable, but I just wanted to guide your eye
because we can't scan big tables of numbers like these. In fact, if you want to
focus on the mean that would be fine. Although, these other columns will be
important later on.
What have we got? We've sorted things by overall performance. So boosted
decision trees are sort of at the top of the table. But if you look at this mean
performance over here, it will turn out this is a three way tie, really, for first place
in mean performance.
So boosted trees, random forests and bagged trees are all doing exceptionally well
across all of these metrics and across all these problems.
>>: Limited (inaudible) for the boosted trees or did you --
>> Rich Caruana: Ah, so we tried every kind of tree we could, including lots of
different parameters for the trees. So we grow full-sized trees and boost them.
We grow reduced trees and we boost them.
We also do something with boosting that some people do and don't do. We do
early stopping, which is we boost one iteration, two, four, eight, 16, 32, out to
2,048. And the validation set is allowed to pick whatever iteration works best. It
helps boosting a lot, by the way. It wouldn't be in first place in this table if you
didn't do that.
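Here is a minimal sketch of that early-stopping scheme. It is an illustration only: it uses scikit-learn's gradient boosting as a stand-in for the boosted trees in the talk, and the function and variable names are mine.

```python
# Boost out to 2048 rounds, score on a held-aside validation set at
# 1, 2, 4, ..., 2048 rounds, and let validation pick the round count.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def boost_with_early_stopping(X_train, y_train, X_val, y_val, max_rounds=2048):
    gbm = GradientBoostingClassifier(n_estimators=max_rounds)
    gbm.fit(X_train, y_train)
    checkpoints = {2 ** k for k in range(12)}          # 1, 2, 4, ..., 2048
    best_auc, best_rounds = -np.inf, 1
    # staged_predict_proba yields validation predictions after each round.
    for rounds, proba in enumerate(gbm.staged_predict_proba(X_val), start=1):
        if rounds in checkpoints:
            auc = roc_auc_score(y_val, proba[:, 1])
            if auc > best_auc:
                best_auc, best_rounds = auc, rounds
    return gbm, best_rounds, best_auc
```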
>>: You don't do that?
>> Rich Caruana: Surprisingly, some people don't do that. There's one problem
here, by the way, where boosting starts overfitting on its first iteration, which
means that the validation set prefers iteration one, which is before any boosting
has occurred.
And every iteration after that actually goes downhill on most metrics. So it turns
out that boosting would not be doing this well. Boosting is a very high-variance,
risky method. But when it works, it works great. Random forests are doing so
well for a completely different reason. They're sort of reliable time and time
again. They never really grossly fail. And, similarly, bagged trees are pretty
reliable.
Okay.
>>: This may be philosophical. But you mentioned you have a three-way tie for
first place; isn't that more like a five-way tie?
>> Rich Caruana: Yeah, so it turns out that these are statistically distinguishable
as far as we can tell. No matter -- we've done a bunch of bootstrap analysis with
the data and we always get this sort of same picture. Maybe two of these things
would change places but we never see these things move into the top or those
things move down.
So we tend to believe that SVMs and neural nets really are in this sort of tie for --
it's not second place, it's third and a half place. Four and a half place, I'm sorry.
By the way, so don't overfit to these results. These are all problems of modest
dimensionality, 20 to 2200 if you're doing bags of words. In fact, we've repeated
this kind of experiment in high dimensions. The story is quite different in higher
dimensions.
If you want I can tell you about it sometime. As you expect the linear methods
really come into their own once the dimensionality breaks 50,000 or 100,000. But
for the world where you've got say 5000 training points binary classification and
modest dimensionality this seems to be a pretty consistent story.
People tell us, your story is like this because you forgot to include data sets that
had these difficulties. So then we added those data sets and the story didn't
change. So, okay, some interesting things.
Let me come back to that in a second. Interesting things are the top of the tree is
ensemble -- I'm sorry, the top of the table is ensemble methods of trees. So I
didn't necessarily expect that when we started doing that work. So that's kind of
interesting and that itself is going to be part motivation for the need for the
compression work.
>>: I meant to ask you, you never tried an ensemble of logistic regression?
>> Rich Caruana: No. Although we have done ensemble methods of
perceptrons, voted perceptrons. And they do pretty comparably, especially in
high dimensions, to logistic regression sometimes. We've done ensembles of those. It's
not here.
They don't do as well as these. Except in very high dimension. Very high
dimension, the story is quite different.
In fact, logistic regression all by itself in extremely high dimension starts to
match the performance of the best of these. Boosted trees it turns out don't hold
up very well in very high dimension. To our surprise, random forest just take high
dimensionality in stride and do extremely well in high dimensionality.
HATS (phonetic) also do surprisingly well in high dimension. I didn't anticipate
that. I thought they would also have trouble. SVMs, because we include linear
SVMs in our mix of SVMs, they also do well in high dimension because you end
up picking the linear kernels for SVM.
>>: (Inaudible)
>> Rich Caruana: No, in that case, in very high dimensions, you sort of have to
have more data. We're using the natural size of these other 12 test problems
that we have. So in some cases it's as much as several hundred thousand
points in the train set. Good question. With only 5,000 points in the train set, I don't
know if 100,000 dimensions would mean much. Very good.
The tree methods are doing very well. You might think -- might look at this table,
great, nice answer for machine learning. As long as we use one of these
ensemble of trees we've got our top performer, works well across all sorts of
metrics. Great. You'll be sad if you're an SVM aficionado because it didn't
quite make it to the top. Turns out that's not the conclusion you should draw.
We're adding one more line to the table.
Remember, this is the best we could do with boosted trees. This is the best we
could do with random forest. This new line, we're just agnostic to the method, we
take the 7500 models we've trained down here and we just use the validation
set to pick the best one.
So this gets to use anything it wants. And the surprise to me is not that it's
better. You'd expect it to be better. It's how much better is the surprise. So
remember this was like a three-way tie for first place and then these guys were
sort of coming in second.
Well, the differences here are dwarfed by this difference. So this tells you that it's
not the case that boosted trees or random forests or bag trees are always
consistently one of the best models.
The only way to get that large difference is if occasionally some of the models
down here are the models that are best. In fact, when you look at the details you
see that. You see there are problems for which logistic regression, whose
average performance is not so high, is the best model by a
significant amount on this problem and if you didn't use it, you actually are going
to lose significant performance. Because these models don't do well on that
problem.
And there's another problem for which boosting fails miserably, random forest,
none of the tree methods, truthfully, do very well. But it turns out boosted stumps
do extremely well on this problem. And if you weren't looking at boosted stumps
you would have trained an inferior model.
>>: Just to make sure. The procedure for creating that first row is you're taking
the validation set, picking the best model with it, and that's the number that's shown?
>> Rich Caruana: Yes, thank you, right. That's -- that's right. So we train 7500
models. We use our held-aside validation set, pick the one that looks best for
accuracy on each problem. And then on the big test sets that we have we report
the performance and convert them to standardized scale. Thank you. I should
have said that.
>>: (Inaudible) but is it the same (inaudible) if you do the same thing (inaudible) I
would think that you would not overfit nearly as much.
>> Rich Caruana: Right. Right. The differences between methods start to
become less in very high dimensions. And the linear methods catch up with all
the other methods. And then a few methods clearly break. And boosting is one
of them. So boosting just overfits dramatically in very high dimension, unless
you've done something to regularize it.
>>: What about a much larger training set, is it the same?
>> Rich Caruana: Good question. The high dimensional experiment we've only done
with the natural size of the data sets as other people experimented with them.
And for these data sets we've only done experiments with the sort of 5,000
training sets. So I've never actually been able to do the learning curve.
The experiments are expensive. It takes us several months to create this table
and in high dimensions it took even more time. We had to write a bunch of
special purpose code in fact to do the high dimensional experiments. It just
wasn't that easy to train some of these things on that high dimension.
Sure.
>>: (Inaudible).
>> Rich Caruana: Which single model?
>>: Yes.
>> Rich Caruana: So it does turn out that the best single model on average is
boosted trees.
>>: The very best one is (inaudible).
>> Rich Caruana: Oh, oh, I'm sorry. So this thing could be two neural nets on
two problems. It could be three boosted trees on three other problems. It could
be random forest on four more problems. It could be logistic regression on one
problem and boosted stumps on another problem. And that's for accuracy.
Then when we go to ROC area, it may prefer very different models. It might
suddenly prefer neural nets and K nearest neighbor. In fact, K nearest neighbor
and neural nets do quite well on ordering metrics like ROC area.
So there's no easy answer to that. But the important thing is that it's not the case
that just sticking with a few best methods is safe if you really want to achieve the
ultimate performance. To actually do it, sadly, it looks like you have to sort of
try everything and then be a good empiricist and use a validation set to pick the
best thing.
So that's the take-away message from this. Okay. So now we're going to make
that a little worse. Now, I've just said that if you really want high performance,
you have to train 7500 models or 2500 models and calibrate in different ways
and use the validation set to pick the best. Can we do something better than just
picking the best? We all know what ensembles are. Can't we form an ensemble
out of many of the models we've just trained and possibly do even better
than the single best model in that set?
And you all know that as long as we have a bunch of classifiers, many of which
are accurate, and hopefully they're different from each other, diversity in the
models, there's a good chance we'll be able to form an ensemble that's better
than any one of those models.
Lots of ensemble methods around. In fact, we're using some of them in the
table. There are things we're not using, error correcting codes. There's a lot of
things we can just do. We can take an average of all those models. We can do
bayesian model averaging where you take an average but now weight it by the
performance of the model. So higher performing models get more weight.
We could do stacking, which is trying to learn a combining model on top of the
predictions of all those models using the validation set as the training set for the
stacking model.
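As an illustration of the stacking setup being described, here is a sketch, not the actual code from this work; the combiner choice (logistic regression, one of the combiners the speaker says they tried) and all names are assumptions.

```python
# Stacking: learn a combiner on top of the base models' predictions, using
# the held-aside validation set as the training set for the combiner.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacker(base_models, X_val, y_val):
    # Each column is one base model's predicted probability of the positive class.
    P_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])
    stacker = LogisticRegression()
    stacker.fit(P_val, y_val)
    return stacker

def stacked_predict(stacker, base_models, X):
    P = np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
    return stacker.predict_proba(P)[:, 1]
```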
All these ensemble methods really differ in just two ways. One, how are the base
level models generated and, secondly, how are those models combined?
So the base level models are generated by the process we just described. We're
now going to try to build an ensemble out of those things. We basically train every
model in the kitchen sink, just train everything, everything that makes any sense
that we can afford to do and we keep it.
Now we're going to try to combine them in different ways. So let's do that. So
I've pruned the bottom of the table, but here's the top of the table. Here's the
best line, which is as before -- you can now think of best as an ensemble. It's the
ensemble which puts weight one on one of the models and weight zero on all the
other models. It's a funny ensemble. You can't expect it to do better than the
best model since it is the best model.
>>: Stacking --
>> Rich Caruana: So stacking, what we've done is we've tried to use, say,
logistic regression to combine all the predictions. That didn't work very well.
Then we tried using SVMs to combine the predictions. We had a lot of trouble.
You'll notice the performance is not very good. We had a lot of trouble because
the 7500 models are all very correlated with each other, and our validation set is
modest. It's only a thousand points that we've held aside for validation.
So the stacking always overfits dramatically. Now, of course, we can set
parameters for stacking so it does something like just take the average of
everything. We can tune stacking that way. But we wanted stacking to have the
freedom to hang itself if that's what it was going to do.
And we didn't force it to do average all, because we have that as our own
separate line in the table. So this is the poor performance we got with stacking. And Bayesian
averaging, by the way, does -- and this is not statistically meaningful -- just epsilon better
than picking the best single model.
There's different ways of doing bayesian averaging. We're just exploring one
particular approach there. And now that we've made this work publicly available
other people have come to us and suggested other approaches.
We think that our results with both stacking and bayesian averaging are not the
best that can be achieved. So we're confident if we spent some more time -- we
put a fair amount of effort into it. We were surprised they weren't better. We
thought they just would be right out of the box.
We put some time into it. But we were surprised with the difficulty we had getting
improvements from these things. But we're sure that there are ways to make this
work. It's just that in this sort of month that we spent trying to make it work, we
didn't hit on it.
It will turn out for the model compression story I'm going to talk about soon, it
doesn't matter. If you can make one of these things work, that's great. And
you're still going to need the model compression that I'm going to talk about.
But what we're going to do is just create our own stacker to combine these
predictions, which we called ensemble selection, because we were disappointed that
those methods didn't work. We really thought there must be a way of getting
even better performance out of this large set of models. So we just quickly tried
something, and it sort of paid off right away.
So let me describe that to you quickly. I won't go into the details of this. Train
lots of different models using all the parameters and stuff you can. You've
already seen that we're doing that.
Just add all the models to a library. Don't throw any of them away. No pruning of
models or anything like that, just keep them all around. And then we're just going
to do good old forward stepwise selection.
If you've done forward stepwise feature selection, which has been around for 50
or more years, we're just going to do forward stepwise model selection. Just
one at a time we'll add models from this collection into the ensemble in an
attempt to keep making it greedily hill climb toward better performance. Let
me walk through that. Here's 7500 models. There's our ensemble, which we
start off with nothing in it. And we've been asked to build an ensemble that
optimizes area under the ROC curve. That's our job.
Nice thing about this method is you can optimize to any performance metric as
long as you can calculate it reasonably fast. Here's the ROC area of each of these
models individually. We find the one that has the best ROC area and we put it in the
ensemble. Now the ensemble is just that one best model. Now it's equivalent to
that best line in the table in fact. It's not an ensemble yet.
Okay. So that model is now in there. Now we go back to the remaining models
and we figure out what would the ROC area be if this model, Model Five, were to be
added to Model Three and their predictions averaged?
So it turns out to be .9047. That's not an improvement. .9126. That's not an
improvement. Whoop, that looks pretty good. Better than what we've got up
there: .9384. So we find the model of the ones that are left that's best to add to
the ensemble and we add it.
>>: Can you weight it a different way?
>> Rich Caruana: You can imagine different weights. We find that every time we
make it too flexible, it overfits. If we had 10 or 100 times more data, the world
would look quite different. And we were able to get mileage out of this very
simple method. The only thing we do end up doing, and I'm not
showing it here, is greedy selection with replacement. Which means a
model can be added two or three times. It turns out that's important for reasons I
won't go into. And when a model is added three times it does get three times the
weight of a model that's added once.
So we do let it very crudely adapt the weights. Okay. So we go back to the well.
We find the model that when added to the ensemble would make it best. We just
keep repeating this until things stop getting better.
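Here is a minimal sketch of that loop, assuming the predictions of every model on the validation set have been cached; the overfitting controls the speaker alludes to in a moment are left out, and all names are illustrative, not the original code.

```python
# Greedy forward stepwise ensemble selection, with replacement, hill
# climbing on whatever metric you care about on the validation set.
import numpy as np

def ensemble_selection(val_preds, y_val, metric, max_steps=200):
    """val_preds: (n_models, n_val) cached validation predictions.
    metric: callable(y_true, y_pred) -> score, higher is better."""
    n_models = val_preds.shape[0]
    counts = np.zeros(n_models, dtype=int)        # times each model was added
    ensemble_sum = np.zeros(val_preds.shape[1])   # running sum of predictions
    best_score = -np.inf
    for step in range(1, max_steps + 1):
        # Score every candidate: average of current ensemble plus this model.
        scores = [metric(y_val, (ensemble_sum + val_preds[m]) / step)
                  for m in range(n_models)]
        m_best = int(np.argmax(scores))
        if scores[m_best] <= best_score:          # stop when nothing improves
            break
        best_score = scores[m_best]
        counts[m_best] += 1                       # selection with replacement
        ensemble_sum += val_preds[m_best]
    # Final weight for model m is counts[m] / counts.sum().
    return counts / counts.sum(), best_score
```

With scikit-learn's roc_auc_score as the metric, the first model picked is simply the one with the best validation ROC area, matching the walkthrough above, and a model added three times gets three times the weight of one added once.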
Okay. Now, the more models you've got, the better chance you have of finding a
diverse set of high performing models that actually work together in a way that
gives you high performance. That's great. But this overfitting is really a killer
when you've got this little data. We've had to come up with a number of tricks to
mitigate this overfitting. And I am telling you this because it's so natural to want
to go and try this, if you've tried a bunch of models. And you might not realize
that over -- if you don't control overfitting, you'll actually do worse than if you
didn't do this in the first place. You're better off picking the best model than trying
to greedily form an ensemble if you don't control overfitting. This is critical.
That's the subject of another paper. We won't go into that. But you have to do
something to take care of it. We have some hacks doing it. It works well. Here's
how well it works. There's the ensemble selection added to the top of the table
and I've re-bolded all the entries. I think you can see ensemble selection
has gotten -- remember the best line was really -- I mean these were pretty
good performance down here. These are some world class models. And we're
trying every variation of them under the sun that we could afford to try. And
picking out all the best things down here was much better. There's yet this other
big increment by taking this ensemble of those different models.
So that's nice. I mean that's what we're hoping to see, with some effort we
actually got that. So that really is -- I mean in some sense it's beyond state of the
art performance for simple machine learning.
>>: Do you think, in principle, the Bayesian average or any weighting scheme
could have achieved the same ensemble, by putting in those weights and zeros
everywhere else?
>> Rich Caruana: Exactly.
>>: But, in fact, it doesn't because of overfitting on the small dev set.
>> Rich Caruana: That's exactly right. There's something about the greedy step
wise selection process with our overfitting controls that makes it a more effective
algorithm, than these other methods which should have been able to do it but
somehow couldn't.
>>: You're not guaranteeing that, or are you, that the rise would be (inaudible).
>> Rich Caruana: No, no. In fact the graphs are quite noisy. Turns out that
sampling with replacement makes them better behaved. It's one reason we do
that. It turns out, by the way, occasionally this thing overfits and in fact you
would be better off for some problem in some metric just picking the best model
and this thing is epsilon worse than that.
It picked the best model and put it in first and then it made mistakes afterwards.
But on average, across many problems of metrics, its performance is quite, quite
good.
>>: If your validation set was larger, would the results still hold?
>> Rich Caruana: If the validation set was larger, I think the results would still
hold. But stacking and bayesian averaging, I think, would be doing much better.
Yeah. They would be competitive. In fact, it's possible they would outperform
this. Bayesian averaging, I tend -- we can talk about this afterwards. Sort of a
philosophical question. I tend not to think of bayesian averaging as being much
of an ensemble method as being a hedging-your-bets method.
So but we could talk about that later. It would be fun to hear what your opinions
were on that.
>>: Have you tried to expressly incorporate diversity?
>> Rich Caruana: That's a great -- we did. And we read papers on how to
calculate diversity between models. And the funny thing is we never got any
mileage out of it better than ensemble selection. In fact, it was hard to duplicate
the performance. If you think about it, by greedily selecting the next model to
add to the ensemble to improve its performance, in some sense it's implicitly
thinking about diversity even though it has no explicit measures of diversity.
It is going for the model from the large set that, when it adds to the models
already there, sort of maximizes performance. The odds are they're diverse.
Turns out, if you look at the models that get put into the ensembles it's really
fascinating.
It never, ever, not on a single problem or metric, sits there and chews up 90
percent of one model class and just adds a few others. It's not like that at all. It
pulls in 23 percent of this class and 16 percent of that class.
And it turns out, if you're after squared error, it throws in a bunch of neural nets. It's very
interesting to see what it likes to use. It's just fun. You can spend days just
looking at the ensembles and sort of telling yourself stories about them.
>>: Did you ever try using the original training set for this ensemble selection
instead of a validation?
>> Rich Caruana: Yeah, and it just fails terribly. Turns out the performance in
the models can be so good on the training set that it just fools itself right off the
bat. Yeah. Yeah. You have to have the independent validation set or else it
really does bad things.
There are tricks you can do with five fold cross validation so you ultimately get to
train on everything and validate on everything. It takes more effort to do it. But it
does work.
>>: Do you have any insights as to what it is about these data sets that prefer
(inaudible)? Ideally you'd have some simple method that says now (inaudible).
>> Rich Caruana: Yeah, yeah.
>>: We're kind of anxious to find out.
>> Rich Caruana: We did get some insights. But I wouldn't want to -- they're
almost the kind of insights you would have before you looked at the results.
Realize, we only have 13 data sets. And we don't have any well-defined way of
characterizing them. So we have a small sample size in that sense.
But you do see things like data sets that we knew were noisy, because of our
previous experience with them, those are the ones boosted trees do not do well
on. Although, random forest still do quite nicely on them. Bagging does quite
well and boosted stumps do quite well.
So things that you would expect sort of happen. Things like that. But I can't say
we've had too many eurekas. Like, oh, if the ratio of nominal attributes to
continuous attributes is greater than .5, then you should be using memory-based
learning.
Sadly, if we could do the same sort of experiment but now with hundreds of data
sets, now we could actually do even sort of machine learning to try to take the
characteristics of the data sets and predict what model would do well on it and
what would do poorly. That would be fascinating.
But we'd have to increase an order of magnitude or two orders of magnitude to
be able to even touch that. So it's a great challenge, though.
So this thing really works. It's kind of nice. This really is phenomenal
performance. If you've got a limited amount of training data and you want great,
great performance, and every little improvement you get makes a big difference.
It either saves a life somewhere or it increases your bottom line by a million
dollars, I mean this is a good technique.
Okay. So it really works well. It's interesting that we haven't hit the supervised
learning ceiling yet, right? I mean with some fairly simple stuff like training a
bunch of models, picking the best and forming a greedy ensemble out of them
we're really upping the performances that we see quite a bit.
It's cool. You can optimize these things to any performance metric. That's one
reason why they work well. We don't know how to train neural nets to any
performance metric but we can optimize the ensemble to any metric, so that's
kind of nice.
It works even when some of the base level models aren't good on those metrics,
because it can still form combinations of models that work well on that metric,
even though no one of the models work on that metric.
So it has some room to sort of improve over the base level models can do. And
then there's some nice things like it takes a long time to train all those base level
models. But forming the ensemble just takes seconds on the laptop. It's actually
the greedy forward selection. It just flies through that. And that's because we
cache all the predictions.
It's actually quite fast. And there are cool things you can do. Like you can
imagine that the world changes. You have some new small sample of labeled
training data. You don't retrain all the base level models. Maybe you don't have
enough data to do that. But you could redo the model selection to build the
ensemble. Just like that. You could do that every ten minutes, if you wanted to,
if it was the stock market or something. You'd have to have labeled data, a new
validation set, to do it.
So this has some nice properties. So this is good. Here's a really big problem,
and this is the -- this is all motivation. (Chuckling) so there was a really big
problem, and the problem is that these things are big. Way, way too big. So
think about it. Some of the base level models are things like boosted trees and
random forests and bagging and K nearest neighbor.
So I just pull an ensemble out. And this ensemble had 72 boosted trees in it.
Each boosted tree can have either a thousand or two thousand trees in it. Right?
So 72 boosted trees turn out to be 28,000 trees.
It had only one random forest. This one didn't like random forests. But even that
had another thousand trees in it. It had five bagged trees. So that was 500 more
trees. It was a lot of trees in here. It had 44 neural nets of different sizes. So
that was a total of 2200 hidden units that got added.
So it's a large number of weights. It had 115 memory-based learning models in it.
And it had a bunch of SVMs, different ones, RBF kernels, you know, whatever.
Boosted stumps, you get the picture.
So it's a big thing. This particular one takes about a gigabyte to store all those
models. Gigabytes are still big things. And it takes almost a second to execute
that huge number of models to then take the average prediction to make a
prediction for one test case.
>>: (Inaudible) sorry.
>> Rich Caruana: I'll show you typical results, yeah. But it's not atypical. It's
slightly larger than typical because it makes it more fun.
But it's not that atypical. Good question.
Was that your question as well?
>>: Yes.
>> Rich Caruana: [Laughter] okay. And I just want to convince you this really is
a big problem. I mean so think about web search. What do you get? A billion
queries? A week? A month? Whatever? You're not going to execute this thing
a billion times.
I mean you're not going to even execute it a billion times to cache the results. A
billion is a lot. So the test set is large. If you really want to use the thing in real
time, you haven't got a chance. It is longer than the delay that's allowed when
the answer has to get back to the user.
And you can try to parallelize it. If I had 30,000 models in there, maybe you can
dedicate 30,000 machines, one to each model but the communication costs -- it's
just not going to work.
So not going to use it for web search. You're not going to do face recognition by
scanning your little box across your image at multiple scales to try to find faces,
right? Because, again, the test set is going to be too big. If you're trying to do
this on video it would be even worse than still images.
Satellites, think about it. The memory they put in satellites is still like PCs of a
decade or more ago. Right? So it's hardened memory. So it's small. Means
you can't fit these models in memory anymore. God knows you don't want to go
to secondary storage to get to them.
You can't do that. The processors have very little speed, very little power. Never
going to put these on a satellite. Not going to put it on a Mars rover. Power
considerations alone would prevent you from putting it on the Mars rover.
The same thing for hearing aids. You could spend a lot of money on hearing
aids. People pay money to hear. But you're just never going to have the power
there unless you start putting little nuclear power plants in their ears. PDAs and
cell phones, now cost is a real issue. Maybe PDAs are getting powerful enough
that they could start to do this. Depends on what you're using it for. If it's trying
to figure out where you're going or what restaurant you're interested in, maybe
you could afford it. But then the cost would be prohibitive.
These things are at the margin; an extra 10 cents on the cost of a phone is
significant. Digital cameras. You've got the picture. You aren't going to be able
to use these things for a large number of applications in which you would like to
use them.
>>: I'm curious, if you sort of artificially prevented it from getting large, what the
hit in performance is, whether you needed that size or if you can still --
>> Rich Caruana: We did try that. And there is work on taking ensembles and
trying to prune them to sort of make them smaller. And we never had that much
success. And we think the reason why is the diversity is really important to why
the ensemble works so much better than the individual models.
And as we started to sort of -- we can cut them in size by half and get ultimately
the same performance. But half isn't enough to make a difference. We can't cut
it in size by a factor of 10 or 100 and keep the same performance, because the
diversity seems to be so important.
Okay. So this really is a nasty problem. All these applications are sort of out the
window. In some sense you might say this is Breiman's constant come back to
haunt us, right? It's true. If you want low error, you're going to pay the
complexity price and just too bad. You can't have low error on these
applications. It's just the way it is.
Well, that would be a sad day. If that was the answer, then I would stop now and
take questions. So we're going to have an approach to this which is model
compression. And what we're going to do is you train that model any way you
want. We'll use ensemble selection. We'll train a simpler model to mimic that
complex model.
The way we're going to do that is a trick that people have used for other reasons.
We're going to pass a bunch of unlabeled data through the complex ensemble
and collect its predictions. It's unlabeled data. We don't know what the real
answer is. We get the prediction from this ensemble because we want to mimic
that complex model.
We let this thing label our unlabeled data. We now have this very large
labeled data set, and now we're going to use that large labeled data set to
train what is hopefully going to be a smaller sort of mimicked copy cat model. If
we're successful with this thing it will look just like the function that was learned
by the complex thing but it will do it in a smaller package.
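A minimal sketch of that recipe follows, under the assumption that the teacher ensemble exposes a function returning P(y=1|x); the mimic model class and its parameters here are my illustration (the talk's own experiments use a one-hidden-layer net trained with backprop), not the original code.

```python
# Model compression: label unlabeled (or synthetic) points with the big
# ensemble, then fit a small mimic model to those continuous targets.
from sklearn.neural_network import MLPRegressor

def compress(ensemble_predict_proba, X_unlabeled, hidden_units=64):
    # 1. Label the unlabeled data with the teacher ensemble.
    soft_targets = ensemble_predict_proba(X_unlabeled)   # P(y=1|x) in [0, 1]
    # 2. Train a small mimic model on (x, teacher's prediction) pairs.
    mimic = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=2000)
    mimic.fit(X_unlabeled, soft_targets)
    return mimic   # deploy this in place of the gigabyte-sized ensemble
```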
So that's the game. Let me just show you right away one of the results to show
you that this is possible and makes sense. This is squared error. So down is good.
This is the number of hidden units. We're using a neural net as the mimic model
here. The neural net is trying to mimic the ensemble. This is the number of hidden
units in the neural net. We've done the trick I just described.
We've taken a bunch of unlabeled data and labeled it with the ensemble. The
performance of the ensemble is this squared error. It's very good. It's an excellent
model. The performance of the best neural net we could train on the original
data was this good.
It's not a bad model, but it's not that good. That's the model we want to deploy.
That's the best neural net we knew how to train. This is the best neural net we
can train using this sort of compression trick. And you notice we get really close
to the performance of the ensemble with -- this is a log scale with a neural net
that has sort of 32, 64. It's a modest sized neural net. It's not a billion hidden
units down here.
This thing is nowhere near the complexity of the ensemble. Depends on how
you measure complexity now.
>>: Can you give us a sense? Is it a thousand times lower than (inaudible)
what's the gap in the complexity?
>> Rich Caruana: It's several thousand times faster, and several thousand times
smaller. Yeah. So I'll show you some tables of those results.
>>: Were that the case -- was it as complex as the one you showed?
>> Rich Caruana: This is actually an average over several data sets. I wasn't
going to tell you that. But it's actually an average over several data sets. So
there's no one ensemble.
>>: Is it a consistent lift?
>> Rich Caruana: No, in fact, I'll show you some results where that doesn't
happen. No, it's a very good question. And if we had enough synthetic data, you
would probably never see a lift because the neural net could never overfit.
>>: But you could (inaudible).
>> Rich Caruana: Exactly. It just takes time. It just takes time to make the data
and then time to train the neural net. So we're always trying to sort of do this
game with as little data as possible.
>>: (Inaudible).
>> Rich Caruana: No, no, you're absolutely right. We'll buy the extra cluster and
spend the extra month. I agree completely.
>>: You're trying to mimic, trying to mimic the dirty outputs or (inaudible) as well.
>> Rich Caruana: Good question. Just the output. I should tell you my personal
approach to this -- this is still all Boolean classification. My personal approach is
always to predict the probability of something being in the class, so I have
continuous numbers coming out of these models.
We're actually trying to mimic the continuous numbers. Yeah. So, thanks.
So that's a preview of things to come. That's where we're going. Now, we'll just
talk about this process and when it works and doesn't work and how to make it
better.
Okay. Why neural net? Well, we tried it with decision trees. I've seen some
large decision trees in my day. But never this big. They had to be humungously
large before with high fidelity they could model the ensemble. We had to keep
modifying the code so they could keep training bigger trees. You have another
problem, recursive partitioning keeps running out of data. You end up having to
have large synthetic data sets so you can grow very large trees in the first place.
So it's painful. It's not very good compression because the tree is huge. You
need a huge amount of data to do it. It wasn't a win in our experiment so far.
Support vector machines are a little more promising, and on some problems it
turns out you can get a reasonable performing support vector machine.
On other problems, though, the number of support vectors you needed in order
to get high performance, just sort of went exponential. We just needed some
huge number. Now, you could use a trick like Chris Burges's trick way back
in '96 for basically pruning away the least important support vectors and trying to
minimize that.
We haven't done that. So, hey, everybody knows that K nearest neighbor is Bayes
optimal if the training set is large enough. No reason to think it wouldn't work well
here. The problem is that the training set would have to be extremely large. But
there might be tricks with good kernels and clever data structures and pruning of
the training set, things that people have all done research on, that might make
this feasible in some applications.
But it's not going to be feasible sort of simple right out of the box. You'll have to
do a lot of work, I think, to make it feasible. The neural nets, we just tried
them and it was like, wow. We didn't have
to work to make them work.
So we stuck with the neural nets. It doesn't mean that we would ignore these
other things. There are times as you're going to see every now and then the
neural net has trouble and we'd have to use something else.
>>: Did you take any care with the input density to match the problems you
originally trained on and how sensitive was it?
>> Rich Caruana: Yes. Yes. So that's exactly where we're going.
>>: These are very high dimensional?
>> Rich Caruana: This is all the 20 to 200 dimensional problems. I don't know
how well -- our method is going to depend on density estimation to create the
synthetic data. And I don't know how well our method would do in very high
dimensions since the world really changes up there.
>>: Did you ever compare this to like the deeper (inaudible).
>> Rich Caruana: No, I have that in the future work slide that I know we'll never
get to. [Laughter] but we haven't done that. And it's definitely a work in
progress. I see this as being halfway there. We've got that first really promising
proof of concept. It's actually already useful for a whole number of things. But
there's a whole lot of additional work you'd like to do to say you really understand
it and to know when it's best and to know when you should do something else.
>>: Do you think it's critical that you're only asking for one of these parameters
that you're using?
>> Rich Caruana: No, we (inaudible).
>>: Okay. So this is a generic, you don't even know the classification, it could be
this type of thing?
>> Rich Caruana: Can be used to mimic any function.
>>: Or anything in the vendor space? (Phonetic).
>> Rich Caruana: Conceivably. Yeah. I think it's completely generic. There's
nothing we've done and there's no result we've had that makes me believe it's
sensitive to any of those details.
>>: Does the previous graph really say that the whole thing was just a question
of how much data you had to train in the first place and nothing about the
diversity --
>> Rich Caruana: I think this is a good interpretation; that if we had had a million
labeled points to begin with, we would have gotten a neural net this good or even
better in the first place. But we never had that data to begin with. And for some
reason backprop and all the variations that we've tried aren't capable themselves
of learning this good of a model from the limited training set. So we have to go
through this really awkward neural net training algorithm, which first trains a
whole bunch of other things and then builds an ensemble and then hallucinates
some synthetic data, labels it with that ensemble and trains the neural net.
Turns out that neural net training algorithm is quite effective.
>>: If you had the million data points, do you still think you would get a lift with the
ensemble -- would it be 10 million?
>> Rich Caruana: That's a good question. So one thing we don't know is how
much extra lift we would get from these ensembles if we were really in the data
rich regime. And there's a chance that just good old boosting, which is such a
powerful learning method, if you can feed in enough data to prevent overfitting,
there's a chance that good old boosting is going to hit the asymptote and nothing
is really going to be able to best that. That's what I believe. But I don't know that
for a fact.
>>: There's nothing in the method that says you must use synthetic, right?
>> Rich Caruana: Right, right. It turns out if you're in a world where you have --
I'll show you some results which show that when you've got that, you're golden.
Using the synthetic data hurts you. The goal is to make it not hurt you so
much that it makes it impossible.
>>: You can view the (inaudible) themselves as the output (inaudible).
>> Rich Caruana: Absolutely.
>>: And just kind of --
>> Rich Caruana: That's actually right. I mean all we're doing is we're just
training the neural net on a large training set. Now, the large training set
happened to come from this other model whose performance we really envy
(chuckling) so we did this sort of complex machinations to sort of extensionally
represent the function and capture it in this training set and now we're just
training a neural net to do it.
And that's it. Anything that will give you that large data set. I mean, anything that
will give you a large, good training set, you'll be able to train a good model out of
it. And it probably wouldn't have to be a neural net.
So these are all good questions. And in fact I'll skip through slides later on,
because we're hitting all the points now, which is perfectly fine.
Okay. So neural nets are good. We're getting surprisingly good compression
with the neural nets. I'll show you some results. And the execution cost is low.
You know, these are all one hidden layer neural nets. It's just like a matrix
multiply. It can be parallelized. They were building hardware to do this a decade
ago. So you can make this as fast as you want. It could be a small part of your
chip in your cell phone. It's trivial stuff. It's expensive to train the nets because
they'll train on maybe hundreds of thousands or millions of points. And any
intelligibility you had, we probably lost it when we put it into a neural net. The
truth is if you started with ensemble selection, well, you didn't have intelligibility to
begin with?
>>: (Inaudible) to train with.
>> Rich Caruana: I think the stuff only goes up to 400 K. But there's clear
evidence that we need to be going over a million on some problems.
>>: Small parts --
>> Rich Caruana: Say it again?
>>: For training.
>> Rich Caruana: So we get pretty good results, let's say 500,000. But for some
problems you'll need 5 million.
>>: How long does it take to train?
>> Rich Caruana: Oh, it took us -- we tend to use good old slow back prop as
opposed to faster second order methods because sometimes we don't get as
good results with them. They seem to more likely overfit.
So it's definitely days. And in some cases weeks. But it depends on the
dimensionality of the problem and how hard it is to work.
>>: What's the size -- this is a 200 (inaudible).
>> Rich Caruana: It's interesting. I'll show you some results where there's some
problems where the eight hidden unit network just does it. There are other problems
where you need hundreds of hidden units and they're the ones that are harder to
learn. It's very interesting.
I'll show you the results. If I don't satisfy you, definitely come back and ask me
again. And the neural nets, some applications like web, real time, they may still
not be fast enough. It could be that logistic regression and its cousins are going to
sort of dominate in the web world for a long time, just for many reasons.
We've got this new problem which we've already talked about, which is where does
the unlabeled data come from, if you've got tons of unlabeled data as you tend to
have in text applications, web applications, image applications, that's great.
I'll show you some results that show you that's the best thing you could have. But
often you don't have it. Often all you've got is that original labeled training set,
you know, which in our case is 5,000 points and you've just got to do the whole
process with that.
So you're going to have to hallucinate some data. And it's critical to hallucinate
that data well, right? You've got some manifold in high dimensional space. You
really want a sample from P of X. You want your samples to be on that manifold
because you don't want this compression model learning anything about the function
off the manifold. You don't want to distract it. And you don't want to waste your
samples off the manifold.
Right? If you double the size of the manifold in each dimension, in a high
dimensional space, the vast majority of samples will not even be on the manifold
of interest. And you've just wasted 99.999 percent of your samples. You can't do that.
You have to stay true to the manifold as much as possible. What we'd like is true
to the manifold plus epsilon. Because that's better than being true to the
manifold minus epsilon, because then you miss the edge of the manifold, and the
edge of the manifold could be important.
That's what we're looking for and this all gets harder as dimensionality increases.
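To get a feel for how fast off-manifold sampling wastes points, here is a toy calculation; it assumes the on-manifold region is a unit cube and the sampling box is doubled in every dimension, which is only a caricature of the real situation.

    import numpy as np

    # If the data lives in [0, 1]^d and we sample uniformly from a box doubled
    # in each dimension, [-0.5, 1.5]^d, the chance a sample lands on the data's
    # region is (1/2)^d, so in high dimensions almost every sample is wasted.
    rng = np.random.default_rng(0)
    for d in (2, 10, 20, 50):
        X = rng.uniform(-0.5, 1.5, size=(200_000, d))
        inside = np.all((X >= 0.0) & (X <= 1.0), axis=1).mean()
        print(d, round(inside, 6), 0.5 ** d)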
So what are we going to do? We're going to just look at three methods here.
One is the simple straw man, just univariate modeling of the density. That's not
going to work well, as you know.
>>: (Inaudible) might be incredibly hard for ensembles, especially, what if you
tried to sample just around where the actual boundaries were?
>> Rich Caruana: Right. Some sort of active learning almost, where you're
trying to focus on the classification region? So it wouldn't work for regression
and it might or might not work if you're trying to predict probabilities but it might
be great for classification.
>>: (Inaudible).
>> Rich Caruana: No, there are times when classification is all that counts.
>>: (Inaudible).
>> Rich Caruana: I tend to like them. But it doesn't mean you have to have
them in all applications. So it would be fun to talk about.
>>: At the end of the talk you're talking about a bunch of different (inaudible)
then you're talking about one data set, right? Because you have one feature
(inaudible) for different data sets? So which one are we talking about?
>> Rich Caruana: Actually, I'm going to average over eight different data sets,
which are a subset of the data sets from the earlier part of the talk, which you
didn't see anyway. So we're going to show you some results for eight different data
sets.
In some cases I'll just average over all of them to give you the big picture. In a
few cases I'll drill down and show you what happens with specific data sets. All
right. So first I'll look at the straw man approach, then Lowd and Domingos'
approach, naive Bayes estimation, I'll describe that, and then our approach, munging.
So let's say that this is a representation of the
true distribution that we would like to sample from. So think of that as an infinite
sample of this nice 2-D distribution. And we'd like to be able to get samples like
this. So that's just a subsample of this, and that's the kind of thing we need to be
able to generate.
Okay. One way to do it, since this is just 2-D, is to model P of X 1 and P of X 2
individually. Maybe model them with a Gaussian, or use your favorite method; it
could be piecewise or a spline, any way you want. Just model them separately,
so we'll lose the conditional structure of the data, and let's see how that works.
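As a concrete version of that straw man, here is a minimal sketch that fits each column independently and samples each column on its own; the Gaussian fit is just one choice standing in for whatever univariate estimator you prefer.

    import numpy as np

    def univariate_samples(X, n_samples, seed=0):
        # Straw man: model each column independently by its own mean and
        # standard deviation and sample every column separately. All
        # conditional structure between the columns is thrown away.
        rng = np.random.default_rng(seed)
        mu, sd = X.mean(axis=0), X.std(axis=0)
        return rng.normal(mu, sd, size=(n_samples, X.shape[1]))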
Okay. Of course, it doesn't work. It doesn't know anything about the hole in the
middle. It doesn't know anything about the rounded-off edges. Of course, it
generates bad samples. Now, it does cover the space. Maybe that's sufficient.
It could be more efficient, but maybe covering the space is okay, because
covering the space is the most important thing. If you don't cover the space
you're just in trouble with compression.
>>: There's an issue, though: the classifier performs well, but you have no idea
what it's going to do off in the wacky space.
>> Rich Caruana: The silly thing is our goal is to mimic the thing. Who cares about
mimicking it in the regions where it didn't learn anything interesting and will
never see a test case? We only want to mimic it on the manifold, because that's
the only place where it actually did anything interesting in the first place.
And we want to do that also because if you ask a model to learn about a whole
bunch of other regions, it takes more, right, it takes more hidden units or
whatever to do it. And it's a harder learning problem. It has to learn about a
much larger part of the space.
Most parts of which are probably just sort of crazy. It's only on the manifold
where it makes sense, right? It might just saturate if you're lucky in the off
manifold areas which would be at least easier to learn or it might just do crazy,
who knows, polynomial curve fitting with too many degrees of freedom.
>>: One could argue that if you have data from everywhere, then in order for a
model to replicate it, the model would have to have the complexity of the ensemble.
>> Rich Caruana: Right, right. It wouldn't necessarily have to have the
complexity of the ensemble. I think one of the reasons why these averages over
many models are so effective in machine learning is because they actually yield
a regularized model. They actually yield something simpler.
The function at least is simpler even though our way of representing it is more
complex. It just tells you about a weakness we still have in machine learning: that
we don't know how to fix that problem other than by averaging a whole bunch of
things.
It depends on what we mean by complexity. Okay. So this doesn't work so well.
Clearly what happens is we lose all the conditional structure of the data. But who
knows, maybe it will still work for compression.
Our goal isn't really to generate good samples. Our goal is to do compression.
Maybe this will do it. So here's the experimental setup. We have eight problems.
This is their dimensionality, 20 to 200. Most of them are fairly balanced. Only
one is imbalanced; it has a 3% positive class. Training set size is always
around 3,000 to 5,000 points. We have test sets. Don't worry about it. I just need
to tell you that. Here's the first of a number of graphs.
This is squared error. This is the performance of ensemble selection. And this is
the average over those eight problems. So you're going to be seeing an average
graph. That's the target we want to hit ultimately.
This is the performance of the best single models for those eight problems.
Remember, that is actually state of the art performance. You rarely have that
performance yourself unless you've trained 7500 models and picked the best.
That is truly excellent performance. You could be happy with that.
This is just even better, right? If you can get this, that's great. This is the
performance of the best neural nets that we could train on the original data for
those eight problems.
So that's sort of hopefully we can beat that. All right. So this is what happens.
And we're varying the size of this hallucinated synthetic data set that we've
labeled with the ensemble. We're varying it from 4K to 400K on the bottom here.
And I think you can see the performance -- I'll explain why we start at 4K in a
second. Performance starts sort of comparable to a neural net. It never really
does much better than that. It actually does worse in the long run.
Why does that happen? God was nice enough to give us 4,000 labeled points.
We would be crazy not to include them in the training set. Since ultimately they
are the function we were always interested in learning, not the ensemble. So we
always start with the 4,000 labeled points that we had in the training set for the
compression model. It just would be stupid not to do that.
And in this case what happens is eventually you've hallucinated enough bad data
and distracted the neural net enough that the 4,000 points get ganged up upon.
And the model's just no longer able to concentrate even on that function, and it
actually does worse than not having done this at all. So random data is not going
to work well.
So obviously --
>>: I would expect it would have to be worse the higher the dimension.
>> Rich Caruana: Right, worse. It's an average over all those eight problems. If
we went to 2,000 dimensions, you're right, it would just be terrible.
Okay. So we have to estimate the density. So let's look at this naive Bayes
estimation algorithm. I'll run out of time, so let me do this very succinctly. It's a
very cool method. It wants to use naive Bayes to estimate density. We all know
naive Bayes is too simple a model for anything complex. So they do this
clever thing. They keep carving the space into smaller and smaller sub regions.
Every time they carve it, they try naive Bayes in that sub region. If naive Bayes
looks accurate enough in that sub region, they don't carve it up anymore. They
keep carving the space until each sub region looks like it can be adequately
modeled by something as simple as naive Bayes. Now we have this sort of
piecewise model. It's a different naive Bayes estimator in each of these different sub
regions. Some places get tiled with very small sub regions. Some places have
less conditional structure, and they get very large sub regions. It's all dynamic.
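Just to make the carving idea concrete, here is a rough sketch of that kind of recursive partitioning; it is only an illustration of the idea as described here, not Lowd and Domingos' published algorithm, and the stopping test (a crude pairwise-correlation check) and the per-region independent Gaussians are simplifying assumptions.

    import numpy as np

    def carve(X, depth=0, max_depth=8, min_pts=50):
        # If the columns in this region already look nearly independent
        # (crudely: low pairwise correlation), keep a per-column model here;
        # otherwise split the region at the median of its widest dimension.
        node = {"n": len(X), "mu": X.mean(0), "sd": X.std(0) + 1e-9, "leaf": True}
        if depth >= max_depth or len(X) < min_pts or _max_abs_corr(X) < 0.2:
            return node
        dim = int(np.argmax(X.max(0) - X.min(0)))
        cut = float(np.median(X[:, dim]))
        left, right = X[X[:, dim] <= cut], X[X[:, dim] > cut]
        if len(left) and len(right):
            node.update(leaf=False, kids=[carve(left, depth + 1, max_depth, min_pts),
                                          carve(right, depth + 1, max_depth, min_pts)])
        return node

    def _max_abs_corr(X):
        C = np.atleast_2d(np.corrcoef(X, rowvar=False))
        np.fill_diagonal(C, 0.0)
        return float(np.nanmax(np.abs(C)))

    def sample(node, n, rng):
        # Leaves draw each column independently; internal nodes send samples
        # to their two halves in proportion to the training counts.
        if node["leaf"]:
            return rng.normal(node["mu"], node["sd"], size=(n, node["mu"].size))
        n_left = rng.binomial(n, node["kids"][0]["n"] / node["n"])
        return np.vstack([sample(node["kids"][0], n_left, rng),
                          sample(node["kids"][1], n - n_left, rng)])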
>>: It's for density estimation.
>> Rich Caruana: It's for density estimation.
>>: They did it years ago for classification.
>> Rich Caruana: Yes, yes. This is for density estimation. And it's kind of
clever. It's efficient, which is nice. And they did it also because they needed to
generate synthetic data. So, in fact, it was just very natural for us to take their
method and try it because their goal was -- they weren't doing compression the
same way we were doing it. They were trying to do something different. But ultimately they
needed to generate synthetic data. This is the method they came up with. It's a
very appealing method. We tried it. We could get their code. It not only did the
density estimation but it generated samples. Perfect. So we really got to use
their method.
That's what we get. It's not bad. You can clearly see the ring. You can kind of
see there's almost decision tree-like artifacts where they've carved the space up
with axis parallel splits and things like that. So you can almost see the way the
space is carved. But it's not bad.
Okay. So the two problems with it are, first, there are some artifacts, and those artifacts would
get nastier in high dimensions. 2-D is a nice world. And second, there are some points that are
off manifold: points in the middle, a lot of points out here, where those points
shouldn't be. But it's not bad.
So we were hopeful this might work well. So let's try it. So now we
have -- this is the random thing. This is the performance we get now when we
use this naive Bayes estimator. And it's not bad. So we have the neural net now
trained with 25,000 points, 4,000 of which are the original training set, and it is
now doing as well as the best single model that could be trained.
So this neural net is now competing head to head with the best SVM, the best
boosted trees, the best anything that could be trained. So that's actually good.
That's already perhaps a useful result often enough. However, it does have
this behavior where obviously we're not true enough to P of X, and eventually it
sort of hurts on average.
It's not like that for every problem. Some problems keep coming down. For
other problems it goes up. I'll show you some of those results if I have time. So
it's not bad. It's promising but we wanted to do better. We had hoped for more
from this method.
>>: That's the estimator sort of hallucinating in areas that don't matter anymore.
It seems as you increase the number of hidden units in the neural net, maybe that
curve sort of goes up later.
>> Rich Caruana: So we have tried varying the number of hidden units here, and
it does make a difference, but it's not as big of a difference as you would have
thought. And that's because we do early stopping and regularization on the
neural nets. So they're not going -- even when their capacity is large they're not
doing insane things because we have validation sets.
I mean, it's nice we can have as large a validation set as we can afford to label in
this world. So data is suddenly available to us in copious quantities because
we're making it up.
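Since the recipe keeps coming up, here is a minimal sketch of the overall compression loop: hallucinate unlabeled points, label them with the ensemble, keep the original labeled points, and fit a single one-hidden-layer net with early stopping. The ensemble_predict and hallucinate callables are placeholders, and scikit-learn's MLPRegressor is used only for brevity; the actual work used plain back-prop nets with their own validation-set early stopping.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    def compress(ensemble_predict, hallucinate, X_labeled, y_labeled, n_synth):
        # Hallucinate points near the training distribution, label them with
        # the ensemble's predicted values, and train a compact mimic net on
        # squared loss over the original plus synthetic data.
        X_synth = hallucinate(X_labeled, n_synth)
        y_synth = ensemble_predict(X_synth)            # pseudo labels from the target model
        X = np.vstack([X_labeled, X_synth])
        y = np.concatenate([y_labeled, y_synth])
        mimic = MLPRegressor(hidden_layer_sizes=(128,), early_stopping=True,
                             validation_fraction=0.05, max_iter=500)
        return mimic.fit(X, y)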
So it's a nice world to be in. Okay. So we developed our own thing. It's called
munging. And if you look in a dictionary, munging will be defined as imperfectly
transformed information or to modify data in such a way that you can't describe
what you did succinctly. That's exactly what we wanted.
No, it turns out this is, after the fact, a good description of our algorithm. It's not
that we wanted those properties. So here's the algorithm. Somebody gives
you two parameters, P and S, which I'll tell you about, and some labeled
data, the original training set, and then it's going to return
unlabeled data.
Here's what you do. You make a copy of the labeled data. And now you walk
through all the cases in your data set. One at a time. You find the nearest
neighbor for each case using Euclidean distance, like k nearest neighbor. Find the
nearest neighbor for each case. If you two would stand up -- no, you're fine. So
you're two cases. You're each other's nearest neighbor. You're the first case I'm
looking at, and I find you're the nearest neighbor of this case. Now with
probability P, we're going to take your attributes and maybe swap them.
So we're going to swap your hair color. We're going to throw a coin. If the coin
lands heads, we'd swap your hair color. If the coin lands heads again, we're
going to swap your shoe size and we're going to sort of mix and match you a little
bit that way. For continuous attributes like heights, we'll do something a little
more complex.
We don't want the heights -- maybe the original set only has 50 unique heights in
it. We don't want to be restricted to those 50 unique heights. So in fact what
we'll do is take your height and your height. We'll throw a little Gaussian around
them; that's where this parameter S comes in. It controls the width of the Gaussian we throw
around your two points, and now we'll draw two new heights from this Gaussian
and we'll replace each of your heights with those draws. But that's it.
So we just do that for all of your attributes, and what that does is it creates new
guys like you but where we've changed some things.
And I hate to say the word genetic algorithm, but it is kind of like genetic
algorithms except the nearest neighbor is a critical step here. And you don't see
that in GAs.
>>: This is a joint question Janet and I raised yesterday. But why not just jitter
the data and just sample a Gaussian around each data point?
>> Rich Caruana: So we have tried that. The biggest difficulty is you have to
model -- you have to come up with an intelligent model of the jitter, and the jitter
model also has to be conditional.
If you just do it univariate, I don't know that it's going to work.
>>: What I was going to say, couldn't you use the neighbor distribution.
>> Rich Caruana: Oh, I see, I see. That might work. That might work. I think
you'll still have to come up with these parameters. But you might be able to do it.
No, no, that's interesting.
>>: Kind of like John's -- they give you the simplex sampling thing that you had.
>> Rich Caruana: That's interesting.
>>: Sampling along that simplex.
>> Rich Caruana: That would be interesting to try that.
>>: Now that I've switched that, do you (inaudible) at the same level?
>> Rich Caruana: No. Because we actually don't know what your label should
be. And since it's a probability, we hope it will have changed.
>>: But in those cases where I have checked munging --
>>: You're ignoring the label.
>> Rich Caruana: You are ignoring the label. We have tried taking the label into
account to make this process better. And every time it's either hurt us or not
helped us.
>>: That seems like it should be the case, because you wait until you actually run
the -- you're going to run the ensemble on that data. The ensemble has decided
where the boundaries actually are, and you want to maintain its version of where
that is.
>> Rich Caruana: This is just our way of sampling from the ensemble. Plus you
know maybe you were class zero and you were class one and now that I've
mixed you it's not clear what you should be.
And there are weird things about these probabilities. Like it turns out if we do
swapping with probability one, that's the same as doing swapping with probability
zero, except for the continuous attributes. So values between zero and .5 are
what make sense. We use .2. So anyway we do this.
We pass through the whole data set and we do it and it comes up with a new
data set.
We use about .2 for that parameter. The variance parameter is not too critical.
Don't worry about it. You can also modify this algorithm in interesting ways. You
could do a bootstrap sample for example from the training set so that every time
we have to generate more data your nearest neighbor isn't always the same.
Because maybe you won't be there the next time and your nearest neighbor --
there are lots of variations of this. But I'll just show you results for one variation
with one parameter setting.
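To make the pass concrete, here is a minimal sketch of one munging pass over a data set of continuous attributes. It is a simplification: discrete attributes would be swapped outright, and the symmetric half of the swap (also perturbing the neighbor) is omitted for brevity. The spread for a perturbed value is taken as the absolute difference between the two neighbors' values divided by the parameter s, matching how the parameter is described a bit later, and p around .2 is the setting mentioned in the talk.

    import numpy as np

    def munge(X, p=0.2, s=1.0, seed=0):
        # One pass: pair each case with its Euclidean nearest neighbor, and
        # with probability p per attribute replace the case's value with a
        # draw from a Gaussian centered on the neighbor's value, with spread
        # |a - b| / s. Swapping only among nearest neighbors is what keeps
        # the conditional structure of the data roughly intact.
        rng = np.random.default_rng(seed)
        X = np.asarray(X, dtype=float)
        X_new = X.copy()
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                       # don't pick the point itself
            j = int(np.argmin(d))               # nearest neighbor
            swap = rng.random(X.shape[1]) < p   # which attributes to perturb
            sd = np.abs(X[i] - X[j]) / s
            X_new[i, swap] = rng.normal(X[j], sd)[swap]
        return X_new

    # Calling munge repeatedly and stacking the results gives as much synthetic,
    # still-unlabeled data as you want; the ensemble then provides the labels.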
So let me walk you through the algorithm again here, but I think you get the idea.
It's a blowup of a small region. This is the point that we're picking. We find its
nearest neighbor through a distance calculation. That's its nearest neighbor. Now
what we do is flip a coin and swap some attributes; for continuous values we
hallucinate some similar values.
And we end up generating these two points from those two original points. And
we just go through the whole process and we keep doing that, and let me just skip ahead,
because you guys know how this works.
>>: Does the successor markers lie on the density of the original data?
>> Rich Caruana: Yes. And so especially in high dimensions, simple things like
Euclidean distance probably aren't going to be adequate. In some sense you
have low density in high dimensions unless the world is very nice to you. So I'm
sure this process is going to break down in very high dimensions. It's doing well,
though, up to 200 dimensions in our test problems so far.
It's not critically sensitive to this. But that is an important issue. You can use
another kernel if you want for finding who the neighbors are. If you
have some smart, even learned, kernel, you can use that for your neighborhood
function as opposed to using something simple like we're doing, to try to mitigate
this high dimensionality again.
Remember, that's the kind of sample we're hoping for. That's the kind of sample
we get.
>>: How sensitive was it to those parameters?
>> Rich Caruana: It's not that sensitive. We find the probability of swapping .1 to
.3 works pretty well. I think we're using .2 in these experiments. And the
standard deviation for the continuous attribute, the only problem is if we set that
too large, then it turns out that you really do expand the manifold too much. But
there's not too much penalty for setting it too small. You just then don't
hallucinate as many new values.
>>: You impose the variance, you just don't measure the variance between the
distance of two samples.
>> Rich Caruana: I didn't go through the formula. Actually what we do is we have
two samples. We calculate the standard deviation from the two samples and
then we divide by our control parameter S, which just lets us ramp up or down
the standard deviation and say only use half of the standard deviation or use
three times the standard deviation.
So it just prevents things from going too far into the tails and then getting off
manifold. But it could be sensitive on some problems. But it hasn't been on the
limited number of problems we've looked at.
It does pretty well. So those are samples generated by the munging procedure.
You can see we get an occasional thread-like structure. You can see in some
cases they are sort of where excursions existed in the real data, but not
always. So interesting things happen, but it's pretty good.
So just to compare, this is univariate random which is not very good. This is
naive Bayes estimation which is pretty nice and that's what we're getting. I think
you can see that's definitely better.
And this is just plotting them on top of each other. This is the goal. This is the
real density. Remember, we want the real density plus epsilon. And I think you
can see there's a little blue for munging sort of sticking out on the edges and then
the green is from naive Bayes estimation. That's sticking out even more and it's
off the sample, and then the red, of course, we never thought that was going to be
good.
The question is does it work? Okay. So it's a deceptively simple algorithm but
it's pretty effective. In fact, I came up with this algorithm originally to do privacy
preserving data mining.
So we had real medical data. I wanted people to do machine learning on it
without screwing up the data, preserving the real properties of the data, but I
couldn't give them the real data. So we did munging on the data to sort of
obfuscate it and it worked reasonably well.
It never explicitly models P of X, unlike other methods that you might imagine; it's
more that the data is the model, as in k nearest neighbor. And a critical thing is this
nearest neighbor step. That's what preserves the conditional structure. The fact
that we're only swapping values among nearest neighbors is what means the
conditional structure tends to be preserved.
If you didn't do that -- if we just swapped values between things that were very
different, all the conditional structure would be gone. So that's the important
thing.
Well, how well does it work? I need to wrap up here. That's the graph we saw
before. And that's the graph we get with munging. So I think you can see that on
average across these problems, as we start getting out to a couple 100,000
pseudo labeled data points, we're actually doing quite well. We're doing better
than the best single model, and we're approaching the performance of that super
ensemble. So we're getting there.
That's very nice. Let me just walk you through some graphs. This is one
particular problem. Random doesn't do very well. Naive Bayes estimation is not
so bad, but we do much better -- don't you have to leave, too, now?
[Laughter]
But you can see that munging is doing very well. By the time you get out to a
couple 100,000 points, in fact it's equal in performance to the target ensemble.
>>: What's P1?
>> Rich Caruana: I'll tell you about this in a second. Yeah, good. Here's a
different data set. This is really, really nice. You love it when this happens. I
mean munging is working very well. It's better than the ensemble. We are actually
training a neural net that is somewhat better than the target ensemble. That's
great. I mean the neural net is sort of an additional regularizer on top of the
ensemble. So you shouldn't be surprised. It could be this ensemble wasn't so
big in the first place. It might have had some rough edges and maybe the neural
net is really doing a good job --
>>: The ensemble is trained on the training set of (inaudible), I guess, with the labels?
>> Rich Caruana: Right. So the ensembles train on the original 4,000 labeled
points and then this is trained on that 4,000 plus a certain amount of hallucinated
points. This is great. Sometimes this happens. And that's wonderful. Now it's
really a super, super neural net. It's better than even this other thing, which is
really good.
>>: What if you get one ensemble in a smaller training set and use that
ensemble to label more hallucinated data points and build another ensemble.
>> Rich Caruana: Oh, oh, that's interesting. Try to lift yourself up by your own
ensemble straps or something. No, that's very interesting.
Wow. I'm having trouble getting my head around what that would do. My guess
is it would eventually hallucinate and go south. But there is a chance that it
would just keep self-regularizing in an interesting way.
I have to think about that one.
>>: I think that's triple I or W -- comes down to something very similar. For small
data sets from boost increase.
>> Rich Caruana: That's right. So this is University of Texas, Ray Mooney
and --
>>: (Inaudible).
>> Rich Caruana: Thank you. I think I have it in the related work. There they
use random data. They don't use a sophisticated way of generating it. But
they generate this extra data to create diverse ensembles and take the average
of the more diverse ensembles, so the random data has forced diversity, and that
extra diversity pays off when you make the super ensemble, and they get better
performance.
There's a chance this would have -- I like that question.
>>: It sounds like the successive munging seems to -- I mean if you were to try
to prove something about the success rate, it would probably involve
assuming that the space the data comes from can be well modeled by
locally smooth rectangular patches.
>> Rich Caruana: Right.
>>: Basically the probability distribution is kind of locally exchangeable.
>> Rich Caruana: Uh-huh, I think you're absolutely right. Fernando Pereira saw
this on a poster a couple of years ago. He said it's really interesting.
It seems to work really well, but, man, you don't know why it works. What could
you prove about it?
And that's one of our future works is it would be nice to have some idea of under
what conditions is this most likely to work. And under what conditions would it
fail.
>>: (Inaudible).
>> Rich Caruana: If nearest neighbor is really -- hard to say. Because our goal.
>>: Interleaving two classes.
>> Rich Caruana: Yeah. As long as the ensemble has learned the two classes
our goal is ultimately just to mimic the ensemble. So as long as the ensemble
could do the two classes it's okay if the density estimation is not perfect. It
doesn't depend -- we want the density estimation to be as good as possible
because then everything else is more efficient and we get better compression.
But it doesn't have to be perfect. The ensemble, though, has to be as perfect as
possible, because it's ultimately the target. Okay. I better wrap up. So let me --
anyway, sometimes we actually do better. These are average results across problems by
number of hidden units.
I think you can see we do sort of need a modest number of hidden units on
average, 128 or 256, to be able to do this compression. Here's what we see if we look at these
problems again. And this is where I'll tell you what these are.
Letter P 1, it turns out 8 hidden units is almost enough. Letter P 2, it turns out
128 still isn't enough. You need a lot.
What's the difference between these? Letter P 1, we're just trying to distinguish
O from any other letter in the alphabet. Letter P 2 is a much harder problem,
distinguish the first half of the alphabet from the second half of the alphabet.
That's a hard problem.
>>: Is this OCR, for people that don't know?
>> Rich Caruana: This is UC Irvine. This is a much harder problem. And
interestingly, this thing tells us, oh, I need a much bigger neural net to learn the
function that the ensemble has learned.
So that's kind of interesting. This might suggest that there's a way here of
coming up with some crude measure of the intrinsic complexity of a function or of
a data set by sort of just saying, oh it was a 10 hidden unit problem. It wasn't an
issue. Notice, you couldn't tell it was a 10 hidden unit problem using the original
data, because there you have all sorts of overfitting issues, and sometimes big nets
overfit less, and that's complex. But here you can talk about overfitting in a
controlled way because you have this sort of very large data set.
So that's kind of interesting. Doesn't always work perfectly. And then I will stop.
Here's the tree cover type problem. You can see that even with 128 hidden
units, although we're going downhill, we need a much bigger network. The
compression is not going to be as good on this problem if we need a thousand or
2,000 hidden units. It's no longer a tiny model.
>>: The first two classes? The two big classes?
>> Rich Caruana: I think we took the big class versus all other classes.
>>: Okay.
>> Rich Caruana: Good question. I think it's a seven-class problem. So we
converted it to binary by doing the major class versus all other classes. Thanks.
Let me just show you this. See this gray line -- this is the line you saw before for
munging for this data set. We need a lot of data. We not only need large
networks but we need 400,000, 800,000, millions of points. This gray line -- this is
a large data set. We actually can take a lot of the real labeled data, throw away
the labels, treat it as unlabeled data and then label it with the ensemble and we
can ask what would happen if we had real unlabeled data from the right
distribution.
And notice it is significantly better than if we have to create synthetic data. If
somebody gives you the real unlabeled data from the right distribution, almost
always that's going to be significantly better performance by doing that.
Now, maybe we'll eventually get to the same place by having to create 10 times
as much synthetic data. But boy if you have data from the right distribution, it's a
good thing.
And I think I should wrap up. Here's a problem that's just -- I'll end with this last
problem. This problem is annoying. This is also a UC Irvine problem. And you
can see although we do marginally better than the best neural net on this
problem, we don't do anywhere near as well as the best single
model which wasn't a neural net.
Or as well as the ensemble. Notice, there's a big gap between the best neural
net and these other things. This seems to be a problem for which neural nets do
not do well. It's like neural net hard in some sense. And that's interesting. So
we're interested in --
>>: (Inaudible).
>> Rich Caruana: Yeah, but some of the other data sets are very noisy, too.
The neural nets actually do very well.
I think it's also sparse -- the way of coding this creates pretty sparse data -- but the truth
is neural nets in very high dimensions, where the data is always sparse, are doing
quite well. I don't know what it is about this problem.
But it's interesting we may be able to find problems that are just sort of one
hidden layer neural net hard. For some reason it doesn't work well. And this is
one of those cases where you might really need to use an RBF SVM to do well on
the problem. We might have to use something else.
>>: Given that your target is random performance, though, you could try one of
every single model, right? As long as it's cued, as long as the model
performance is cued.
>> Rich Caruana: Right. You could try everything and see what gave you the
best compression for performance.
>>: Because that happened to be over the course of everything.
>> Rich Caruana: Yes, you're absolutely right. You could imagine a nice 2-D
plot, which was the compression or speed-up, and then on the other axis was the
accuracy preserved. And you could pick wherever you wanted on the trade-off
curve, right. Whoops, I didn't mean to do that.
Let me just summarize the results and then I have four seconds. Here are our
problems. This is the squared loss of the ensembles. Let's just look at the
average here, to speed this all up.
We're able to preserve on average -- so how do I measure this 97%? I take the
squared error of the best neural nets we could train and then I take the best
squared error of the ensemble, which is our target. And then I ask, when we
compress, how much of that improvement do we capture? 97% means that we
get 97% of the way from the best neural net to the ensemble model. So that's a
much stricter way of measuring than, say, taking baseline performance and
asking how much of it we capture, because there we would capture 99.99%. We're actually
starting from a good model, good neural net. We're saying it has to be much
better than that to be 97%.
So we actually capture 97% of the accuracy of the target models. That's only going out
to I think 128 hidden units. We know on one or two of the problems we would do
better with more. That's only going up to 400K training points.
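Written out, the measure described above works like this; the numbers in the example are illustrative, not taken from the tables.

    def gap_captured(err_best_net, err_ensemble, err_mimic):
        # Percent of the squared-error gap between the best ordinary neural
        # net and the ensemble that the mimic net manages to close.
        return 100.0 * (err_best_net - err_mimic) / (err_best_net - err_ensemble)

    # Illustrative numbers only: if the best plain net has squared error 0.20,
    # the ensemble 0.10, and the mimic net 0.103, the mimic captures 97%.
    print(gap_captured(0.20, 0.10, 0.103))   # -> 97.0 (approximately)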
So how about the compression? Since that's the goal. Well, here's the size of
the models in megabytes. The ensembles on average are half a gigabyte. The
neural nets that you could train just on the original data are quite tiny. The neural
nets that we train to mimic this ensemble are bigger on average, as you'd expect
them to be. This is a little bit of Breiman's constant sort of
coming in there. They're bigger on average than these humbler neural nets but
not that much bigger.
And on average we're getting a compression factor of about 2,000 of the neural
nets compared to the ensemble. But the ensemble could be anything. It doesn't
have to be our ensemble selection. It could just be any high performing model
that you have lying around.
So this comparison is almost not so interesting. What I find interesting is this
comparison. Right? Which is that you think of this as .1 and .3. By making the
neural net just three times larger than you would have naturally made it we get all
this extra performance. That's really what counts. Because who knows what
your target might have been. That's what counts. At three times the complexity we
get a great neural net out of the thing.
In terms of speed, we get similar numbers. So this is the time to classify 10K
examples, so this is the number of seconds. The ensembles are quite slow: half
a second per case.
The neural net is very speedy, two seconds to do 10,000, and you could make
that much faster if you wanted to. This is just a trivial implementation. We're just
a little bit slower because we're a little bit bigger, as you'd expect. Not that much
slower, and on average we're speeding things up by a factor of a thousand.
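Working through those round numbers as an illustration (the exact mimic-net timing is an assumption here, somewhere between the plain net's two seconds and "a little bit slower"):

    # Half a second per case for the ensemble versus a few seconds per 10,000
    # cases for the mimic net gives a speed-up around a factor of a thousand.
    ensemble_seconds = 0.5 * 10_000   # 5,000 s to classify 10K cases
    mimic_seconds = 5.0               # assumed: a bit slower than the 2 s plain net
    print(ensemble_seconds / mimic_seconds)   # -> 1000.0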
That's what we ultimately need. But, again, this is probably the more interesting
comparison.
We get all that performance by only making a neural net about two or three
times bigger and slower than it would have been had you trained it naturally. That's what
I like.
Here's a summary. I think you know that. So let me just -- I think you already
know why it works. So I think all of our discussion has been about why it works.
I just want to say clearly the fact that you have a large ensemble does not mean
you have a complex function. In fact, I think as we said earlier often these
ensembles are reducing complexity not increasing complexity.
So I probably should skip this. I think I probably should skip it, unless you want
to stay for three or four more minutes. All this work has been done with squared
loss. Remember those big tables had AUC and log loss and accuracy and F
score and all those things. Everything I've just shown you was just with squared
loss.
So an interesting question is, well, hey, if we compress an ensemble trained to
optimize AUC, and train the neural net with squared loss to do that, will it turn out
to be good on AUC?
We typically see what I call the funnel of high performance. This is from a
different set of work we've done. This is accuracy. So high accuracy is over
here. High ROC area is over here. Anybody who has to go, please go.
And we often see this kind of behavior which is once you get to very high
performance on a problem, this is two different metrics. Just a scatter plot of two
different metrics. Lots of different models. Neural nets, bagged trees, boosted
trees, just wherever they fall, they fall.
Once you get to very high performance, there's very little spread. If you have great AUC,
well, then you can have great accuracy by finding a good threshold. You can
have good probabilistic prediction by doing something like Platt scaling.
Basically, if you have great AUC you can do well on almost any other metric. Any
reasonable metric is sort of going to be great if you have good AUC. And that's
true for most of these metrics. So for all the metrics, when you get to the very high
performing end, the funnel narrows.
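Since Platt scaling comes up here only in passing, here is a minimal sketch of that kind of post-hoc calibration: fit a one-dimensional logistic model from a model's raw scores to labels on a held-out set, then map scores through it. Using scikit-learn's LogisticRegression is a simplification of the original recipe.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def platt_calibrate(scores_val, y_val):
        # Fit sigmoid(a * score + b) on held-out scores and 0/1 labels and
        # return a function that maps raw scores to calibrated probabilities.
        lr = LogisticRegression()
        lr.fit(np.asarray(scores_val).reshape(-1, 1), y_val)
        return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]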
So to the extent that we can get our target ensemble in the high performing end,
then I'm not too worried. It will turn out the neural net can mimic it. It can be
good on any metric, modulo some small epsilon. However, on hard problems,
for which we can't hit the sort of high performing asymptote -- face it, a lot of
problems fall out here -- it's an open question whether training a neural net to
minimize squared loss will do it. Ultimately, if we have enough data and we
exactly mimic the real values that the ensemble is predicting, then we'll be equally
good on all metrics, because once you hit the target function you've got it.
But there really is a question there of how hard do you have to work to get this to
be as effective with other metrics. And I won't mention all these other things.
There's active learning, which would be fun. Better ways of doing density
estimation. When should we be using deep nets or some other model like
pruned SVMs to do compression instead of this.
Should we calibrate the models before we do the compression? Should we take
the ensemble and calibrate it? Or should we train the neural net on the
uncalibrated thing and then calibrate the neural net afterwards? Or should we
calibrate twice before -- I mean, there are all sorts of funny extra questions. Does
this all fall apart in high dimensions?
Let me just mention one piece of related work. So the early work of Craven and
Shavlik, and I think Towell (phonetic) did some of this even before then. What
were they doing? They had neural nets which were then the highest performing
models I think we knew about on average. But they weren't intelligible. So they
were creating synthetic data to pass through the neural net to then train a decision
tree to mimic the neural net, so they could understand the decision tree and
thereby understand the neural net.
And I think it's just ironic that in some sense we're doing the opposite. We're
taking these ensembles of decision trees or ensembles of lots of things and we're
going the other way and we're trying to put them into a neural net in order to get
a compact high performing model. We're losing intelligibility in the process, but I
just think it's funny that they were doing something so similar 15 years ago.
And I really should stop. So thank you and I apologize.
(Applause)