>> John Platt: Okay. I'm very pleased to introduce Rich Caruana today. He's an assistant professor at Cornell and a well-known machine learning expert. He got his Ph.D. at CMU back in 1997 and has been in various places like JustSystem Research, if you remember what that was, and has been a professor at Cornell, and he's well known for doing interesting kinds of new models of machine learning such as clustering with hints or multi-task learning. That was some of your original work. So he's going to talk about model compression today. >> Rich Caruana: Thank you very much, John. So I want to thank John and Chris both for inviting me back for a second time and helping to arrange a visit. I'm going to be here for the next two days. But I'll be in this building today. So if anyone wants to try to grab me, please try. Okay. So I'm going to talk about model compression. This is new work we've been doing just the last couple of years. This is a work in progress. I can't give you sort of final answers here. And it's joint work with two students at Cornell, Cristi Buciluă and Alex Niculescu-Mizil. And Alex is just finishing up right now. And first I guess let me -- whoops. Let me tell you what I'm not going to talk about. So I'm probably best known for my work in inductive transfer and multi-task learning. So I'm not going to be talking about that sort of thing at all. And as John mentioned I've done some work on supervised clustering and meta clustering. That would be a fun thing to talk about actually. I almost decided to talk about that, and John said there were some people here who were interested in the compression stuff. So I decided I'd talk about compression instead. I won't talk much about learning for different performance metrics, although that's going to come up in this talk. And I won't talk at all about medical informatics. We've been doing some fun stuff in the citizen science arena, learning from very messy citizen science bird data. And it's really fascinating. We've come up with some cool ways of regularizing models when things are this noisy. And I'm not going to talk at all about microprocessors. In fact, there's a person who joined Microsoft Research a year ago, Engin Ipek, who has been my collaborator in this work. So if you're interested in that, he's here, so go look him up. So those are things I won't talk about. So I'm going to talk about model compression, but I'm going to start and spend -- it might be the first 10 or 15 minutes -- talking about something else to help motivate the need for model compression, because I don't want it to look like an academic exercise. I want you to realize why you have to have something like this. So I'm going to sort of talk about Breiman's constant. It's almost a joke, an allusion to Planck's constant in physics. And I'll spend most of the time talking about ensemble selection, which motivates why we need to do this compression. And then I'll jump into compression and talk about density estimation and show you some results and then the future work. Okay. So let's see. Does everybody here have a machine learning background for the most part? >>: Mostly. >> Rich Caruana: Great. So I can skip sort of the machine learning intro that I had prepared just in case people didn't. And let's jump right to the fun stuff. So let me tell you about Breiman's constant. So Leo Breiman and I were at a conference in '96, '97, chatting during a break.
And he said something like: You know, it's weird, every time we come up with a better performing model, it gets more complex. We were talking about boosting and bagging, and how one of the beauties of decision trees had been that they were intelligible, and now that we were boosting and bagging them, you know, we had lost even this beautiful property of decision trees, and wasn't that a shame. Sort of jokingly I said, oh, it's like linked variables and Planck's constant. >>: (Inaudible). >> Rich Caruana: Exactly. You know, you can't know both the location of an electron and its momentum, both to infinite precision, at the same time. If you know one well, then you inherently know the other less accurately. I sort of joked, well, maybe something like this is true in machine learning. And if the error of the model is low then the complexity must be high and vice versa. So there's this natural trade-off. And the reason why I put this here -- I mean it's mainly a humorous way to start the talk, but we are going to be talking about this sort of thing: model complexity, and how big does the model have to be if it's going to be accurate. Is it always going to be the case that the most accurate models are humongously big, and all that sort of stuff. And ultimately compression is going to sort of try to get around this simple statement. So, okay. Here's a huge table. Don't worry, this table is taken from some other work I've done. And I'm just going to use it to motivate the need for this model compression stuff. But I am going to have to explain it. There's a lot of numbers here. Let me walk you through it and it will suddenly make sense, and I promise there will be a lot more pictures coming later in the talk so you won't have to look at these sorts of things. Here we've got families of models. All different kinds of boosted decision trees, a variety of random forests with different parameters and different underlying tree types. A variety of bagged decision trees, support vector machines with lots of different kernels and lots of different parameter settings. Neural nets of different architectures, trained with different learning rates and momentum, things like that, different ways of coding the inputs. A variety of neural nets. Every memory-based learning method we could think of, including combinations of them. There's actually 500 different combinations of these hidden under that. Boosted stumps. That's a relatively small class. Vanilla decision trees. There's only a dozen flavors of decision trees. Logistic regression, run a few different ways with a few different parameter settings, and also naive Bayes, run a few different ways. So it turns out that this represents, this column, 2500 different models that we've trained for every problem we're going to look at. So we really went crazy. We did everything we could think of to train good models on these problems. Then we went a little crazier. PLT stands for Platt. So some of these models don't predict good probabilities right out of the box. Others do. Neural nets can predict pretty good probabilities, as can K nearest neighbor. But it turns out boosted trees don't tend to predict good probabilities, and neither do SVMs, unless you do some work. So John Platt came up with a method for calibrating models. You first train the model, then apply Platt calibration as a post calibration step to improve the quality of the probabilities. It turns out that works very well.
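To make that calibration step concrete, here is a minimal sketch of Platt-style calibration in Python: fit a sigmoid p = 1/(1 + exp(A*f + B)) to a model's raw scores on held-out data by maximizing the log-likelihood. The function name and the plain 0/1 targets are my own simplification; Platt's original recipe also smooths the targets and uses a more careful optimizer.

```python
import numpy as np
from scipy.optimize import minimize

def platt_calibrate(scores, y):
    """Fit p(y=1|f) = 1 / (1 + exp(A*f + B)) on held-out (score, label) pairs.

    A minimal sketch of Platt scaling: a plain sigmoid fit by maximizing the
    log-likelihood of the held-out labels given the model's raw scores.
    """
    def nll(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        eps = 1e-12
        return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    A, B = minimize(nll, x0=[-1.0, 0.0]).x
    # return a function mapping raw scores to calibrated probabilities
    return lambda f: 1.0 / (1.0 + np.exp(A * f + B))
```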
There's also a competing method, isotonic regression, which is also good in other circumstances. It turns out John's method -- to just summarize a piece of work we did a couple of years ago -- John's method works exceptionally well if you have limited data, which you often have when you do this calibration step. If you have lots of data, the isotonic regression, because it's ultimately a more powerful class of models, may work better. But if you have little data, you should stick with Platt's method. Star means you didn't need to do any calibration. So we've taken these 2500 models and now we've calibrated them all using Platt's method, using isotonic regression and not using any method. So in fact this becomes 7500 models, if you do the three different calibration methods. Okay. So we have 13 different test problems that we're working on. And what we do is we do everything we can to train a good boosted decision tree on every one of those test problems to get the accuracy as high as we can. Question? >>: What's the size of the problems? >> Rich Caruana: So all these problems have dimensionality of about 20 to 200, and we've artificially kept the train sets modest at about four to five thousand points, even though on some of these data sets we have 50,000 or more points. And that's sort of to make the learning challenging, and it's also to make the experiments computationally feasible. Do you need more information or is that good? >>: Is this coming -- based on the data set? >> Rich Caruana: That's a good question. So half of these are from UCI, because you sort of have to use UCI so other people can compare, and the other half are things for which I really have collaborators who care about the answers. Because I don't believe in sort of overfitting to UCI. And, of course, we put more effort into the ones we have real collaborators for. So that's a good question. Okay. So what have we done? I'm just summarizing other work here. This is really still motivation. But if you don't understand what the numbers are, it won't help motivate things. So what have we done here? We've trained as many boosted decision trees as we could on each of those problems, and using a validation set we picked the particular one that's best for each problem for accuracy. There's five-fold cross validation here. So these numbers are fairly reliable. Then we've normalized these scores so that for every performance measure, no matter what it is, even for something like squared error, we've normalized it so that one is truly excellent performance. Nothing really should be able to achieve one because we had to cheat to get that high quality. And zero would be baseline. So hopefully nothing is doing near baseline. So basically just think of these as: a number near 1 means really, really good. >>: Are these binary classifications? >> Rich Caruana: They're all binary classification problems. In fact the few problems that weren't binary classification problems we turned into binary classification problems. That's why we can do things like AUC, the area under the ROC curve. Thank you for asking that. This is the average performance we could achieve with a whole bunch of boosted decision trees, with and without calibration, on the 13 problems, when we're doing model selection for accuracy. And this is the average performance we could get when we're trying to optimize for F score, which you might not be familiar with, but don't worry, it's not going to be important for the talk.
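As I understand the normalization just described, it is a per-problem, per-metric linear rescaling where 0 corresponds to baseline performance and 1 to the best ("cheating") performance observed. A tiny sketch with hypothetical argument names, assuming higher raw values are better:

```python
def normalized_score(perf, baseline, best):
    """Rescale a raw metric value so that 0 is baseline performance and 1 is
    the best performance observed on that problem/metric (the 'cheat' target).
    For error-style metrics you would flip the orientation first."""
    return (perf - baseline) / (best - baseline)
```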
And this is the best we could do for lift. And this is the best we could do for area under the ROC curve. Best we could do for average precision, the precision-recall break-even point, for squared error and for our good friend log loss. So that's the best we could do with boosted trees. And this is the best we could do with random forests. So remember these are averages over many experiments. Different models are being picked for different metrics and for different problems, but they're all boosted decision trees in the first row. They're all random forests in the second row. Different neural net architectures in this row. But whatever is best for each problem. So this is sort of the big average picture of performance, looking at different metrics. And then this is the average, this last column. So that's the average performance across all the metrics. One advantage of normalizing scores in the way we've done it is the semantics are pretty similar from score to score and from problem to problem, so it actually makes some sense to average across them this way. So that's our real motivation for having done that. >>: As far as the error wise is that just to (inaudible). >> Rich Caruana: That's a good question. It turns out the differences in this final column of about .01 are statistically significant. Here we're not averaging over quite as many things. So .01 is possibly significant. So sometimes it is and sometimes it isn't. It's difficult, by the way, to do real significance testing. We've got five-fold cross validation under the hood. And it turns out the performance on the different metrics is highly correlated; if you do well on some metrics you do well on the other metrics. It means they're not truly independent. So it's hard to know anyway. So when I bold things, I mean that we've sort of looked at the numbers behind the scenes, done some sort of simple t-tests, and if we have bolded them that means you should view those things as statistically identical. It's not a reliable test. So it doesn't mean that every time things are bolded they are indistinguishable and if they're not bolded they are distinguishable, but I just wanted to guide your eye, because we can't scan big tables of numbers like these. In fact, if you want to focus on the mean that would be fine, although these other columns will be important later on. What have we got? We've sorted things by overall performance. So boosted decision trees are sort of at the top of the table. But if you look at this mean performance over here, it will turn out this is a three-way tie, really, for first place in mean performance. So boosted trees, random forests and bagged trees are all doing exceptionally well across all of these metrics and across all these problems. >>: Limited (inaudible) for the boosted trees or did you ->> Rich Caruana: Ah, so we tried every kind of tree we could, including lots of different parameters for the trees. So we grow full-sized trees and boost them. We grow reduced trees and we boost them. We also do something with boosting that some people do and don't do. We do early stopping, which is we boost one iteration, two, four, eight, 16, 32, out to 2,048. And the validation set is allowed to pick whatever iteration works best. It helps boosting a lot, by the way. It wouldn't be in first place in this table if you didn't do that. >>: You don't do that? >> Rich Caruana: Surprisingly, some people don't do that.
There's one problem here, by the way, where boosting starts overfitting on its first iteration, which means that the validation set prefers iteration one, which is before any boosting has occurred. And every iteration after that actually goes downhill on most metrics. So it turns out that boosting would not be doing this well if you didn't allow that early stopping. Boosting is a very high-variance, risky method. But when it works, it works great. Random forests are doing so well for a completely different reason. They're sort of reliable time and time again. They never really grossly fail. And, similarly, bagged trees are pretty reliable. Okay. >>: This may be philosophical. But you mentioned you have a three-way tie for first place; isn't that more like a five-way tie? >> Rich Caruana: Yeah, so it turns out that these are statistically distinguishable as far as we can tell. No matter -- we've done a bunch of bootstrap analyses with the data and we always get this sort of same picture. Maybe two of these things would change places, but we never see these things move into the top or those things move down. So we tend to believe that SVMs and neural nets really are in this sort of tie for -- it's not second place, it's third-and-a-half place. Fourth-and-a-half place, I'm sorry. By the way, don't overfit to these results. These are all problems of modest dimensionality, 20 to 2200 if you're doing bags of words. In fact, we've repeated this kind of experiment in high dimensions. The story is quite different in higher dimensions. If you want I can tell you about it sometime. As you'd expect, the linear methods really come into their own once the dimensionality breaks 50,000 or 100,000. But for the world where you've got, say, 5,000 training points, binary classification, and modest dimensionality, this seems to be a pretty consistent story. People tell us, your story is like this because you forgot to include data sets that had these difficulties, so then we added those data sets and the story didn't change. So, okay, some interesting things. Let me come back to that in a second. The interesting thing is the top of the tree is ensemble -- I'm sorry, the top of the table is ensemble methods of trees. So I didn't necessarily expect that when we started doing that work. So that's kind of interesting, and that itself is going to be part of the motivation for the need for the compression work. >>: I meant to ask you, you never tried an ensemble of logistic regressions? >> Rich Caruana: No. Although we have done ensemble methods of perceptrons, voted perceptrons. And they do pretty comparably, especially in high dimensions, to logistic regression sometimes. We've done ensembles of those. It's not here. They don't do as well as these, except in very high dimension. In very high dimension, the story is quite different. In fact, logistic regression all by itself in extremely high dimension starts to match the performance of the best of these. Boosted trees, it turns out, don't hold up very well in very high dimension. To our surprise, random forests just take high dimensionality in stride and do extremely well in high dimensionality. HATS (phonetic) also do surprisingly well in high dimension. I didn't anticipate that. I thought they would also have trouble. SVMs, because we include linear SVMs in our mix of SVMs, they also do well in high dimension because you end up picking the linear kernels for SVM. >>: (Inaudible) >> Rich Caruana: No, in that case, in very high dimensions, you sort of have to have more data.
We're using the natural size of these other 12 test problems that we have. So in some cases it's as much as several hundred thousand points in the train set. Good question. With 5,000 points in the train set, I don't know if 100,000-dimensional data sets would mean much. Very good. The tree methods are doing very well. You might look at this table and think, great, nice answer for machine learning. As long as we use one of these ensembles of trees we've got our top performer; it works well across all sorts of metrics. Great. You'll be sad if you're an SVM aficionado, because it didn't quite make it to the top. Turns out that's not the conclusion you should draw. We're adding one more line to the table. Remember, this is the best we could do with boosted trees. This is the best we could do with random forests. This new line, we're just agnostic to the method: we take the 7500 models we've trained down here and we just use the validation set to pick the best one. So this gets to use anything it wants. And the surprise to me is not that it's better. You'd expect it to be better. It's how much better that is the surprise. So remember this was like a three-way tie for first place and then these guys were sort of coming in second. Well, the differences here are dwarfed by this difference. So this tells you that it's not the case that boosted trees or random forests or bagged trees are always consistently one of the best models. The only way to get that large difference is if occasionally some of the models down here are the models that are best. In fact, when you look at the details you see that. You see there are problems for which logistic regression, whose average performance is not so high, is the best model by a significant amount on that problem, and if you didn't use it, you actually are going to lose significant performance, because these other models don't do it on that problem. And there's another problem for which boosting fails miserably, and random forests, none of the tree methods, truthfully, do very well. But it turns out boosted stumps do extremely well on this problem. And if you weren't looking at boosted stumps you would have trained an inferior model. >>: Just to make sure. The procedure for generating that first row is you're taking the validation set, picking the model by it, and that's the number that's reported? >> Rich Caruana: Yes, thank you, right. That's -- that's right. So we train 7500 models. We use our held-aside validation set, pick the one that looks best for accuracy on each problem. And then on the big test sets that we have we report the performance and convert it to the standardized scale. Thank you. I should have said that. >>: (Inaudible) but is it the same (inaudible) if you do the same thing (inaudible) I would think that you would not overfit nearly as much. >> Rich Caruana: Right. Right. The differences between methods start to become less in very high dimensions. And the linear methods catch up with all the other methods. And then a few methods clearly break. And boosting is one of them. So boosting just overfits dramatically in very high dimension, unless you've done something to regularize it. >>: What about a much larger training set, is it the same? >> Rich Caruana: Good question. The high-dimensional experiments we've only done with the natural size of the data sets, as other people experimented with them. And for these data sets we've only done experiments with the sort of 5,000-point training sets. So I've never actually been able to do the learning curve.
The experiments are expensive. It takes us several months to create this table, and in high dimensions it took even more time. We had to write a bunch of special purpose code in fact to do the high dimensional experiments. It just wasn't that easy to train some of these things at that high a dimension. Sure. >>: (Inaudible). >> Rich Caruana: Which single model? >>: Yes. >> Rich Caruana: So it does turn out that the best single model on average is boosted trees. >>: The very best one is (inaudible). >> Rich Caruana: Oh, oh, I'm sorry. So this thing could be two neural nets on two problems. It could be three boosted trees on three other problems. It could be random forests on four more problems. It could be logistic regression on one problem and boosted stumps on another problem. And that's for accuracy. Then when we go to ROC area, it may prefer very different models. It might suddenly prefer neural nets and K nearest neighbor. In fact, K nearest neighbor and neural nets do quite well on ordering metrics like ROC area. So there's no easy answer to that. But the important thing is that it's not the case that just sticking with a few best methods is safe if you really want to achieve the ultimate performance. Sadly, it looks like you actually have to sort of try everything and then be a good empiricist and use a validation set to pick the best thing. So that's the take-away message from this. Okay. So now we're going to make that a little worse. Now, I've just said that if you really want high performance, you have to train 7500 models, or 2500 models and calibrate them in different ways, and use the validation set to pick the best. Can we do something better than just picking the best? We all know what ensembles are. Can't we form an ensemble out of many of the models we've just trained and possibly do even better than the single best model in that set? And you all know that as long as we have a bunch of classifiers, many of which are accurate, and hopefully they're different from each other -- diversity in the models -- there's a good chance we'll be able to form an ensemble that's better than any one of those models. Lots of ensemble methods around. In fact, we're using some of them in the table. There are things we're not using, like error correcting codes. There's a lot of things we can just do. We can take an average of all those models. We can do Bayesian model averaging, where you take an average but now weight it by the performance of the model, so higher performing models get more weight. We could do stacking, which is trying to learn a combining model on top of the predictions of all those models, using the validation set as the training set for the stacking model. All these ensemble methods really differ in just two ways. One, how are the base level models generated, and, secondly, how are those models combined? So the base level models are generated by the process we just described. We're now going to try to build an ensemble out of those things. We basically train every model in the kitchen sink, just train everything, everything that makes any sense that we can afford to do, and we keep it. Now we're going to try to combine them in different ways. So let's do that. So I've pruned the bottom of the table, but here's the top of the table. Here's the best line from before -- you can now think of best as an ensemble. It's the ensemble which puts weight one on one of the models and weight zero on all the other models. It's a funny ensemble.
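Since stacking comes up next, here is a minimal sketch of that kind of combiner, assuming each base model's predictions on the validation set are cached in a dictionary; the names and the use of scikit-learn's LogisticRegression are my own illustration, not the exact setup from the talk:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack(val_preds, y_val):
    """Stacking: learn a combining model on top of the base models'
    predictions, using the validation set as its training set.

    val_preds: dict of model name -> validation-set probability predictions
    """
    Z = np.column_stack([val_preds[n] for n in sorted(val_preds)])
    combiner = LogisticRegression(max_iter=1000)
    combiner.fit(Z, y_val)   # with thousands of highly correlated columns and
                             # only ~1000 points, this overfits, as reported below
    return combiner
```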
You can't expect it to do better than the best model, since it is the best model. >>: Stacking ->> Rich Caruana: So stacking, what we've done is we've tried to use, say, logistic regression to combine all the predictions. That didn't work very well. Then we tried using SVMs to combine the predictions. We had a lot of trouble. You'll notice the performance is not very good. We had a lot of trouble because the 7500 models are all very correlated with each other, and our validation set is modest. It's only a thousand points that we've held aside for validation. So the stacking always overfits dramatically. Now, of course, we can set parameters for stacking so it does something like just take the average of everything. We can tune stacking that way. But we wanted stacking to have the freedom to hang itself if that's what it was going to do. And we didn't force it to do average-all, because we have that as our own separate line in the table. So that's the poor performance we got with stacking, and Bayesian averaging, by the way, does just epsilon better than picking the best single model -- that's not statistically meaningful. There are different ways of doing Bayesian averaging. We're just exploring one particular approach there. And now that we've made this work publicly available other people have come to us and suggested other approaches. We think that our results with both stacking and Bayesian averaging are not the best that can be achieved. So we're confident if we spent some more time -- we put a fair amount of effort into it. We were surprised they weren't better. We thought they would just work right out of the box. We put some time into it. But we were surprised with the difficulty we had getting improvements from these things. But we're sure that there are ways to make this work. It's just that in this sort of month that we spent trying to make it work, we didn't hit on it. It will turn out that for the model compression story I'm going to talk about soon, it doesn't matter. If you can make one of these things work, that's great. And you're still going to need the model compression that I'm going to talk about. But what we're going to do is just create our own stacker to combine these predictions, which we call ensemble selection. Because we were disappointed that those methods didn't work. We really thought there must be a way of getting even better performance out of this large set of models. So we just quickly tried something, and it sort of paid off right away. So let me describe that to you quickly. I won't go into the details of this. Train lots of different models using all the parameters and stuff you can. You've already seen that we're doing that. Just add all the models to a library. Don't throw any of them away. No pruning of models or anything like that, just keep them all around. And then we're just going to do good old forward stepwise selection. If you've done forward stepwise feature selection, which has been around for 50 or more years, we're just going to do forward stepwise model selection. Just one at a time we'll add models from this collection into the ensemble in an attempt to keep making it greedily hill-climb toward better performance. Let me walk through that. Here's 7500 models. There's our ensemble, which we start off with nothing in it. And we've been asked to build an ensemble that optimizes area under the ROC curve. That's our job. The nice thing about this method is you can optimize to any performance metric as long as you can calculate it reasonably fast.
Here's the ROC of each of these models individually. We find the one that has the best ROC and we put it in the ensemble. Now the ensemble is just the best model. Now it's equivalent to that best line in the table, in fact. It's not an ensemble yet. Okay. So that model is now in there. Now we go back to the remaining models and we figure out what would the ROC be if this model, Model Five, were to be added to Model Three and their predictions averaged? So it turns out to be .9047. That's not an improvement. .9126. That's not an improvement. Whoop, that looks pretty good, better than what we've got up there: .9384. So we find the model of the ones that are left that's best to add to the ensemble and we add it. >>: Can you weight it a different way? >> Rich Caruana: You can imagine different weights. We find that every time we make it too flexible it overfits. If we had 10 or 100 times more data the world would look quite different. And we were able to get mileage out of this very simple method. The only thing we do end up doing -- I'm not showing it here -- is greedy selection with replacement. Which means a model can be added two or three times. It turns out that's important for reasons I won't go into. And when a model is added three times it does get three times the weight of a model that's added once. So we do let it very crudely adapt the weights. Okay. So we go back to the well. We find the model that when added to the ensemble would make it best. We just keep repeating this until things stop getting better. Okay. Now, the more models you've got, the better chance you have of finding a diverse set of high performing models that actually work together in a way that gives you high performance. That's great. But this overfitting is really a killer when you've got this little data. We've had to come up with a number of tricks to mitigate this overfitting. And I am telling you this because it's so natural to want to go and try this, if you've trained a bunch of models. And you might not realize that if you don't control overfitting, you'll actually do worse than if you didn't do this in the first place. You're better off picking the best model than trying to greedily form an ensemble if you don't control overfitting. This is critical. That's the subject of another paper. We won't go into that. But you have to do something to take care of it. We have some hacks for doing it. It works well. Here's how well it works. There's ensemble selection added to the top of the table, and I've re-bolded all the entries. I think you can see ensemble selection has gotten -- remember, the best line was really -- I mean this was pretty good performance down here. These are some world class models. And we're trying every variation of them under the sun that we could afford to try. And picking out the best thing down here was much better. And there's yet this other big increment from taking this ensemble of those different models. So that's nice. I mean that's what we're hoping to see, and with some effort we actually got that. So that really is -- I mean in some sense it's beyond state of the art performance for simple machine learning. >>: Do you think, in principle, the Bayesian average or any weighting scheme could have achieved the same ensemble, by putting in zeros everywhere else? >> Rich Caruana: Exactly. >>: But, in fact, it doesn't because of overfitting on the small dev set. >> Rich Caruana: That's exactly right.
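Here is a minimal sketch of that greedy forward stepwise selection with replacement, assuming cached validation-set predictions for every model in the library. It stops at the first step that fails to improve the metric and omits the overfitting controls just mentioned, so it only illustrates the basic loop; all names are hypothetical.

```python
import numpy as np

def ensemble_selection(val_preds, y_val, metric, max_iters=200):
    """Greedy forward stepwise ensemble selection with replacement.

    val_preds: dict of model name -> cached validation-set predictions,
               so each step is just an average plus one metric call.
    metric:    callable(y_true, y_pred) -> score, higher is better.
    """
    names = list(val_preds)
    chosen = []                              # models picked so far, repeats allowed
    ens_sum = np.zeros_like(y_val, dtype=float)

    for _ in range(max_iters):
        best_name, best_score = None, -np.inf
        for name in names:                   # selection WITH replacement
            cand = (ens_sum + val_preds[name]) / (len(chosen) + 1)
            score = metric(y_val, cand)
            if score > best_score:
                best_name, best_score = name, score
        # stop once adding any model no longer improves the ensemble
        if chosen and best_score <= metric(y_val, ens_sum / len(chosen)):
            break
        chosen.append(best_name)
        ens_sum += val_preds[best_name]

    # a model added k times gets k times the weight
    return {n: chosen.count(n) / len(chosen) for n in set(chosen)}
```

At prediction time you would average the base models' outputs using these weights.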
There's something about the greedy stepwise selection process with our overfitting controls that makes it a more effective algorithm than these other methods, which should have been able to do it but somehow couldn't. >>: You're not guaranteeing that, or are you, that the rise would be (inaudible). >> Rich Caruana: No, no. In fact the graphs are quite noisy. It turns out that selection with replacement makes them better behaved. It's one reason we do that. It turns out, by the way, occasionally this thing overfits, and in fact you would be better off for some problem on some metric just picking the best model, and this thing is epsilon worse than that. It picked the best model and put it in first and then it made mistakes afterwards. But on average, across many problems and metrics, its performance is quite, quite good. >>: If your validation set was larger, would the results still hold? >> Rich Caruana: If the validation set was larger, I think the results would still hold. But stacking and Bayesian averaging, I think, would be doing much better. Yeah. They would be competitive. In fact, it's possible they would outperform this. Bayesian averaging, I tend -- we can talk about this afterwards. Sort of a philosophical question. I tend not to think of Bayesian averaging as being so much an ensemble method as a hedging-your-bets method. But we could talk about that later. It would be fun to hear what your opinions were on that. >>: Have you tried to expressly incorporate diversity? >> Rich Caruana: That's a great -- we did. And we read papers on how to calculate diversity between models. And the funny thing is we never got any mileage out of it better than ensemble selection. In fact, it was hard to duplicate the performance. If you think about it, by greedily selecting the next model to add to the ensemble to maximize its performance, in some sense it's implicitly thinking about diversity even though it has no explicit measures of diversity. It is going for the model from the large set that, when added to the models already there, sort of maximizes performance. The odds are they're diverse. It turns out, if you look at the models that get put into the ensembles, it's really fascinating. It never, ever, not on a single problem or metric, sits there and uses 90 percent of one model class and just adds a few others. It's not like that at all. It pulls in 23 percent of this class and 16 percent of that class. And it turns out if you're after squared error, it throws in a bunch of neural nets. It's very interesting to see what it likes to use. It's just fun. You can spend days just looking at the ensembles and sort of telling yourself stories about them. >>: Did you ever try using the original training set for this ensemble selection instead of a validation set? >> Rich Caruana: Yeah, and it just fails terribly. It turns out the performance of the models can be so good on the training set that it just fools itself right off the bat. Yeah. Yeah. You have to have the independent validation set or else it really does bad things. There are tricks you can do with five-fold cross validation so you ultimately get to train on everything and validate on everything. It takes more effort to do it. But it does work. >>: Do you have any insights as to what it is about these data sets that prefer (inaudible)? Ideally you'd have some simple method that says now (inaudible). >> Rich Caruana: Yeah, yeah. >>: We're kind of anxious to find out. >> Rich Caruana: We did get some insights.
But I wouldn't want to -- they're almost the kind of insights you would have before you looked at the results. Realize, we only have 13 data sets. And we don't have any well-defined way of characterizing them. So we have a small sample size in that sense. But you do see things like: data sets that we knew were noisy, because of our previous experience with them, those are the ones boosted trees do not do well on, although random forests still do quite nicely on them. Bagging does quite well and boosted stumps do quite well. So things that you would expect sort of happen. Things like that. But I can't say we've had too many eurekas. Like, oh, if the ratio of nominal attributes to continuous attributes is greater than .5, then you should be using memory-based learning. Sadly, no. If we could do the same sort of experiment but now with hundreds of data sets, then we could actually do even sort of machine learning to try to take the characteristics of the data sets and predict what model would do well on them and what would do poorly. That would be fascinating. But we'd have to scale up by an order of magnitude or two orders of magnitude to be able to even touch that. So it's a great challenge, though. So this thing really works. It's kind of nice. This really is phenomenal performance. If you've got a limited amount of training data and you want great, great performance, and every little improvement you get makes a big difference -- it either saves a life somewhere or it increases your bottom line by a million dollars -- I mean this is a good technique. Okay. So it really works well. It's interesting that we haven't hit the supervised learning ceiling yet, right? I mean with some fairly simple stuff like training a bunch of models, picking the best and forming a greedy ensemble out of them, we're really upping the performance that we see quite a bit. It's cool. You can optimize these things to any performance metric. That's one reason why they work well. We don't know how to train neural nets to any performance metric, but we can optimize the ensemble to any metric, so that's kind of nice. It works even when some of the base level models aren't good on those metrics, because it can still form combinations of models that work well on that metric, even though no one of the models works well on that metric. So it has some room to sort of improve over what the base level models can do. And then there are some nice things, like: it takes a long time to train all those base level models, but forming the ensemble just takes seconds on a laptop. The greedy forward selection just flies through that. And that's because we cache all the predictions. It's actually quite fast. And there are cool things you can do. Like you can imagine that the world changes. You have some new small sample of labeled training data. You don't retrain all the base level models. Maybe you don't have enough data to do that. But you could redo the model selection to build the ensemble. Just like that. You could do that every ten minutes, if you wanted to, if it was the stock market or something. You'd have to have labeled data, a new validation set, to do it. So this has some nice properties. So this is good. Here's a really big problem, and this is the -- this is all motivation. (Chuckling) So there is a really big problem, and the problem is that these things are big. Way, way too big. So think about it. Some of the base level models are things like boosted trees and random forests and bagging and K nearest neighbor.
So I just pulled an ensemble out. And this ensemble had 72 boosted trees in it. Each boosted tree can have either a thousand or two thousand trees in it. Right? So 72 boosted trees turn out to be 28,000 trees. It had only one random forest. This one didn't like random forests. But even that had another thousand trees in it. It had five bagged trees. So that was 500 more trees. There was a lot of trees in here. It had 44 neural nets of different sizes. So that was a total of 2200 hidden units that got added. So it's a large number of weights. It had 115 memory-based learning models in it. And it had a bunch of SVMs, different ones, RBF kernels, you know, whatever. Boosted stumps, you get the picture. So it's a big thing. This particular one takes about a gigabyte to store all those models. Gigabytes are still big things. And it takes almost a second to execute that huge number of models, to then take the average prediction to make a prediction for one test case. >>: (Inaudible) sorry. >> Rich Caruana: I'll show you typical results, yeah. But it's not atypical. It's slightly larger than typical because it makes it more fun. But it's not that atypical. Good question. Was that your question as well? >>: Yes. >> Rich Caruana: [Laughter] okay. And I just want to convince you this really is a big problem. I mean, think about web search. What do you get? A billion queries? A week? A month? Whatever. You're not going to execute this thing a billion times. I mean you're not going to even execute it a billion times to cache the results. A billion is a lot. So the test set is large. If you really want to use the thing in real time, you haven't got a chance. It is longer than the delay that's allowed when the answer has to get back to the user. And you can try to parallelize it. If I had 30,000 models in there, maybe you could dedicate 30,000 machines, one to each model, but the communication costs -- it's just not going to work. So you're not going to use it for web search. You're not going to do face recognition by scanning your little box across your image at multiple scales to try to find faces, right? Because, again, the test set is going to be too big. If you're trying to do this on video it would be even worse than still images. Satellites, think about it. The memory they put in satellites is still like PCs of a decade or more ago. Right? It's hardened memory, so it's small. It means you can't fit these models in memory anymore. God knows you don't want to go to secondary storage to get to them. You can't do that. The processors have very little speed, very little power. You're never going to put these on a satellite. Not going to put it on a Mars rover. Power considerations alone would prevent you from putting it on the Mars rover. The same thing for hearing aids. You could spend a lot of money on hearing aids. People pay money to hear. But you're just never going to have the power there unless you start putting little nuclear power plants in their ears. PDAs and cell phones, now cost is a real issue. Maybe PDAs are getting powerful enough that they could start to do this. Depends on what you're using it for. If it's trying to figure out where you're going or what restaurant you're interested in, maybe you could afford it. But then the cost would be prohibitive. At the margin, an extra 10 cents on the cost of a phone is significant. Digital cameras. You get the picture. You aren't going to be able to use these things for a large number of applications in which you would like to use them.
>>: I'm curious, if you sort of prevented it from getting large artificially, what the hit in performance is, whether you needed that size or if you can still ->> Rich Caruana: We did try that. And there is work on taking ensembles and trying to prune them to sort of make them smaller. And we never had that much success. And we think the reason why is the diversity is really important to why the ensemble works so much better than the individual models. And as we started to sort of -- we can cut them in size by half and get ultimately the same performance. But half isn't enough to make a difference. We can't cut them in size by a factor of 10 or 100 and keep the same performance, because the diversity seems to be so important. Okay. So this really is a nasty problem. All these applications are sort of out the window. In some sense you might say this is Breiman's constant come back to haunt us, right? It's true. If you want low error, you're going to pay the complexity price, and just too bad. You can't have low error on these applications. It's just the way it is. Well, that would be a sad day. If that was the answer, then I would stop now and take questions. So we're going to have an approach to this, which is model compression. And what we're going to do is: you train that model any way you want. We'll use ensemble selection. Then we'll train a simpler model to mimic that complex model. The way we're going to do that is a trick that people have used for other reasons. We're going to pass a bunch of unlabeled data through the complex ensemble and collect its predictions. It's unlabeled data. We don't know what the real answer is. We get the prediction from this ensemble because we want to mimic that complex model. We let this thing label our unlabeled data. We now have this very large, ensemble-labeled data set, and now we're going to use that large data set to train what is hopefully going to be a smaller sort of mimic, copycat model. If we're successful with this thing, it will look just like the function that was learned by the complex thing, but it will do it in a smaller package. So that's the game. Let me just show you right away one of the results, to show you that this is possible, that it makes sense. This is squared error. So down is good. This is the number of hidden units. We're using a neural net as the mimic model here. The neural net is trying to mimic a hidden ensemble. This is the number of units in the neural net. We've done the trick I just described. We've taken a bunch of unlabeled data and labeled it with the ensemble. The performance of the ensemble is this squared error. It's very good. It's an excellent model. The performance of the best neural net we could train on the original data was this good. It's not a bad model, but it's not that good. That's the model we want to deploy. That's the best neural net we knew how to train. This is the best neural net we can train using this sort of compression trick. And you notice we get really close to the performance of the ensemble -- this is a log scale -- with a neural net that has sort of 32, 64 hidden units. It's a modest sized neural net. It's not a billion hidden units down here. This thing is nowhere near the complexity of the ensemble. Depends on how you measure complexity now. >>: Can you give us a sense? Is it a thousand times lower than (inaudible) what's the gap in the complexity? >> Rich Caruana: It's several thousand times faster, and several thousand times smaller. Yeah. So I'll show you some tables of those results.
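Here is a minimal sketch of that compression loop in Python, assuming you already have a trained ensemble that outputs probabilities and a pool of unlabeled (or hallucinated) points. Scikit-learn's MLPRegressor stands in for the plain backprop nets used in the talk, and the function and argument names are hypothetical:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def compress(ensemble_predict, X_unlabeled, X_train, y_train, hidden_units=64):
    """Train a small mimic net on data labeled by the large ensemble.

    ensemble_predict: callable X -> predicted probabilities (the big, slow model)
    X_unlabeled:      unlabeled or synthetic points, ideally near the data manifold
    X_train, y_train: the original labeled training set (always kept, per the talk)
    """
    # Let the ensemble label the unlabeled data with continuous probabilities.
    y_soft = ensemble_predict(X_unlabeled)

    # Mimic the continuous outputs: a regression on probabilities, not 0/1 labels.
    X_mimic = np.vstack([X_train, X_unlabeled])
    y_mimic = np.concatenate([y_train.astype(float), y_soft])

    mimic = MLPRegressor(hidden_layer_sizes=(hidden_units,), max_iter=2000)
    mimic.fit(X_mimic, y_mimic)
    return mimic   # small, fast model approximating the ensemble's function
```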
>>: Were that the case -- was it as complex as the one you showed? >> Rich Caruana: This is actually an average over several data sets. I wasn't going to tell you that. But it's actually an average over several data sets. So there's no one ensemble. >>: Is it a consistent lift? >> Rich Caruana: No, in fact, I'll show you some results where that doesn't happen. No, it's a very good question. And if we had enough synthetic data, you would probably never see a lift because the neural net could never overfit. >>: But you could (inaudible). >> Rich Caruana: Exactly. It just takes time. It just takes time to make the data and then time to train the neural net. So we're always trying to sort of play this game with as little data as possible. >>: (Inaudible). >> Rich Caruana: No, no, you're absolutely right. We'll buy the extra cluster and spend the extra month. I agree completely. >>: You're trying to mimic, trying to mimic the dirty outputs or (inaudible) as well. >> Rich Caruana: Good question. Just the output. I should tell you my personal approach -- this is still all Boolean classification. My personal approach is always to predict the probability of something being in the class, so I have continuous numbers coming out of these models. We're actually trying to mimic the continuous numbers. Yeah. So, thanks. So that's a preview of things to come. That's where we're going. Now we'll just talk about this process and when it works and doesn't work and how to make it better. Okay. Why a neural net? Well, we tried it with decision trees. I've seen some large decision trees in my day, but never this big. They had to be humongously large before they could model the ensemble with high fidelity. We had to keep modifying the code so they could keep training bigger trees. You have another problem: recursive partitioning keeps running out of data. You end up having to have large synthetic data sets so you can grow very large trees in the first place. So it's painful. It's not very good compression because the tree is huge. You need a huge amount of data to do it. It wasn't a win in our experiments so far. Support vector machines are a little more promising, and on some problems it turns out you can get a reasonably performing support vector machine. On other problems, though, the number of support vectors you needed in order to get high performance just sort of went exponential. We just needed some huge number. Now, you could use a trick like Chris Burges's trick way back in '96 for basically pruning away the least important support vectors and trying to minimize that. We haven't done that. So, hey, everybody knows that K nearest neighbor is Bayes optimal if the training set is large enough. No reason to think it wouldn't work well here. The problem is that the training set would have to be extremely large. But there might be tricks with good kernels and clever data structures and pruning of the training set, things that people have all done research on, that might make this feasible in some applications. But it's not going to be feasible sort of simply right out of the box. You'll have to do a lot of work, I think, to make it feasible. The neural nets -- we just tried them and it was like, wow. We didn't have to work to make them work. So we stuck with the neural nets. It doesn't mean that we would ignore these other things. There are times, as you're going to see every now and then, when the neural net has trouble and we'd have to use something else.
>>: Did you take any care with the input density to match the problems you originally trained on, and how sensitive was it? >> Rich Caruana: Yes. Yes. So that's exactly where we're going. >>: These are very high dimensional? >> Rich Caruana: This is all the 20 to 200 dimensional problems. I don't know how well -- our method is going to depend on density estimation to create the synthetic data. And I don't know how well our method would do in very high dimensions, since the world really changes up there. >>: Did you ever compare this to like the deeper (inaudible). >> Rich Caruana: No, I have that in the future work slide that I know we'll never get to. [Laughter] But we haven't done that. And it's definitely a work in progress. I see this as being halfway there. We've got that first really promising proof of concept. It's actually already useful for a whole number of things. But there's a whole lot of additional work you'd like to do to say you really understand it, and to know when it's best and to know when you should do something else. >>: Do you think it's critical that you're only asking for one of these parameters that you're using? >> Rich Caruana: No, we (inaudible). >>: Okay. So this is generic, you don't even need to know the classification, it could be any type of thing? >> Rich Caruana: It can be used to mimic any function. >>: Or anything in the vendor space? (Phonetic). >> Rich Caruana: Conceivably. Yeah. I think it's completely generic. There's nothing we've done and there's no result we've had that makes me believe it's sensitive to any of those details. >>: Does the previous graph really say that the whole thing was just a question of how much data you had to train on in the first place, and nothing about the diversity ->> Rich Caruana: I think this is a good interpretation; that if we had had a million labeled points to begin with, we would have gotten a neural net this good or even better in the first place. But we never had that data to begin with. And for some reason backprop and all the variations that we've tried aren't capable themselves of learning this good a model from the limited training set. So we have to go through this really awkward neural net training algorithm, which first trains a whole bunch of other things, then builds an ensemble, then hallucinates some synthetic data, labels it with that ensemble, and trains the neural net. It turns out that neural net training algorithm is quite effective. >>: If you had the million data points, do you still think you would get a lift with the ensemble, or would it take 10 million? >> Rich Caruana: That's a good question. So one thing we don't know is how much extra lift we would get from these ensembles if we were really in the data rich regime. And there's a chance that just good old boosting, which is such a powerful learning method, if you can feed in enough data to prevent overfitting, there's a chance that good old boosting is going to hit the asymptote and nothing is really going to be able to best that. That's what I believe. But I don't know that for a fact. >>: There's nothing in the method that says you must use synthetic data, right? >> Rich Caruana: Right, right. It turns out if you're in a world where you have real unlabeled data -- I'll show you some results which show that when you've got that, you're golden. It's using the synthetic data that hurts you. The goal is to make it not hurt you so much that it makes this impossible. >>: You can view the (inaudible) themselves as the output (inaudible). >> Rich Caruana: Absolutely.
>>: And just kind of ->> Rich Caruana: That's actually right. I mean, all we're doing is we're just training the neural net on a large training set. Now, the large training set happened to come from this other model whose performance we really envy (chuckling), so we went through these sort of complex machinations to sort of extensionally represent the function and capture it in this training set, and now we're just training a neural net on it. And that's it. Anything that will give you that large data set -- I mean, anything that will give you a large, good training set, you'll be able to train a good model out of it. And it probably wouldn't have to be a neural net. So these are all good questions. And in fact I'll skip through slides later on, because we're hitting all the points now, which is perfectly fine. Okay. So neural nets are good. We're getting surprisingly good compression with the neural nets. I'll show you some results. And the execution cost is low. You know, these are all one-hidden-layer neural nets. It's just like a matrix multiply. It can be parallelized. They were building hardware to do this a decade ago. So you can make this as fast as you want. It could be a small part of the chip in your cell phone. It's trivial stuff. It's expensive to train the nets, because they'll train on maybe hundreds of thousands or millions of points. And any intelligibility you had, we probably lost it when we put it into a neural net. The truth is, if you started with ensemble selection, well, you didn't have intelligibility to begin with. >>: (Inaudible) to train with. >> Rich Caruana: I think the stuff only goes up to 400K. But there's clear evidence that we need to be going over a million on some problems. >>: How many points ->> Rich Caruana: Say it again? >>: For training. >> Rich Caruana: So we get pretty good results at, let's say, 500,000. But for some problems you'll need 5 million. >>: How long does it take to train? >> Rich Caruana: Oh, it took us -- we tend to use good old slow backprop as opposed to faster second order methods, because sometimes we don't get as good results with them. They seem to be more likely to overfit. So it's definitely days. And in some cases weeks. But it depends on the dimensionality of the problem and how hard it is to learn. >>: What's the size -- this is a 200 (inaudible). >> Rich Caruana: It's interesting. I'll show you some results where there are some problems where the eight hidden unit network just does it. There are other problems where you need hundreds of hidden units, and they're the ones that are harder to learn. It's very interesting. I'll show you the results. If I don't satisfy you, definitely come back and ask me again. And the neural nets, for some applications like web, real time, they may still not be fast enough. It could be that logistic regression and its cousins are going to sort of dominate in the web world for a long time, just for many reasons. We've got this new problem, which we've already talked about, which is: where does the unlabeled data come from? If you've got tons of unlabeled data, as you tend to have in text applications, web applications, image applications, that's great. I'll show you some results that show that's the best thing you could have. Often you don't have it. Often all you've got is that original labeled training set, you know, which in our case is 5,000 points, and you've just got to do the whole process with that. So you're going to have to hallucinate some data. And it's critical to hallucinate that data well, right?
You've got some manifold in a high dimensional space. You really want a sample from P of X. You want your samples to be on that manifold, because you don't want this compression model learning the function off the manifold. You don't want to distract it. And you don't want to waste your samples off the manifold. Right? If you double the size of the manifold in each dimension, in a high dimensional space the vast majority of samples will not even be on the manifold of interest. And you've just wasted 99.999 percent of your samples. You can't do that. You have to stay true to the manifold as much as possible. What we'd like is true to the manifold plus epsilon, because that's better than being true to the manifold minus epsilon -- then you miss the edge of the manifold, and the edge of the manifold could be important. That's what we're looking for, and this all gets harder as dimensionality increases. So what are we going to do? We're going to just look at three methods here. One is the simple straw man, just univariate modeling of the density. That's not going to work well, as you know. >>: (Inaudible) might be incredibly hard for ensembles, especially -- what if you tried to sample just around where the actual boundaries were? >> Rich Caruana: Right. Some sort of active learning almost, where you're trying to focus on the classification region? So it wouldn't work for regression and it might or might not work if you're trying to predict probabilities, but it might be great for classification. >>: (Inaudible). >> Rich Caruana: No, there are times when classification is all that counts. >>: (Inaudible). >> Rich Caruana: I tend to like them. But it doesn't mean you have to have them in all applications. So it would be fun to talk about. >>: At the end of the talk you're talking about a bunch of different (inaudible) then you're talking about one data set, right? Because you have one feature (inaudible) for different data sets? So which one are we talking about? >> Rich Caruana: Actually, I'm going to average over eight different data sets, which are a subset of the data sets from the early part -- which you didn't see what they were anyway. So we're going to show you some results for eight different data sets. In some cases I'll just average over all of them to give you the big picture. In a few cases I'll drill down and show you what happens with specific data sets. All right. So I'll just look at the straw man approach, then Lowd and Domingos' approach, naive Bayes estimation -- I'll describe that -- and then our approach, MUNGE. So let's say that this is a representation of the true distribution that we would like to sample from. So think of that as an infinite sample from this nice 2-D distribution. And we'd like to be able to get samples like this. So that's just a subsample of this. That's the kind of thing we need to be able to generate. Okay. One way we could do it, in just 2-D, is model P of X1 and P of X2 individually. Maybe model them with a Gaussian, or use your favorite method -- it could be piecewise or a spline, any way you want. Just model them separately, so we'll lose the conditional structure of the data, and let's see how that works. Okay. Of course, it doesn't work. It doesn't know anything about the hole in the middle. It doesn't know anything about the rounded-off edges. Of course, it generates bad samples. Now, it does cover the space. Maybe that's sufficient. Maybe it could be more efficient, but maybe covering the space is okay.
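As a concrete illustration of that straw-man generator, here is a minimal sketch that samples each attribute independently from its own empirical distribution; the function name is hypothetical. By construction it throws away all the conditional structure between attributes, which is exactly the failure mode being discussed:

```python
import numpy as np

def sample_univariate(X, n_samples, seed=0):
    """Straw-man pseudo-data: draw each column independently from that
    column's empirical distribution, ignoring all dependencies between
    attributes (so no hole in the middle, no rounded-off edges)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    cols = [X[rng.integers(0, n, n_samples), j] for j in range(d)]
    return np.column_stack(cols)
```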
Because that's what we need -- to cover the space. That's the most important thing. If you don't cover the space, you're just in trouble with compression. >>: There's an issue, though: the classifier performs well, but you have no idea what it's going to do off in the wacky parts of the space. >> Rich Caruana: The thing is, our goal is to mimic the model. Who cares about mimicking it in regions where it didn't learn anything interesting and will never see a test case? We only want to mimic it on the manifold, because that's the only place where it actually did anything interesting in the first place. And we want to do that also because if you ask a model to learn about a whole bunch of other regions, it takes more, right -- it takes more hidden units or whatever to do it. And it's a harder learning problem. It has to learn about a much larger part of the space, most of which is probably just sort of crazy. It's only on the manifold that it makes sense, right? If you're lucky it might just saturate in the off-manifold areas, which would at least be easy to learn, or it might do something crazy -- who knows, polynomial curve fitting with too many degrees of freedom. >>: One could argue that if you have data from everywhere, then in order for a model to replicate that, it has to have the complexity of the ensemble. >> Rich Caruana: Right, right. It wouldn't necessarily have to have the complexity of the ensemble. I think one of the reasons why these averages over many models are so effective in machine learning is because they actually yield a regularized model. They actually yield something simpler. The function at least is simpler, even though our way of representing it is more complex. It just tells you about a weakness we still have in machine learning -- that we don't know how to fix that problem other than by averaging a whole bunch of things. It depends on what we mean by complexity. Okay. So this doesn't work so well. Clearly what happens is we lose all the conditional structure of the data. But who knows, maybe it will still work for compression. Our goal isn't really to generate good samples. Our goal is to do compression. Maybe this will do it. So here's the experimental setup. We have eight problems. This is their dimensionality, 20 to 200. Most of them are fairly balanced. Only one is imbalanced, with a 3% positive class. Training set size is always around 3,000 to 5,000 points. We have test sets. Don't worry about it; I just need to tell you that. Here's the first of a number of graphs. This is squared error. This is the performance of ensemble selection, and this is the average over those eight problems. So you're going to be seeing an average graph. That's the target we want to hit ultimately. This is the performance of the best single models for those eight problems. Remember, that is actually state-of-the-art performance. You'd rarely have that performance yourself unless you'd trained 7,500 models and picked the best. That is truly excellent performance. You could be happy with that. This is just even better, right? If you can get this, that's great. And this is the performance of the best neural nets that we could train on the original data for those eight problems. So hopefully we can beat that. All right. So this is what happens. We're varying the size of this hallucinated synthetic data set that we've labeled with the ensemble -- varying it from 4K to 400K on the bottom here. And I think you can see the performance. I'll explain why we start at 4K in a second.
Performance starts out sort of comparable to a neural net. It never really does much better than that, and it actually does worse in the long run. Why does that happen? God was nice enough to give us 4,000 labeled points. We would be crazy not to include them in the training set, since ultimately they are the function we were always interested in learning, not the ensemble. So we always start with the 4,000 labeled points that we had in the training set for the compression model. It would just be stupid not to do that. And in this case what happens is that eventually you've hallucinated enough bad data and distracted the neural net enough that the 4,000 points get ganged up on. The model is just no longer able to concentrate even on that function, and it actually does worse than having not done this at all. So random data is not going to work well. So obviously -- >>: I would expect it would be even worse in higher dimensions. >> Rich Caruana: Right, worse. It's an average over all those eight problems. If we went to 2,000 dimensions, you're right, it would just be terrible. Okay. So we have to estimate the density. So let's look at this naive Bayes estimation algorithm. I'll run out of time, so let me do this very succinctly. It's a very cool method. It wants to use naive Bayes to estimate the density. We all know naive Bayes is too simple a model for anything complex. So they do this clever thing. They keep carving the space into smaller and smaller subregions. Every time they carve it, they try naive Bayes in that subregion. If naive Bayes looks accurate enough in that subregion, they don't carve it up anymore. They keep carving the space until each subregion looks like it can be adequately modeled by something as simple as naive Bayes. Now we have this sort of piecewise model: a different naive Bayes estimator in each of these different subregions. Some places get tiled with very small subregions. Some places have less conditional structure, and they get very large subregions. It's all dynamic. >>: It's for density estimation. >> Rich Caruana: It's for density estimation. >>: They did it years ago for classification. >> Rich Caruana: Yes, yes. This is for density estimation. And it's kind of clever. It's efficient, which is nice. And they did it also because they needed to generate synthetic data. So, in fact, it was just very natural for us to take their method and try it, because their goal was -- they weren't doing compression the same way we were doing it. They were trying to do something different. But ultimately they needed to generate synthetic data, and this is the method they came up with. It's a very appealing method. We tried it. We could get their code. It not only did the density estimation but it generated samples. Perfect. So we really just got to use their method. That's what we get. It's not bad. You can clearly see the ring. You can also see there are almost decision-tree-like artifacts, where they've carved the space up with axis-parallel splits and things like that. So you can almost see the way the space is carved. But it's not bad. Okay. So the two problems with it are, first, there is some artifact, and that artifact would get nastier in high dimensions -- 2-D is a nice world. And second, there are some points that are off manifold: points in the middle, a lot of points out here, where those points shouldn't be. But it's not bad, so we were hopeful this might work well. So let's try it. So now we have -- this is the random thing, and this is the performance we get when we use this naive Bayes estimator.
And it's not bad. So we now have a neural net trained with 25,000 points, 4,000 of which are the original training set, which is doing as well as the best single model that could be trained. So this neural net is now competing head to head with the best SVM, the best boosted trees, the best anything that could be trained. So that's actually good. That's perhaps already a useful result often enough. However, it does have this behavior that obviously we're not true enough to P of X, and eventually that sort of hurts on average. It's not like that for every problem. Some problems keep coming down; for other problems it goes up. I'll show you some of those results if I have time. So it's not bad. It's promising, but we wanted to do better. We had hoped for more from this method. >>: That's the estimator sort of hallucinating in areas that don't matter anymore. It seems as you increase the number of hidden units in the neural net, maybe that curve sort of goes up later. >> Rich Caruana: So we have tried varying the number of hidden units here, and it does make a difference, but it's not as big a difference as you would have thought. And that's because we do early stopping and regularization on the neural nets. So even when their capacity is large, they're not doing insane things, because we have validation sets. I mean, it's nice -- we can have as large a validation set as we can afford to label in this world. Data is suddenly available to us in copious quantities because we're making it up. So it's a nice world to be in. Okay. So we developed our own thing. It's called munging. And if you look in a dictionary, munging will be defined as imperfectly transforming information, or modifying data in such a way that you can't describe succinctly what you did. That's exactly what we wanted. No -- it turns out this is, after the fact, a good description of our algorithm. It's not that we wanted those properties. So here's the algorithm. Somebody gives you a training set and two parameters -- this parameter P and this parameter S, which I'll tell you about -- so some labeled data, the original, and it's going to return unlabeled data. Here's what you do. You make a copy of the labeled data. And now you walk through all the cases in your data set, one at a time. You find the nearest neighbor for each case using Euclidean distance -- K nearest neighbor. Find the nearest neighbor for each case. If you two would stand up -- no, you're fine. So you're two cases. You're each other's nearest neighbor. You're the first case I'm looking at, and I find you're the nearest neighbor of this case. Now, with probability P, we're going to take your attributes and maybe swap them. So we're going to maybe swap your hair color. We flip a coin; if the coin lands heads, we swap your hair color. If the coin lands heads again, we swap your shoe size. And we're going to sort of mix and match you a little bit that way. For continuous attributes like height, we'll do something a little more complex. We don't want the heights -- maybe the original set only has 50 unique heights in it, and we don't want to be restricted to those 50 unique heights. So in fact what we'll do is take your height and your height, throw a little Gaussian around them -- that's where this parameter S comes in; it controls the width of the Gaussian we throw around your two values -- and now we'll draw two new heights from those Gaussians and replace each of your heights with the new draws. But that's it.
So we just do that for all of your attributes, and what that does is create new guys like you, but where we've changed some things. And I hate to say the words genetic algorithm, but it is kind of like a genetic algorithm, except the nearest neighbor is a critical step here, and you don't see that in GAs. >>: This is a joint question Janet and I raised yesterday. But why not just jitter the data -- just sample a Gaussian around each data point? >> Rich Caruana: So we have tried that. The biggest difficulty is you have to come up with an intelligent model of the jitter, and the jitter model also has to be conditional. If you just do it univariate, I don't know that it's going to work. >>: What I was going to say is, couldn't you use the neighbor distribution? >> Rich Caruana: Oh, I see, I see. That might work. That might work. I think you'd still have to come up with these parameters, but you might be able to do it. No, no, that's interesting. >>: Kind of like John's -- the simplex sampling thing that you had. >> Rich Caruana: That's interesting. >>: Sampling along that simplex. >> Rich Caruana: That would be interesting to try. >>: Now that you've swapped them, do they (inaudible) keep the same label? >> Rich Caruana: No. Because we actually don't know what your label should be. And since it's a probability, we hope it will have changed. >>: But in those cases where I have checked munging -- >>: You're ignoring the label. >> Rich Caruana: You are ignoring the label. We have tried taking the label into account to make this process better, and every time it has either hurt us or not helped us. >>: That seems like it should be the case, because you wait until you actually run the ensemble on that data. The ensemble has an idea of where the boundaries actually are, and you want to maintain its version of where that is. >> Rich Caruana: This is just our way of sampling from the ensemble. Plus, you know, maybe you were class zero and you were class one, and now that I've mixed you, it's not clear what you should be. And there are weird things about these probabilities. Like it turns out that if we do swapping with probability one, that's the same as doing swapping with probability zero, except for the continuous attributes. So values between zero and .5 are what make sense. We use .2. So anyway, we do this. We pass through the whole data set and it comes up with a new data set. We use about .2 for that parameter. The variance parameter is not too critical; don't worry about it. You can also modify this algorithm in interesting ways. You could do a bootstrap sample from the training set, for example, so that every time you have to generate more data your nearest neighbor isn't always the same -- because maybe you won't be there the next time and your nearest neighbor -- there are lots of variations on this. But I'll just show you results for one variation with one parameter setting. So let me walk you through the algorithm here in a picture, but I think you get the idea. It's a blowup of a small region. This is the point that we're picking. We find its nearest neighbor through a distance calculation. That's its nearest neighbor. Now what we do is flip a coin and swap some attributes, and for continuous values we hallucinate some similar values. And we end up generating these two new points from those two original points. And we just go through the whole process and we keep doing that -- and let me just move on, because you guys know how this works.
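Here is a rough sketch of one munging pass as just described, in Python; it is my own reading, not the authors' code. The swap probability p and the Gaussian trick for continuous attributes come from the description above; the width I use (the spread of the two values scaled down by the control parameter s) follows the formula that comes up in the questions below, and the nearest-neighbor helper and the choice to center each new draw on the neighbor's value are assumptions of mine.

```python
# Sketch of one munging pass (my own reading of the description, not the
# authors' code).  X: the original labeled cases with the labels dropped,
# p: swap probability (~0.2), s: control parameter for the Gaussian width.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def munge_pass(X, p=0.2, s=1.0, continuous=None, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    continuous = set(range(d)) if continuous is None else set(continuous)
    new = X.copy()

    # Nearest neighbor of every case under Euclidean distance; the first
    # neighbor returned is the case itself, so take the second.
    _, idx = NearestNeighbors(n_neighbors=2).fit(X).kneighbors(X)

    for i in range(n):
        j = idx[i, 1]
        for a in range(d):
            if rng.random() >= p:      # leave this attribute alone with prob 1-p
                continue
            if a in continuous:
                # Draw new values from Gaussians around the pair; width is the
                # spread of the two values scaled down by s.
                sd = abs(X[i, a] - X[j, a]) / s
                new[i, a] = rng.normal(X[j, a], sd)
                new[j, a] = rng.normal(X[i, a], sd)
            else:
                # Discrete attribute: simply swap the two values.
                new[i, a], new[j, a] = X[j, a], X[i, a]
    return new
```

In practice you would repeat passes, possibly over bootstrap samples as mentioned above, until you have as much pseudo data as you want, and then label all of it with the ensemble.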
>>: Does the success of munging rely on the density of the original data? >> Rich Caruana: Yes, and especially in high dimensions. Simple things like Euclidean distance probably aren't going to be adequate. In some sense you have low density in high dimensions unless the world is very nice to you. So I'm sure this process is going to break down in very high dimensions. It's doing well, though, up to 200 dimensions in our test problems so far. It's not critically sensitive to this, but that is an important issue. You can use another kernel if you want for finding who the neighbors are. If you have some smart, even learned, kernel, you can use that for your neighborhood function instead of something simple like we're doing, to try to mitigate this high-dimensionality issue again. Remember, that's the kind of sample we're hoping for, and that's the kind of sample we get. >>: How sensitive was it to those parameters? >> Rich Caruana: It's not that sensitive. We find a swap probability of .1 to .3 works pretty well; I think we're using .2 in these experiments. And for the standard deviation for the continuous attributes, the only problem is that if we set it too large, then you really do expand the manifold too much. But there's not much penalty for setting it too small; you just don't hallucinate as many new values. >>: You impose the variance -- you don't just measure the variance from the distance between the two samples? >> Rich Caruana: I didn't go through the formula. Actually, what we do is: we have two samples, we calculate the standard deviation from those two samples, and then we divide by our control parameter S, which just lets us ramp the standard deviation up or down and say only use half of the standard deviation, or use three times the standard deviation. So it just prevents things from going too far into the tails and getting off manifold. It could be sensitive on some problems, but it hasn't been on the limited number of problems we've looked at. It does pretty well. So those are samples generated by the munging procedure. You can see we get an occasional thread-like structure, and in some cases they're where excursions existed in the real data, but not always. So interesting things happen, but it's pretty good. So just to compare: this is univariate random, which is not very good. This is naive Bayes estimation, which is pretty nice. And that's what we're getting; I think you can see that's definitely better. And this is just plotting them on top of each other. This is the goal -- this is the real density. Remember, we want the real density plus epsilon. And I think you can see there's a little blue for munging sticking out on the edges, then the green from naive Bayes estimation sticking out even more and off sample, and then the red, of course -- we never thought that was going to be good. The question is, does it work? Okay. So it's a deceptively simple algorithm, but it's pretty effective. In fact, I came up with this algorithm originally to do privacy-preserving data mining. We had real medical data. I wanted people to be able to do machine learning on it, preserving the real properties of the data, but I couldn't give them the real data. So we munged the data to sort of obfuscate it, and it worked reasonably well. It never explicitly models P of X. Unlike other methods you might imagine, it's more that the data is the model, as in K nearest neighbor.
And a critical thing is this nearest neighbor step. That's what preserves the conditional structure. The fact that we're only swapping values among nearest neighbors is what means the conditional structure tends to be preserved. If you didn't do that -- if we just swapped values between things that were very different -- all the conditional structure would be gone. So that's the important thing. Well, how well does it work? I need to wrap up here. That's the graph we saw before, and that's the graph we get with munging. So I think you can see that on average across these problems, as we start getting out to a couple hundred thousand pseudo-labeled data points, we're actually doing quite well. We're doing better than the best single model, and we're approaching the performance of that super ensemble. So we're getting there. That's very nice. Let me just walk you through some graphs. This is one particular problem. Random doesn't do very well. Naive Bayes estimation is not so bad, but we do much better -- don't you have to leave, too, now? [Laughter] But you can see that munging is doing very well. By the time you get out to a couple hundred thousand points, in fact, it's equal in performance to the target ensemble. >>: What's P1? >> Rich Caruana: I'll tell you about this in a second. Yeah, good. Here's a different data set. This is really, really nice. You love it when this happens. I mean, munging is working very well. It's better than the ensemble. We are actually training a neural net that is somewhat better than the target ensemble. That's great. The neural net is sort of an additional regularizer on top of the ensemble, so you shouldn't be too surprised. It could be that this ensemble wasn't so big in the first place. It might have had some rough edges, and maybe the neural net is really doing a good job -- >>: Same training set as (inaudible), I guess, for the labels? >> Rich Caruana: Right. So the ensemble is trained on the original 4,000 labeled points, and then this is trained on that 4,000 plus a certain amount of hallucinated points. This is great. Sometimes this happens, and that's wonderful. Now it's really a super, super neural net. It's better than even this other thing, which is really good. >>: What if you took one ensemble trained on a smaller training set, used that ensemble to label more hallucinated data points, and built another ensemble? >> Rich Caruana: Oh, oh, that's interesting. Try to lift yourself up by your own ensemble straps or something. No, that's very interesting. Wow. I'm having trouble getting my head around what that would do. My guess is it would eventually hallucinate and go south. But there is a chance that it would just keep self-regularizing in an interesting way. I have to think about that one. >>: I think there's an AAAI or (inaudible) paper that comes down to something very similar -- for small data sets, to boost diversity. >> Rich Caruana: That's right. So this is University of Texas, Ray Mooney and -- >>: (Inaudible). >> Rich Caruana: Thank you. I think I have it in the related work. There they use random data. They don't use a sophisticated way of generating random data, but they generate this extra data to create diverse ensembles and take the average of the more diverse ensembles. So the random data forces diversity, and that extra diversity pays off when you make the super ensemble, and they get better performance. There's a chance this would have -- I like that question.
>>: It sounds like, if you were to try to prove something about the success of munging, it would probably involve assuming that the space the data comes from can be well modeled by locally smooth rectangular patches. >> Rich Caruana: Right. >>: Basically the probability distribution is kind of locally exchangeable. >> Rich Caruana: Uh-huh, I think you're absolutely right. Fernando Pereira saw this at a poster session a couple of years ago. He said it's really interesting, it seems to work really well, but, man, you don't know why it works. What could you prove about it? And that's one of our future directions -- it would be nice to have some idea of the conditions under which this is most likely to work, and under which it would fail. >>: (Inaudible). >> Rich Caruana: If nearest neighbor is really -- hard to say. Because our goal -- >>: Interleaving two classes. >> Rich Caruana: Yeah. As long as the ensemble has learned the two classes, our goal is ultimately just to mimic the ensemble. So as long as the ensemble could do the two classes, it's okay if the density estimation is not perfect. It doesn't depend -- we want the density estimation to be as good as possible, because then everything else is more efficient and we get better compression. But it doesn't have to be perfect. The ensemble, though, has to be as perfect as possible, because it's ultimately the target. Okay. I'd better wrap up. So let me -- anyway, sometimes we actually do better. These are average results over the problems by number of hidden units. I think you can see we do need a modest number of hidden units on average -- 128, 256 -- to be able to do this compression. Here's what happens if we look at these problems again, and this is where I'll tell you what these are. Letter P1: it turns out 8 hidden units is almost enough. Letter P2: it turns out 128 still isn't enough. You need a lot. What's the difference between these? Letter P1, we're just trying to distinguish O from any other letter in the alphabet. Letter P2 is a much harder problem: distinguish the first half of the alphabet from the second half of the alphabet. That's a hard problem. >>: Is this OCR, that people don't know -- >> Rich Caruana: This is UC Irvine. This is a much harder problem. And interestingly, this thing tells us, oh, I need a much bigger neural net to learn the function that the ensemble has learned. So that's kind of interesting. This might suggest that there's a way here of coming up with some crude measure of the intrinsic complexity of a function or of a data set, by just saying, oh, it was a 10-hidden-unit problem -- it wasn't an issue. Notice, you couldn't tell it was a 10-hidden-unit problem using the original data, because then there are all sorts of overfitting issues, and sometimes big nets overfit less, and that's complex. But here you can talk about overfitting in a controlled way, because you have this very large data set. So that's kind of interesting. It doesn't always work perfectly, and then I will stop. Here's the tree cover type problem. You can see that even with 128 hidden units, although we're going downhill, we need a much bigger network. The compression is not going to be as good on this problem if we need a thousand or 2,000 hidden units. It's no longer a tiny model. >>: The first two classes? The two big classes? >> Rich Caruana: I think we took the big class versus all other classes. >>: Okay. >> Rich Caruana: Good question. I think it's a seven-class problem.
So we converted it to binary by doing the major class versus all other classes. Thanks. Let me just show you this. See this gray line -- this is the line you saw before for munging for this data set. We need a lot of data. We not only need large networks, but we need 400,000, 800,000, millions of points. This gray line is a large data set. We can actually take a lot of the real labeled data, throw away the labels, treat it as unlabeled data, and then label it with the ensemble, and we can ask what would happen if we had real unlabeled data from the right distribution. And notice it is significantly better than when we have to create synthetic data. If somebody gives you real unlabeled data from the right distribution, you're almost always going to get significantly better performance by using it. Now, maybe we'll eventually get to the same place by creating 10 times as much synthetic data. But boy, if you have data from the right distribution, it's a good thing. And I think I should wrap up. Here's a problem that's just -- I'll end with this last problem. This problem is annoying. This is also a UC Irvine problem. And you can see that although we do marginally better than the best neural net on this problem, we don't do anywhere near as well as the best single model, which wasn't a neural net, or as well as the ensemble. Notice there's a big gap between the best neural net and these other things. This seems to be a problem for which neural nets do not do well. It's neural-net hard, in some sense. And that's interesting. So we're interested in -- >>: (Inaudible). >> Rich Caruana: Yeah, but some of the other data sets are very noisy, too, and the neural nets actually do very well. I think it's also sparse -- one way of coding this creates pretty sparse data -- but the truth is, neural nets do quite well in very high dimensions where the data is always sparse. I don't know what it is about this problem. But it's interesting that we may be able to find problems that are just sort of one-hidden-layer-neural-net hard. For some reason it doesn't work well. And this is one of those cases where you might really need to use an RBF SVM to do well on the problem. We might have to use something else. >>: Given that your target is the ensemble's performance, though, you could try one of every single kind of model, right? As long as the model performance is there. >> Rich Caruana: Right. You could try everything and see what gave you the best compression for the performance. >>: Because that happens to be the best over everything. >> Rich Caruana: Yes, you're absolutely right. You could imagine a nice 2-D plot where one axis was the compression or speed-up and the other axis was the accuracy preserved, and you could pick wherever you wanted on the trade-off curve, right. Whoops, I didn't mean to do that. Let me just summarize the results, and then I have four seconds. Here are our problems. This is the squared loss of the ensembles. Let's just look at the average here and speed this all up. We're able to preserve on average -- so how do I measure this 97%? I take the squared error of the best neural nets we could train, and then I take the squared error of the ensemble, which is our target. And then I ask, when we compress, how much of that improvement do we capture? 97% means that we get 97% of the way from the best neural net to the ensemble model. So that's a much stricter way of measuring than, say, starting from baseline performance and asking how much of the gap we capture.
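To make that concrete, the "how much of the improvement do we capture" number is just the fraction of the gap between the best plain neural net and the target ensemble that the mimic model closes. A tiny illustration with made-up error values (the function name is mine):

```python
# Fraction of the improvement captured, as just described: how far the mimic
# model gets from the best plain neural net toward the target ensemble, in
# squared error (lower is better).  The numbers are made up for illustration.
def improvement_captured(err_best_nn, err_ensemble, err_mimic):
    return (err_best_nn - err_mimic) / (err_best_nn - err_ensemble)

print(improvement_captured(err_best_nn=0.20, err_ensemble=0.15, err_mimic=0.1515))
# about 0.97, i.e. 97% of the way from the best neural net to the ensemble
```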
If instead we measured from baseline performance, we would capture 99.99%. We're actually starting from a good model, a good neural net, and saying the mimic has to be much better than that to count as 97%. So we actually capture 97% of the accuracy of the target models. That's only going out to, I think, 128 hidden units. We know on one or two of the problems we would do better with more. And that's only going up to 400K training points. So how about the compression? Since that's the goal. Well, here's the size of the models in megabytes. The ensembles on average are half a gigabyte. The neural nets that you could train just on the original data are quite tiny. The neural nets that we train to mimic this ensemble are bigger on average, as you'd expect them to be. This is a little bit of Breiman's constant sort of coming in there. They're bigger on average than these humbler neural nets, but not that much bigger. And on average we're getting a compression factor of about 2,000 for the neural nets compared to the ensemble. But the ensemble could be anything. It doesn't have to be our ensemble selection. It could just be any high-performing model that you have lying around. So this comparison is almost not so interesting. What I find interesting is this comparison, right? Which is that you can think of this as .1 and .3. By making the neural net just three times larger than you would have naturally made it, we get all this extra performance. That's really what counts, because who knows what your target might have been. For three times the complexity we get a great neural net out of the thing. In terms of speed, we get similar numbers. This is the time to classify 10K examples, so this is the number of seconds. The ensembles are quite slow -- half a second per case. The neural net is very speedy, two seconds to do 10,000, and you could make that much faster if you wanted to; this is just a trivial implementation. We're just a little bit slower because we're a little bit bigger, as you'd expect. Not that much slower, and on average we're speeding things up by a factor of a thousand. That's what we ultimately need. But, again, this is probably the more interesting comparison: we get all that performance by only making a neural net about two or three times bigger and slower than it would have been had you trained it naturally. That's what I like. Here's a summary. I think you know that. So let me just -- I think you already know why it works. I think all of our discussion has been about why it works. I just want to say clearly that the fact that you have a large ensemble does not mean you have a complex function. In fact, as we said earlier, often these ensembles are reducing complexity, not increasing complexity. So I probably should skip this. I think I probably should skip it, unless you want to stay for three or four more minutes. All this work has been done with squared loss. Remember, those big tables had AUC and log loss and accuracy and F-score and all those things. Everything I've just shown you was just with squared loss. So an interesting question is, well, hey, if we compress an ensemble trained to optimize AUC, and train the neural net with squared loss to do that, will it turn out to be good on AUC? We typically see what I call the funnel of high performance. This is from a different set of work we've done. This is accuracy, so high accuracy is over here. High ROC area is over here. Anybody who has to go, please go.
And we often see this kind of behavior, which is that once you get to very high performance on a problem -- this is two different metrics, just a scatter plot of two different metrics, lots of different models, neural nets, bagged trees, boosted trees, falling wherever they fall -- once you get to very high performance, there's very little spread. If you have great AUC, well, then you can have great accuracy by finding a good threshold. You can have good probabilistic prediction by doing something like Platt scaling. Basically, if you have great AUC, you can do well on almost any other metric. Any reasonable metric is going to be great if you have good AUC. And that's true for most of these metrics. So for all the metrics, when you get to the very high-performing end, the funnel narrows. So to the extent that we can get our target ensemble into the high-performing end, then I'm not too worried. It will turn out the neural net can mimic it, and it can be good on any metric, modulo some small epsilon. However, on hard problems, for which we can't hit that high-performing asymptote -- and face it, a lot of problems fall out here -- it's an open question whether training a neural net to minimize squared loss will do it. Ultimately, if we have enough data and we exactly mimic the real values that the ensemble is predicting, then we'll be equally good on all metrics, because once you hit the target function, you've got it. But there really is a question of how hard you have to work to get this to be as effective with other metrics. And I won't mention all these other things. Active learning would be fun. Better ways of doing density estimation. When should we be using deep nets or some other model, like pruned SVMs, to do the compression instead of this? Should we calibrate the models before we do the compression -- should we take the ensemble and calibrate it, or should we train the neural net on the uncalibrated thing and then calibrate the neural net afterwards, or should we calibrate twice? I mean, there are all sorts of funny extra questions. Does this all fall apart in high dimensions? Let me just mention one piece of related work. So the early work of Craven and Shavlik, and I think Towell (phonetic) did some of this even before then -- what were they doing? They had neural nets, which were then the highest-performing models I think we knew about on average, but they weren't intelligible. So they were creating synthetic data to pass through the neural net to then train a decision tree to mimic the neural net, so they could understand the decision tree and thereby understand the neural net. And I think it's just ironic that in some sense we're doing the opposite. We're taking these ensembles of decision trees, or ensembles of lots of things, and we're going the other way and trying to put them into a neural net in order to get a compact, high-performing model. We're losing intelligibility in the process, but I just think it's funny that they were doing something so similar 15 years ago. And I really should stop. So thank you, and I apologize. (Applause)