>> Xiaodong He: So let's start. So, everyone, it's my great pleasure to introduce
Professor Geoff Webb. Geoff is a professor in the Faculty of Information
Technology at Monash University, Australia, where he heads the Centre for
Research in Intelligent Systems.
His primary research areas are machine learning, data mining, and user modeling.
He's editor-in-chief of Data Mining and Knowledge Discovery, co-editor of the
Springer Encyclopedia of Machine Learning, a member of the advisory board of
Statistical Analysis and Data Mining, a member of the editorial board of the
journal Machine Learning, and was a foundation member of the editorial board of
ACM Transactions on Knowledge Discovery from Data.
He was a co-PC chair of the 2010 IEEE International Conference on Data Mining
and a co-General chair of the 2012 IEEE International Conference on Data
Mining, and received the 2013 IEEE ICDM Service Award.
So today Professor Geoff will talk about large-scale Bayesian network
[inaudible]. Now I'm glad to have Professor Geoff here.
>> Geoff Webb: Thank you for those kind words. Oh, how am I supposed to get
this out of the way? Okay. So ignoring the online audience of millions, we're a
small group here. So please feel free to interrupt as we go along. Happy to
expand on detail.
So what I'm talking about is a continuation of the research I've been doing for a
number of years into essentially extensions [inaudible], so how to take the great
scalability of Naive Bayes and expand upon it.
And we moved recently on to issues of how to best learn from large data. I'm
sure I don't need to explain to any of you the ever-increasing need for learning
from ever larger quantities of data.
But most people's response to the problem of learning from large data appears to
be how do we take our existing algorithms and scale them up. And my argument
is that learning best from large data is not a question of how do we make existing
algorithms best cope with the computational challenges of large data but rather
that we need fundamentally new algorithms.
So let me start with giving you a very simple overview of why I believe this. I'm
sure we're all very familiar with the idea of a learning curve. So as data quantity
increases, error typically starts at a high level, drops, and at some point asymptotes.
So you can take any learner and plot a learning curve. And, of course, different
algorithms are going to have different learning curves.
And in particular we will see that algorithms that are able to provide very detailed
descriptions of complex multivariate distributions are going to tend to overfit
when data quantities are small. So they're going to be outperformed by
low-variance algorithms for small data, but they're going to tend to
perform better on large data.
>> Xiaodong He: So this RMSE, is that training data or is that test data?
>> Geoff Webb: So this is on test data.
>> Xiaodong He: I see.
>> Geoff Webb: So this has been done on learning curves where we take -- so we
hold out some data and then we do ever increasing training set size. But all these
examples, I'm going to show a couple of these curves, and they've all been taken
on the Poker Hand dataset because it provides very nice illustrations. The
curve is not always as smooth as this, but the same thing
holds for many, many different types of datasets. The process is fairly well
understood as the bias-variance tradeoff.
Now, most machine learning research has been conducted with things like the
UCI repository, where the majority of datasets have no more than 10,000, in
fact often fewer than 1,000, examples. So they're way, way, way down the bottom
end of these learning curves. So algorithms that are going to be good here look
extremely bad in the space where most machine learning research has been
conducted.
So what we need is algorithms that can closely fit complex multivariate
distributions. But the majority of machine learning research, I believe, has just
ignored these because they actually look very bad in the space where the majority
of research has been done. We need algorithms that are both low bias but are also
very computationally efficient.
All right. So we have the low bias. Being computationally efficient means you
cannot spend a lot of time on each training example. Why not? Because if you've
got large numbers of them, then multiplying a large number of examples by a
large amount of time per example simply takes too much time. And at some level of scalability, if
you're going to use all the data, you have to be processing out of core.
Now, most existing -- so most state-of-the-art low-bias algorithms do not scale.
I'm sure you're all familiar with this. And typical low-bias algorithms, state of the
art, typical examples are random forest, support vector machines, and neural
networks. These are all inherently in core. You need to look at every example
many times. So you cannot process them out of core.
So what I'm going to talk about today is the selective KDB classifier which is a
scalable low-bias Bayesian Network Classifier.
The structure of the talk is the introduction, which I've just given. I'll talk very
briefly about Bayesian Network Classifiers. I think most of you should be pretty
familiar with the background, but I'll show you the technology I'm using and talk
about the new algorithm, give you some experimental results, of course, being an
experimentalist, and then talk about what next.
So Bayesian network classifiers. What are they? Well, they are defined by two
things. Parent relation, so what links are we going to include in the classifier, and
then the Conditional Probability Tables, which provide the conditional
probabilities that are required.
Then to classify we use the posterior probability of each class, which is proportional
to the joint probability of the class and the set of attribute values: the
probability of the class given its parents, times the product of the probabilities of
the X values given their parents.
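In symbols (my notation, not the speaker's), with class y, attribute values x_1 through x_a, and pi_i the parents assigned to X_i by the parent relation:

\[
P(y \mid x_1, \ldots, x_a) \;\propto\; P(y, x_1, \ldots, x_a)
\;=\; P(y \mid \pi_Y) \prod_{i=1}^{a} P(x_i \mid \pi_i).
\]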
And usually what we do in a Bayesian Network Classifier is we make the class a
parent of all of the other variables. And I have some illustration of a Naive Bayes
classifier, which I'm sure you're all familiar with.
Now, one of the very nice things about these types of classifiers is once you've
selected your parent relation, once you've got the structure, you can learn the
Conditional Probability Tables very quickly by just accumulating the
joint counts that underlie them.
So you know which joint frequency counts you need, and you can collect them in a
single pass through the data, and also in an incremental manner. So a very nice thing is that
once you've got the structure, you can just keep refining the classifier by
updating the counts. You never need to relearn.
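To make that concrete, here is a minimal sketch of count-based, incremental CPT learning. This is my own illustration under assumed names, not the speaker's implementation:

```python
from collections import Counter

class CountTables:
    """Joint-count tables for a fixed Bayesian network classifier structure.

    parents[i] lists the indices of attribute i's parents among the other
    attributes (the class is implicitly a parent of every attribute).
    """

    def __init__(self, parents):
        self.parents = parents
        self.class_counts = Counter()
        self.joint_counts = [Counter() for _ in parents]  # one table per attribute

    def update(self, x, y, delta=1):
        """Add (or, with delta=-1, remove) one example's contribution."""
        self.class_counts[y] += delta
        for i, pa in enumerate(self.parents):
            key = (x[i], y) + tuple(x[j] for j in pa)
            self.joint_counts[i][key] += delta
```

Streaming more data just means calling `update` again; the structure never has to be relearned.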
So the particular Bayesian Network Classifier that we're working on -- I'm talking
about here is the k-Dependence Bayes. So this is a reasonably old one, mid '90s, I
think; Mehran Sahami developed it. It requires two-pass
learning.
So the first pass develops the structure you're going to use; the second pass
simply fills in the counts. And in that first pass you collect the counts necessary for
both the mutual information between each attribute and the class and the
conditional mutual information between each pair of attributes given the class.
And you then order all the attributes, so you sort the attributes into an order based
on their mutual information with the class, highest mutual information first, so
the most informative are to the left and the less informative to the right.
And then you go through the attributes in turn selecting parents. The class is a
parent of every attribute, and the remaining parents have to be selected from the
earlier attributes, up to K of them, where K is a user-defined parameter.
You select the K earlier attributes that have the highest
conditional mutual information.
So if we're looking at the parents for X4, we look at which of X1, X2, or X3 have the
highest mutual information with X4 conditioned on the class.
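A rough sketch of that structure-learning pass, assuming the mutual information values have already been computed from the collected counts (the function name and argument layout are mine, not the authors'):

```python
def kdb_structure(mutual_info, cond_mutual_info, k):
    """KDB-style parent selection from precomputed information measures.

    mutual_info[i]         : I(X_i ; Y)
    cond_mutual_info[i][j] : I(X_i ; X_j | Y)
    Returns {attribute index: list of parent attribute indices}; the class is
    implicitly a parent of every attribute.
    """
    # Sort attributes by mutual information with the class, most informative first.
    order = sorted(range(len(mutual_info)), key=lambda i: mutual_info[i], reverse=True)

    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]  # only earlier (more informative) attributes are candidates
        # Take up to k earlier attributes with the highest CMI with X_i given the class.
        parents[i] = sorted(earlier, key=lambda j: cond_mutual_info[i][j], reverse=True)[:k]
    return parents
```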
>>: So -- sorry. So [inaudible] X4 for X5 is because the [inaudible] information
[inaudible]?
>> Geoff Webb: So what I've illustrated here is KDB with k equals 2. So every
attribute can have -- has the class as a parent and at most two of the prior
attributes. So there's no prior attributes here. There's only one, so it has to have
this. There's only two, so it has to have both of them. Here there was a choice of
two of these three, and it took the two that had the highest mutual information
with X4 conditioned on Y. So that didn't include X1. And here we had a choice
of 2, and we've left out X4 and X2.
>>: Okay. So that's just on your first pass.
>> Geoff Webb: That's done in -- right. So the first pass collects the two-way
tables of the joint frequency of each attribute value and each
class value to work out the mutual information with the class and then the
three-way joint frequencies of each pair of attribute values and each class value in
order to work out the conditional mutual information between each pair of
attributes conditioned on the class.
>>: Remind me how you deal with reals.
>> Geoff Webb: With reals. So with all of this we're doing discretization. So the
discretization we're going to use is equal-frequency, five-bin discretization.
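For reference, equal-frequency binning just chooses cut points so that each bin receives roughly the same number of training values. A minimal sketch of that idea, using numpy and names of my own choosing:

```python
import numpy as np

def equal_frequency_cut_points(values, n_bins=5):
    """Cut points splitting `values` into n_bins bins of roughly equal count."""
    interior = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantiles
    return np.quantile(values, interior)

def discretize(value, cut_points):
    """Map a numeric value to a bin index in 0 .. len(cut_points)."""
    return int(np.searchsorted(cut_points, value, side="right"))
```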
>>: Okay.
>> Geoff Webb: Okay? So one pass learns structure, second pass learns
Conditional Probability Tables.
So this is quite a nice algorithm from many points of view. So it has this training space complexity.
So the first term there is the size of the joint tables for the conditional mutual
information, so each pair of attributes, each pair of As. So Y is the number of
classes, A the number of attributes, V is the average number of values per
attribute.
And then the second term is the complexity of going -- sorry, this is training
space, so the second one is the size of the count tables that you learn
in the second pass. All right? So you have to go through, for each attribute, all
the combinations of values for it, the class, and all of the attributes that are its
parents.
At classification time you only need that second part; you abandon the first part.
Training time: in the first pass you've got the collection of the tables and the
calculation of the conditional mutual information, and then you've got the
collection of the count tables in the second pass.
And depending on the type of data you're dealing with, any of these three terms
might dominate.
We can see that it's linear with respect to the data quantity, which is very nice.
And classification time is quite fast: for each class
and each attribute, you descend a tree over K values to find
the correct count.
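As a rough summary of the terms just described, in the notation on the slides (N examples, A attributes, V values per attribute on average, Y classes, K parents), my reading of the complexities is approximately:

\[
\text{training space} \approx O\!\left(Y A^{2} V^{2} + Y A V^{K+1}\right), \qquad
\text{classification space} \approx O\!\left(Y A V^{K+1}\right),
\]
\[
\text{training time} \approx O\!\left(N A^{2} + Y A^{2} V^{2} + N A K\right), \qquad
\text{classification time per example} \approx O\!\left(Y A K\right).
\]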
Now, an interesting thing about KDB is you've got this variable K, and it very
clearly controls a bias-variance tradeoff. So in my first plot I was actually -- my
first set of learning curves, the two algorithms were KDB with k equals 2 and k
equals 5. And here I've plotted KDB with k equals 0, which is Naive Bayes
through to KDB k equals 1, 2, 3, 4, and 5. We can see that 5 is still a long way
from asymptoting.
Okay. So we've got a nice way of controlling bias and variance, but we've got a
problem that we don't know what is the right value of this for any particular
dataset. The only way we can work this out is by actually trying to --
>>: Didn't you say [inaudible] 4 and 5 flipped?
>> Geoff Webb: So -- so 4 is getting close to asymptoting; 5 is going to
asymptote out here somewhere.
>>: Oh, okay. Looks [inaudible] hard to read.
>> Geoff Webb: And so this -- 4 is here, 5 is here.
>>: So you [inaudible].
>> Geoff Webb: Yep.
>>: [inaudible].
>> Geoff Webb: Yep. So it's very clear that if you have enough data, a high
value of K can never hurt. Because all you're doing is adding additional links.
And if you've got enough -- enough data, then even if they're irrelevant, they'll
just -- the data will tell you that and will factor itself out.
>>: Okay.
>> Geoff Webb: Okay? So adding K only hurts insofar as the probability
estimates that you develop from the data are inaccurate. And the higher the K, the
less data each of those estimates is taken from, so the less accurate they'll be
unless you have enough data. So somewhere here this has to at least join this
one, if you just get enough data.
Okay. So there's no way to select K for any given dataset in advance. But the
attribute independence assumption might actually be correct, so 0 might
actually always be the best value. The other ones will eventually asymptote to
this, but they can start off however they like. So we don't know that from that point --
>>: What dataset are you showing here?
>> Geoff Webb: So this is the Poker Hand dataset. But the same thing has to be
true for any dataset. Because the additional links are only harmful insofar as the
probability estimates are inaccurate because you don't have enough data. So if
you go to infinity, the higher K is always going to, in the worst case, asymptote
to the same value as a lower K, and may asymptote to a better one because it can
describe a wider range of distributions.
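Put slightly more formally (my paraphrase, with M_K standing for the set of distributions representable by KDB with at most K parents per attribute): because the model classes are nested, the infinite-data error can only stay the same or improve as K grows,

\[
\mathcal{M}_{K} \subseteq \mathcal{M}_{K+1}
\quad\Longrightarrow\quad
\lim_{N \to \infty} \operatorname{err}_{K+1}(N) \;\le\; \lim_{N \to \infty} \operatorname{err}_{K}(N).
\]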
And also --
>>: There's no -- there's no attempt to reduce the variance? Like your table
grows exponentially with K, right?
>> Geoff Webb: Yep.
>>: [inaudible].
>> Geoff Webb: Yep.
>>: And so there may still be events that you haven't seen enough, you're not
attempting to reduce the variance of -- what do you do with zeros or things like
that?
>> Geoff Webb: So we -- we're doing an M estimate to smooth off the estimate.
>>: [inaudible].
>> Geoff Webb: Yep. We're doing an M estimate on all of the
probability estimates that we develop. Spurious attributes may also increase the
error. And, again, if there's enough data, they wouldn't matter. But if there's not
enough data, they're going to introduce a slight amount of noise. And, again,
we've got no way of selecting that.
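For the m-estimate smoothing mentioned a moment ago, here is the usual form of the estimate as a sketch; the value of m and the uniform prior are my assumptions, not necessarily the settings used in this work:

```python
def m_estimate(count, parent_count, n_values, m=1.0):
    """Smoothed estimate of P(x | parents): never exactly zero for unseen events.

    count        : joint count of this value with its parent configuration
    parent_count : count of the parent configuration alone
    n_values     : number of possible values of the attribute (uniform prior 1/n_values)
    """
    return (count + m / n_values) / (parent_count + m)
```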
Now, an interesting observation about the KDB classifier is that a full KDB
classifier actually embeds a whole lot of simpler classifiers. So for any M(i,j),
where i is a value for K and j is a number of attributes -- we're
ordering the attributes here by mutual information, because that's the
way in which the classifier is developed -- any M(i,j) is a minor extension of
M(i-1,j) and of M(i,j-1).
So, for example, with the KDB classifier we illustrated earlier, you can simply
take out one of the level of links. And here we have the KDB k equals 1 where
we've just got for each attribute the parents with the highest mutual information.
And you can see, I think, that to have encoded the full one, we have to have also
encoded the lower level one. Or we could just take out the last attribute. And,
again, to have encoded the full one, we have to have also encoded this one.
So we've actually -- in forming the full classifier, we've actually also created all of
these other classifiers.
So the very simple trick that we're going to do is in one more pass through the
data select between all these classifiers. And we're going to do it using
leave-one-out cross-validation.
So the full model subsumes K times A submodels. Right? So A is the number of
attributes, K is the value of -- actually, it's (K plus 1) times A, because you've also
got k equals 0, the Naive Bayes model. So each of these is a very powerful
model, and we're going to very efficiently select between a large class of strong
models.
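As a rough sketch of that selection pass (my own illustration, not the authors' code; `loo_losses` is a placeholder for the per-example step, described shortly, that removes the example's own counts, scores it under every nested submodel at once, and restores the counts):

```python
def select_submodel(data, full_model, max_k, n_attributes):
    """Choose the (k, number-of-attributes) pair with the lowest leave-one-out loss."""
    losses = {(k, a): 0.0
              for k in range(max_k + 1)           # k = 0 is Naive Bayes
              for a in range(1, n_attributes + 1)}
    for x, y in data:                             # one extra pass through the data
        # Assumed interface: loss of held-out (x, y) under every nested (k, a) submodel.
        for key, loss in full_model.loo_losses(x, y).items():
            losses[key] += loss
    return min(losses, key=losses.get)            # best (k, number of attributes)
```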
>>: [inaudible] leave-one-out only in phase two?
>> Geoff Webb: No, we're going to do it in phase three.
>>: No, no, but how does phase one and two run?
>> Geoff Webb: Phase one and two is learning full KDB model. Phase three is
now going to be --
>>: You're using all the data?
>> Geoff Webb: Yep.
>>: So the leave-one-out is a little bit of a cheat, right?
>> Geoff Webb: No. Why?
>>: Because you've used all the data to learn the structure.
>> Geoff Webb: But when we do leave-one-out, we're going to leave it out in
order to -- or to learn the structure. So it depends what you think is being cheated.
So possibly. But we're looking at very large data. And we're not claiming to
learn exactly the same --
>>: But it's not equivalent --
>> Geoff Webb: -- as you would --
>>: -- of learning [inaudible].
>> Geoff Webb: It's not equivalent --
>>: But it's close. You're [inaudible] fudge.
>> Geoff Webb: Yeah. Well --
>>: [inaudible].
>> Geoff Webb: -- I'm not really saying that it's --
>>: [inaudible].
>> Geoff Webb: Yeah. It's not really a fudge. I'm not claiming that it's exactly
the same as if you did leave-one-out cross-validation, setting a different value of
the number of attributes and a different value of K and then compared all the
results. I'm not claiming that it's equivalent to that.
So why leave-one-out cross-validation? Because leave-one-out cross-validation is
a very low bias estimator of out-of-sample performance and because Pazzani's
trick makes it extremely efficient for Bayesian Network Classifiers.
What's Pazzani's trick? Pazzani's trick is you collect the count tables that you
need to estimate the probabilities, and then when you come to classify an example,
you simply subtract it from the count tables. So very efficiently you can perform
the leave-one-out. You don't need to learn a new model each time.
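A minimal sketch of Pazzani's trick on its own, reusing the hypothetical count-table interface from the earlier sketch (again illustrative, not the actual implementation):

```python
def leave_one_out_prediction(tables, x, y, classify):
    """Score a training example as if it had never been seen."""
    tables.update(x, y, delta=-1)     # subtract the example's own counts
    prediction = classify(tables, x)  # classify with it left out
    tables.update(x, y, delta=+1)     # restore the counts
    return prediction
```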
So because the full model subsumes all the other models, to evaluate all the
submodels actually doesn't take all that much more computation than just
evaluating the full model in this way. So it's very, very efficient.
So the resulting complexity, well, the space is exactly the same because the full
model encodes all the models that we're dealing with, so we require no more
space. At classification time, we can actually save space because we now might
have smaller values of A and K, so we can drop the unneeded parts out.
And the complexity -- the only increase is this final additional pass, which is as
previously -- so previously we had the number of examples times number of
attributes times the value of K, because we went through doing the counts.
Now we're also going to have to look at each class in the final pass to do the
evaluation. So it's a slight increase in the complexity. Often in practice one of
these two -- sometimes actually just compiling the conditional mutual information
will still dominate the training time. So this is not the dominant term.
And classification time, again, we may actually be faster because we've dropped
parts out of the model.
Okay. So this is back to our curves for KDB with k equals 0, 1, 2, 3, 4, 5. This is what
happens if we do our trick but we don't select the number
of attributes -- so we're using the full set of attributes, we're just selecting the appropriate
value of K. You can see we've done very well. We actually improve a little bit
upon the best K because sometimes a lower K is performing better, perhaps. So
this is the average over 10 runs; some runs one did better than the other, and
it's probably selected the wrong one.
Here we're overfitting and we jump to 4 earlier than we ideally would. And quite
badly overfitting -- something that we want to do something about, and something
that I'm sure has maybe given us a good pointer as to what we should do.
Here is if we just take KDB with k equals 5 and select attributes without selecting K. So you
can see we're substantially improving upon the version without attribute selection.
And here is where we do both together. So we're not doing much in the way of
attribute selection with small amounts of data, and then we have this overfitting
effect where we're jumping to -- so it happens that with the attribute selection k
equals 4 and k equals 5 both track along the same lines, so they're both fairly
equivalent.
>>: [inaudible].
>> Geoff Webb: Yeah. So these are plots of results I got last weekend. So I
haven't had time to work out the exact reason for the overfitting. I've got some
ideas about how to avoid it. But yes. I don't have a good explanation of why they
are such clear points.
So there's a sudden one here, a sudden one here, a sudden one here, and it's clearly an
extreme, extreme event. It may be something to do with the Poker Hand data,
which is about poker hands, so in some ways a fairly artificial dataset. And there
are, I think, 10 classes, which are the different types of hand, and with some of
the classes there are only very small numbers of examples. So these may be the
points at which you appear to get enough evidence to start to be able to accurately
classify another class and perhaps it's mistaken in that.
>>: [inaudible].
>> Geoff Webb: Sorry?
>>: How many attributes?
>> Geoff Webb: How many attributes? These are only a small number of
attributes, 10 attributes.
>>: [inaudible] look at this one, the two -- two algorithms for the K selection
[inaudible] prematurely jump to another --
>> Geoff Webb: Yep.
>>: -- K value, and that one [inaudible].
>> Geoff Webb: That's jumping to the wrong number -- yep. Yep.
>>: [inaudible] for K and also very large number of attributes may become
smoother.
>> Geoff Webb: May become smoother. May also become worse because it
may --
>>: [inaudible].
>> Geoff Webb: Yep. Or it may become actually -- may jump all over the place.
Okay. So now on to some experimentation. So we've taken the 16 largest
datasets for this type of attribute-value learning that we're able to
get our hands on. Most of them have numeric data, so that's been -- that's been
discretized.
And while they're large in terms of classical machine learning research, going up
to 54 million, we should note they're really quite small in terms of real-world
applications these days. But there's a fair variety in terms of the dimensionality
and the number of classes.
So we're doing the discretization as we've already discussed. So, first of all,
let's look at performance against KDB. So if we take the -- right, so for the
selective KDB, we're always using k equals 5. Right? For no very good reason.
So we can compare selective KDB against the full -- full classifier.
This is plotting root mean squared error. Below the line means selective KDB is
doing better. And you can see it's always doing better if only marginally.
Sometimes very, very substantially. This is plotted on a log scale.
And here what we've done is we've observed after the fact which value of K is
best. This is in some respects a theoretical result. So we've taken the best
performing K and plotted it against selective KDB. And we can see that we're
still sometimes performing substantially better, and this is clearly the attribute
selection that's doing this.
What sort of cost does it have in terms of training time? So here we're comparing
against out-of-core Bayesian network classifiers. So Naive Bayes, we all know;
TAN, some of you may know, which is a bit like KDB with k equals 1.
AODE is my group's extension to Naive Bayes. This is the full KDB classifier.
And selective KDB again plotted on a log scale. We can see training time,
sometimes the increase in cost is not too much. Sometimes going from KDB
there is a substantial increase in time. So varies somewhat.
Classification time, you can see we're doing quite well. We're always no worse
than KDB k equals 5 for obvious reasons. Usually substantially worse than Naive
Bayes, but sometimes not even all that much worse than Naive Bayes.
Okay. So what about things --
>>: [inaudible] the slide? Where's the performance, the accuracy of these?
>> Geoff Webb: So accuracy --
>>: [inaudible] how much better are you doing in terms of --
>> Geoff Webb: Yeah. Okay. So I've left out the performance here. We're
going to show a summary at the very end of this, showing the RMSE. So we showed
against KDB k equals 5, which actually outperforms all of these on these
larger datasets.
So what about other out-of-core alternatives. So stochastic gradient descent in
Vowpal Wabbit is perhaps state-of-the-art in out-of-core classifiers. So here
we've used all of the default settings except in that we've looked at both squared
and log loss.
We're using quadratic features, which provides the best performance. So for those
not familiar with what that is, that means we take each pairwise combination of
the base features. And we're trying a number of passes. So three passes gives
us something equivalent to what we're doing, and also ten passes turned out to not
perform any better.
>>: And what is [inaudible]?
>> Geoff Webb: And so with Vowpal Wabbit you can train through any number of
iterations over the data.
Discrete attributes have been made into binary features, which is the appropriate
way of dealing with them. And for multiclass classification, we're doing one
against all. So a nice thing about KDB is that it very nicely handles multiple classes.
So if we look with a squared loss, we don't easily get a probability value out of it.
So the fairest comparison is on 0-1 loss rather than on the accuracy of the
probability estimates. And on the squared loss selective KDB gets lower error
eight times and Vowpal Wabbit seven times, so fairly close.
You can see there's one fairly strong win on 0-1 loss, which is the U.S. Postal
Service extended dataset, which is a large amount of sparse numeric data. So
we're probably losing out in the discretization, but I think also sparseness is very,
very important. Vowpal Wabbit is able to take advantage of that in a way that we
can't.
And here we are with the logistic function which does convert appropriately into a
probability estimate. And we're doing root mean squared error. This biggest win
is again the U.S. Postal Service extended, and you can see that the win there is
much less extreme.
With the logistic function, which performs better, the observation is that it requires far
more computation, and we've not been able to complete the computation for two
of the datasets, which are shown as Xs here. And we've actually used the Vowpal
Wabbit there.
If we look at training time, again plotted on a log scale, you can see that for
selective KDB there's only that one case where we're performing poorly, which is the very
sparse data requiring more computation. Remember, this is just on three passes
through the data. And often requiring substantially less computation. And
classification time it also tends to be faster. Again, this one outlier.
If we compare against in-core state-of-the-art techniques, so I've chosen random
forest as the exemplar of in-core state of the art, because it's not parameterized,
we've just taken the Weka version and run it with the default settings.
So a problem I have whenever I compare against the state of the art is someone
always says why didn't you use this -- this setting; I've found random forest results in
the fewest arguments along those lines.
The state of the art in terms of BayesNet learners, so this is the Weka in-core hill
climbing search for the best Bayesian Network Classifier. That's random forest
again. Below the line is selective KDB, getting the lowest RMSE. Above the line
is the alternative winning. You can see that we're never doing substantially worse
than BayesNet and often actually doing far better.
Because we -- and perhaps one reason for this is because we're not doing a hill
climbing search. We're actually taking a very strong family of learners and
looking -- looking between them. So it's a little bit like doing a search from the
full classifier backwards rather than hill climbing from Naive Bayes up.
Sometimes random forest does substantially better. These Xs are where we aren't
able to complete the computation on the full dataset, so actually only learning on
a sampled dataset. So it's possible that we'd actually perform better -- we would be
able to get better performance by learning on the full dataset, which is not feasible with
random forest --
>>: [inaudible] the sample, your [inaudible]?
>> Geoff Webb: Both. So here we've compared performance on a sample for
these datasets. Because random forest can't be completed on the full dataset. We
can complete on the full dataset, so we could get better performance on that
dataset. And -- and --
>>: Isn't the [inaudible] what's the best random forest can do, if you have to
sample, you have to sample, versus the best you can do? Like that --
>> Geoff Webb: So -- so we've compared -- we've made an equivalent comparison,
right, so both are working on the same sample. So --
>>: I understand, but like in that X [inaudible] significantly --
>> Geoff Webb: Yes.
>>: If you use the full data, would you be better?
>> Geoff Webb: We would be better, yes.
>>: Okay.
>> Geoff Webb: Yes. So --
>>: Isn't that your main thesis?
>> Geoff Webb: Yeah, so -- so I'm trying to give a very clear picture of what our
performance is. I mean, random forest, if you had more compute power, you guys
would be able to run random forest on these datasets on the -- on the full data.
So the -- we're clearly doing much less computation. And the only claim I want
to make is not that we're performing at a higher level than in-core classifiers, but
I'm wanting to make very clear, incontrovertible, that we are actually performing
at a very comparable level to in-core.
So out-of-core, we can perform at a level which is very similar to what you can
achieve with random forest in-core.
All right. And this is, again, our log scale. The time comparisons. Here we are
comparing C code against a Java implementation, so you can't pay too much attention to
this. And the key thing is that we're out-of-core. These are in-core.
And we're also bearing an extra cost because we're actually having to load the data
off disk three times whereas these are loading off disk once. And classification
time, we're often faster than random forest, always faster than BayesNet.
So here's the comparison of all of them to keep Ronnie happy. So takes a little
while to get used to what this is all about. We've got color coding of each of the
datasets. We've taken the error and ranked this. So a rank of 1 means we've got
the lowest error on that dataset.
And then each of these plots goes 1, 2, 3, through 8. So we can see Naive Bayes
is almost always 7 or 8. AODE, which is an improvement on Naive Bayes but really
an improvement for small data still -- so it decreases the bias of Naive Bayes, but
it's good for thousands of examples rather than hundreds of thousands of
examples -- is next, so it's getting sort of sixes to eights. TAN is next. So TAN is
like KDB with k equals 1. BayesNet, which can learn arbitrary-complexity
networks, is next. Even though Vowpal Wabbit performance --
>>: [inaudible] on real values, do you discretize for all of them or like --
>> Geoff Webb: No. So -- so -- so this and this -- so -- so random forest and
Vowpal Wabbit, the ones that can use numeric attributes, are using the raw
numeric attributes. And I think that's why on the U.S. Postal Service they do so
much better.
Here's KDB where you magically select the best K. And selective KDB has the
lowest average rank. So it's not always best; the worst it does is a tie for
4th place.
>>: KDB was optimal, the best k.
>> Geoff Webb: This was the best k. So magically after the event it's --
>>: Right. So how could it be worse than selective KDB?
>> Geoff Webb: Because this is selecting both K and the number of entries. This
is the best k but with all attributes.
>>: Okay.
>> Geoff Webb: All right. So some observations. We're dealing very well with
high-dimensional data, perhaps because of our attribute selection. And we're
dealing very well with large quantities of data. So, probably, comparing with
these, the bigger the data quantity, the bigger the advantage we're going to have
over the other ones down there.
Random forest is performing well, relatively, when there are a small number of
attributes. And Vowpal Wabbit has an advantage for sparse numeric data because
it is designed to deal with sparsity and it can extract more information from
numeric data than we can. So that's a different way of viewing basically the same
thing. So it's the mean and variance of the ranking.
>>: What's the largest [inaudible]?
>> Geoff Webb: So it's 564 million examples. We would love to have some
more large datasets to play with if somebody was to kindly give them to us. So
for that large one what's the number of dimensions, or what's the largest
number of dimensions?
>>: The largest one.
>> Geoff Webb: The largest number of dimensions is about 700.
>>: I'm sorry.
>> Geoff Webb: So the largest dataset has -- I think it has about 200, 2- to 300.
So here is a global comparison of where we're at for training. So the simple Bayesian
classifiers are faster at training. Selective KDB -- now I'm doubting my own -- I'm
not sure why we're putting ourselves as better here. And clearly at test time we're
performing far better. AODE handles high-dimensional data very poorly, which
is why it's got such a bad --
>>: [inaudible]?
>> Geoff Webb: VW wasn't done in Weka.
>>: Oh, okay. [inaudible].
>> Geoff Webb: So we've got quadratic features. And for multivalued -- for
multivalued categorical attributes, they have to be binarized. So the
dimensionality can rise a huge amount as a result of that.
>>: It's usually pretty efficient doing [inaudible].
>> Geoff Webb: Well, most of the data isn't sparse.
>>: Once you binarize [inaudible].
>> Geoff Webb: But not all of the attributes are like that. So not all of the
data is sparse.
>> Geoff Webb: Okay. So step back. The trick that we've introduced here is
nested evaluation of a large class of count-based models. And I think we've
shown that this can be very effective. We've shown it where that class of
nested models is generated by KDB.
There are many different ways you could actually generate that class of nested
models. So we've also done the same trick with averaged n-dependence estimators,
which gives us a two-pass learner. That is also quite good but not quite as
accurate as we're able to get here. It doesn't scale to as high levels of n in that case as
you can get K here, so you can't get such high-order interactions.
What remains to be done? Numeric attributes, pretty clearly -- it's been raised a
number of times. It seems like there should be something better to do. We've
looked -- over many years now we have looked at doing all sorts of things with numeric
attributes and Bayesian network classifiers.
And a solution remains elusive. Our preliminary observations suggest that there's
some bad overfitting occurring, even with the leave-one-out cross-validation. So
I have some ideas about how to handle this, but clearly we need to do something
to avoid that overfitting.
There are many, many, many ways we could increase the space of models.
>>: [inaudible] are you saying that leave-one-out seems to overfit?
>> Geoff Webb: Yep. And so what we're doing is we're selecting the model that
has the lowest error. Now, that may be by just one example. So what happens to be
in the training data may result in one model out of a very large class of models that are
actually at pretty much the same level getting chosen. And where that's the case,
we'd actually prefer to choose the one with the fewest attributes and the lowest K.
>>: [inaudible] we've done some work on what we call incremental
cross-validation, and we've observed the same thing. And we actually went to
doing 10- or 20-fold cross-validation and found it to be more accurate.
>> Geoff Webb: Okay.
>>: And you can still do the same trick with incremental classification --
>> Geoff Webb: Yeah, yeah, yeah.
>>: -- remove things, you evaluate, you come back.
>> Geoff Webb: Yep.
>>: But we -- the key point was you have to allow the structure to change.
>> Geoff Webb: Yep.
>>: You're doing this thing on all the data where the structure [inaudible].
>> Geoff Webb: Yep. Yep.
>>: But I'll talk to you offline [inaudible].
>> Geoff Webb: Yeah, yeah, yeah. I'm actually wondering whether it wouldn't
be better to just do the leave-one-out on a small sample rather than all the data,
so as a -- but, anyway, we'll see.
So we can increase the range of alternative models. It's going to increase the
problem of overfitting, of course, so we need to have addressed that first. There are
many other ways in which you can get nested classes, particularly of Bayesian network
classifiers, and maybe alternative sorts of classifiers that are based on count models.
I think that we could pull this back to being just a two-pass learner. So the first
pass is selecting the structure, and in the second we do the training, and we just sample
a smaller set because I don't think we need the full data for the model selection.
In fact, maybe it's even harmful.
And I think there's a possibility of creating a one pass learner. So if you've got
enough data, then you can sort of possibly bootstrap up the complexity of the
model using these types of tricks.
So to summarize, I believe very strongly that large data isn't just about scaling up
existing algorithms. We really need fundamentally different types of algorithms.
We need low-bias, efficient algorithms. I think we are kind of hamstrung by just
trying to deal with it as a problem of dealing with computational complexity.
And we are working on a new generation of theoretically well-founded classifiers -- there's
very clear theory behind Bayesian Network Classifiers; you know exactly why
they do and don't work perfectly -- which are clearly very scalable and hence
capable of forming low-bias models. End of transmission. Questions?
>> Xiaodong He: Any questions? [inaudible].
[applause]
>> Geoff Webb: Thank you.