>> Xiaodong He: So let's start. So, everyone, it's my great pleasure to introduce
Professor Geoff Webb. Geoff is a professor in the Faculty of Information
Technology at Monash University, Australia, where he heads the Centre for
Research in Intelligent Systems.
His primary research areas are machine learning, data mining, and user modeling.
He's editor-in-chief of Data Mining and Knowledge Discovery, co-editor of the
Springer Encyclopedia of Machine Learning, a member of the advisory board of
Statistical Analysis and Data Mining, a member of the editorial board of the
journal Machine Learning, and was a foundation member of the editorial board of
ACM Transactions on Knowledge Discovery from Data.
He was a co-PC chair of the 2010 IEEE International Conference on Data Mining
and a co-General chair of the 2012 IEEE International Conference on Data
Mining, and received the 2013 IEEE ICDM Service Award.
So today Professor Geoff will talk about large-scale Bayesian network
[inaudible]. Now I'm glad to have Professor Geoff here.
>> Geoff Webb: Thank you for those kind words. Oh, how am I supposed to get
this out of the way? Okay. So ignoring the online audience of millions, we're a
small group here. So please feel free to interrupt as we go along. Happy to
expand on detail.
So what I'm talking about is a continuation of the research I've been doing for a
number of years into essentially extensions [inaudible], so how to take the great
scalability of Naive Bayes and expand upon it.
And we moved recently on to issues of how to best learn from large data. I'm
sure I don't need to explain to any of you the ever-increasing need for learning
from ever larger quantities of data.
But most people's response to the problem of learning from large data appears to
be how do we take our existing algorithms and scale them up. And my argument
is that learning best from large data is not a question of how do we make existing
algorithms best cope with the computational challenges of large data but rather
that we need fundamentally new algorithms.
So let me start with giving you a very simple overview of why I believe this. I'm
sure we're all very familiar with the idea of a learning curve. So as data quantity
increases, error typically starts at a high level, drops, and at some point asymptotes.
So you can take any learner and plot a learning curve. And, of course, different
algorithms are going to have different learning curves.
And in particular we will see that algorithms that are able to provide very detailed
descriptions of complex multivariate distributions are going to tend to overfit
when data quantities are small. So they're going to be outperformed by
low-variance algorithms for small data, but they're going to tend to
perform better on large data.
>> Xiaodong He: So this RMSE, is that training data or is that test data?
>> Geoff Webb: So this is on test data.
>> Xiaodong He: I see.
>> Geoff Webb: So this has been done on learning curves where we take -- so we
hold out some data and then we do ever increasing training set size. But all these
examples, I'm going to show a couple of these curves, and they've all been taken
on the Poker Hand dataset because it provides very nice illustrations. The
curve is not always as smooth as this, but the same thing
holds for many, many different types of datasets. The process is fairly well
understood as the bias-variance tradeoff.
Now, most machine learning research has been conducted with things like the
UCI repository, where the majority of datasets have no more than 10,000, in
fact often fewer than 1,000, examples. So they're way, way, way down the bottom
end of these learning curves. So algorithms that are going to be good here look
extremely bad in the space where most machine learning research has been
conducted.
So what we need is algorithms that can closely fit complex multivariate
distributions. But the majority of machine learning research, I believe, has just
ignored these because they actually look very bad in the space where the majority
of research has been done. We need algorithms that are both low bias but are also
very computationally efficient.
All right. So we have the low bias. Being computationally efficient means you
cannot spend a lot of time on each training example. Why not? Because if you've
got large numbers of them, then multiplying a large number of examples by a
large amount of time per example simply takes too much time. And at some level of scalability, if
you're going to use all the data, you have to be processing out of core.
Now, most existing -- so most state-of-the-art low-bias algorithms do not scale.
I'm sure you're all familiar with this. And typical low-bias algorithms, state of the
art, typical examples are random forest, support vector machines, and neural
networks. These are all inherently in core. You need to look at every example
many times. So you cannot process them out of core.
So what I'm going to talk about today is the selective KDB classifier which is a
scalable low-bias Bayesian Network Classifier.
The structure of the talk is the introduction, which I've just given. I'll talk very
briefly about Bayesian Network Classifiers. I think most of you should be pretty
familiar with the background, but I'll show you the technology I'm using and talk
about the new algorithm, give you some experimental results, of course, being an
experimentalist, and then talk about what next.
So Bayesian network classifiers. What are they? Well, they are defined by two
things. Parent relation, so what links are we going to include in the classifier, and
then the Conditional Probability Tables, which provide the conditional
probabilities that are required.
Then to classify we use the posterior probability of each class, which is proportional
to the joint probability of the class and the set of attribute values: the
probability of the class given its parents, times the product of the probabilities of
the X values given their parents.
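In symbols (my notation, not the speaker's), with class y, attribute values x_1 through x_a, and pi_i the parents assigned to X_i by the parent relation:

\[
P(y \mid x_1, \ldots, x_a) \;\propto\; P(y, x_1, \ldots, x_a)
\;=\; P(y \mid \pi_Y) \prod_{i=1}^{a} P(x_i \mid \pi_i).
\]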
And usually what we do in a Bayesian Network Classifier is we make the class a
parent of all of the other variables. And I have some illustration of a Naive Bayes
classifier, which I'm sure you're all familiar with.
Now, one of the very nice things about these types of classifiers is once you've
selected your parent relation, once you've got the structure, you can learn the
Conditional Probability Tables very quickly by just accumulating the
joint counts that underlie them.
So you know which joint frequency counts you need, and you can collect them in a
single pass through the data, and also in an incremental manner. So a very nice thing is that
once you've got the structure, you can just keep refining the classifier by
updating the counts. You never need to relearn.
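To make that concrete, here is a minimal sketch of count-based, incremental CPT learning. This is my own illustration under assumed names, not the speaker's implementation:

```python
from collections import Counter

class CountTables:
    """Joint-count tables for a fixed Bayesian network classifier structure.

    parents[i] lists the indices of attribute i's parents among the other
    attributes (the class is implicitly a parent of every attribute).
    """

    def __init__(self, parents):
        self.parents = parents
        self.class_counts = Counter()
        self.joint_counts = [Counter() for _ in parents]  # one table per attribute

    def update(self, x, y, delta=1):
        """Add (or, with delta=-1, remove) one example's contribution."""
        self.class_counts[y] += delta
        for i, pa in enumerate(self.parents):
            key = (x[i], y) + tuple(x[j] for j in pa)
            self.joint_counts[i][key] += delta
```

Streaming more data just means calling `update` again; the structure never has to be relearned.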
So the particular Bayesian Network Classifier that we're working on -- I'm talking
about here is the k-Dependence Bayes. So this is a reasonably old one, mid '90s, I
think; Mehran Sahami developed it. It requires two-pass
learning.
So the first pass develops the structure you're going to use; the second pass
simply fills in the counts. And in that first pass you collect the counts necessary for
both the mutual information between each attribute and the class and the
conditional mutual information between each pair of attributes given the class.
And you then order all the attributes, so you sort the attributes into an order based
on their mutual information with the class, highest mutual information first, so
the most informative are to the left and the less informative to the right.
And then you go through the attributes in turn selecting parents. The class is a
parent of every attribute, and the remaining parents have to be selected from the
earlier attributes, up to K of them, where K is a user-defined parameter.
You select the K earlier attributes that have the highest
conditional mutual information.
So if we're looking at the parents for X4, we look at which of X1, X2, or X3 have the
highest mutual information with X4 conditioned on the class.
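A rough sketch of that structure-learning pass, assuming the mutual information values have already been computed from the collected counts (the function name and argument layout are mine, not the authors'):

```python
def kdb_structure(mutual_info, cond_mutual_info, k):
    """KDB-style parent selection from precomputed information measures.

    mutual_info[i]         : I(X_i ; Y)
    cond_mutual_info[i][j] : I(X_i ; X_j | Y)
    Returns {attribute index: list of parent attribute indices}; the class is
    implicitly a parent of every attribute.
    """
    # Sort attributes by mutual information with the class, most informative first.
    order = sorted(range(len(mutual_info)), key=lambda i: mutual_info[i], reverse=True)

    parents = {}
    for pos, i in enumerate(order):
        earlier = order[:pos]  # only earlier (more informative) attributes are candidates
        # Take up to k earlier attributes with the highest CMI with X_i given the class.
        parents[i] = sorted(earlier, key=lambda j: cond_mutual_info[i][j], reverse=True)[:k]
    return parents
```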
>>: So -- sorry. So [inaudible] X4 for X5 is because the [inaudible] information
[inaudible]?
>> Geoff Webb: So what I've illustrated here is KDB with k equals 2. So every
attribute can have -- has the class as a parent and at most two of the prior
attributes. So there's no prior attributes here. There's only one, so it has to have
this. There's only two, so it has to have both of them. Here there was a choice of
two of these three, and it took the two that had the highest mutual information
with X4 conditioned on Y. So that didn't include X1. And here we had a choice
of 2, and we've left out X4 and X2.
>>: Okay. So that's just on your first pass.
>> Geoff Webb: That's done in -- right. So the first pass collects the two-way
tables of the joint frequency of each attribute value and each
class value to work out the mutual information with the class and then the
three-way joint frequencies of each pair of attribute values and each class value in
order to work out the conditional mutual information between each pair of
attributes conditioned on the class.
>>: Remind me how you deal with reals.
>> Geoff Webb: With reals. So with all of this we're doing discretization. So the
discretization we're going to use is equal-frequency, five-bin discretization.
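For reference, equal-frequency binning just chooses cut points so that each bin receives roughly the same number of training values. A minimal sketch of that idea, using numpy and names of my own choosing:

```python
import numpy as np

def equal_frequency_cut_points(values, n_bins=5):
    """Cut points splitting `values` into n_bins bins of roughly equal count."""
    interior = np.linspace(0, 1, n_bins + 1)[1:-1]   # interior quantiles
    return np.quantile(values, interior)

def discretize(value, cut_points):
    """Map a numeric value to a bin index in 0 .. len(cut_points)."""
    return int(np.searchsorted(cut_points, value, side="right"))
```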
>>: Okay.
>> Geoff Webb: Okay? So one pass learns structure, second pass learns
Conditional Probability Tables.
So this is quite a nice algorithm from many points of view. So it has this training space complexity.
So the first term there is the size of the joint tables for the conditional mutual
information, so each pair of attributes, each pair of As. So Y is the number of
classes, A the number of attributes, V is the average number of values per
attribute.
And then the second term is the complexity of going -- sorry, this is training
space, so the second one is the size of the count tables that you learn
in the second pass. All right? So you have to go through, for each attribute, all
the combinations of values for it, the class, and all of the attributes that are its
parents.
At classification time you only need that second part; you abandon the first part.
Training time: in the first pass you've got the collection of the tables and the
calculation of the conditional mutual information, and then you've got the
collection of the count tables in the second pass.
And depending on the type of data you're dealing with, any of these three terms
might dominate.
We can see that it's linear with respect to the data quantity, which is very nice.
And classification time is quite fast: for each class
and each attribute, you descend a tree over K values to find
the correct count.
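As a rough summary of the terms just described, in the notation on the slides (N examples, A attributes, V values per attribute on average, Y classes, K parents), my reading of the complexities is approximately:

\[
\text{training space} \approx O\!\left(Y A^{2} V^{2} + Y A V^{K+1}\right), \qquad
\text{classification space} \approx O\!\left(Y A V^{K+1}\right),
\]
\[
\text{training time} \approx O\!\left(N A^{2} + Y A^{2} V^{2} + N A K\right), \qquad
\text{classification time per example} \approx O\!\left(Y A K\right).
\]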
Now, an interesting thing about KDB is you've got this variable K, and it very
clearly controls a bias-variance tradeoff. So in my first plot I was actually -- my
first set of learning curves, the two algorithms were KDB with k equals 2 and k
equals 5. And here I've plotted KDB with k equals 0, which is Naive Bayes
through to KDB k equals 1, 2, 3, 4, and 5. We can see that 5 is still a long way
from asymptoting.
Okay. So we've got a nice way of controlling bias and variance, but we've got a
problem that we don't know what is the right value of this for any particular
dataset. The only way we can work this out is by actually trying to --
>>: Didn't you say [inaudible] 4 and 5 flipped?
>> Geoff Webb: So -- so 4 is getting close to asymptoting; 5 is going to
asymptote out here somewhere.
>>: Oh, okay. Looks [inaudible] hard to read.
>> Geoff Webb: And so this -- 4 is here, 5 is here.
>>: So you [inaudible].
>> Geoff Webb: Yep.
>>: [inaudible].
>> Geoff Webb: Yep. So it's very clear that if you have enough data, a high
value of K can never hurt. Because all you're doing is adding additional links.
And if you've got enough -- enough data, then even if they're irrelevant, they'll
just -- the data will tell you that and will factor itself out.
>>: Okay.
>> Geoff Webb: Okay? So adding K only hurts insofar as the probability
estimates that you develop from the data are inaccurate. And the higher the K, the
less data each of those estimates is taken from, so the less accurate they'll be
unless you have enough data. So somewhere here this has to at least join this
one, if you just get enough data.
Okay. So there's no way to select K for any given dataset in advance. But the
attribute independence assumption might actually be correct, so 0 might
actually always be the best value. The other ones will eventually asymptote to
this, but they can start off however they like. So we don't know that from that point --
>>: What dataset are you showing here?
>> Geoff Webb: So this is the Poker Hand dataset. But the same thing has to be
true for any dataset. Because the additional links are only harmful insofar as the
probability estimates are inaccurate because you don't have enough data. So if
you go to infinity, the higher K is always going to, in the worst case, asymptote
to the same value as a lower K, and may asymptote to a better one because it can
describe a wider range of distributions.
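Put slightly more formally (my paraphrase, with M_K standing for the set of distributions representable by KDB with at most K parents per attribute): because the model classes are nested, the infinite-data error can only stay the same or improve as K grows,

\[
\mathcal{M}_{K} \subseteq \mathcal{M}_{K+1}
\quad\Longrightarrow\quad
\lim_{N \to \infty} \operatorname{err}_{K+1}(N) \;\le\; \lim_{N \to \infty} \operatorname{err}_{K}(N).
\]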
And also --
>>: There's no -- there's no attempt to reduce the variance? Like your table
grows exponentially with K, right?
>> Geoff Webb: Yep.
>>: [inaudible].
>> Geoff Webb: Yep.
>>: And so there may still be events that you haven't seen enough, you're not
attempting to reduce the variance of -- what do you do with zeros or things like
that?
>> Geoff Webb: So we -- we're doing an M estimate to smooth off the estimate.
>>: [inaudible].
>> Geoff Webb: Yep. We're doing an M estimate on all of the
probability estimates that we develop. Spurious attributes may also increase the
error. And, again, if there's enough data, they wouldn't matter. But if there's not
enough data, they're going to introduce a slight amount of noise. And, again,
we've got no way of selecting that.
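For the m-estimate smoothing mentioned a moment ago, here is the usual form of the estimate as a sketch; the value of m and the uniform prior are my assumptions, not necessarily the settings used in this work:

```python
def m_estimate(count, parent_count, n_values, m=1.0):
    """Smoothed estimate of P(x | parents): never exactly zero for unseen events.

    count        : joint count of this value with its parent configuration
    parent_count : count of the parent configuration alone
    n_values     : number of possible values of the attribute (uniform prior 1/n_values)
    """
    return (count + m / n_values) / (parent_count + m)
```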
Now, an interesting observation about the KDB classifier is that a full KDB
classifier actually embeds a whole lot of simpler classifiers. So for any M(i,j),
where i is a value for K and j is a number of attributes -- we're
ordering the attributes here by mutual information, because that's the
way in which the classifier is developed -- any M(i,j) is a minor extension of
M(i-1,j) and of M(i,j-1).
So, for example, with the KDB classifier we illustrated earlier, you can simply
take out one of the level of links. And here we have the KDB k equals 1 where
we've just got for each attribute the parents with the highest mutual information.
And you can see, I think, that to have encoded the full one, we have to have also
encoded the lower level one. Or we could just take out the last attribute. And,
again, to have encoded the full one, we have to have also encoded this one.
So we've actually -- in forming the full classifier, we've actually also created all of
these other classifiers.
So the very simple trick that we're going to do is in one more pass through the
data select between all these classifiers. And we're going to do it using
leave-one-out cross-validation.
So the full model subsumes K times A submodels. Right? So A is the number of
attributes, K is the value of -- actually, it's (K plus 1) times A, because you've also
got k equals 0, the Naive Bayes model. So each of these is a very powerful
model, and we're going to very efficiently select between a large class of strong
models.
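As a rough sketch of that selection pass (my own illustration, not the authors' code; `loo_losses` is a placeholder for the per-example step, described shortly, that removes the example's own counts, scores it under every nested submodel at once, and restores the counts):

```python
def select_submodel(data, full_model, max_k, n_attributes):
    """Choose the (k, number-of-attributes) pair with the lowest leave-one-out loss."""
    losses = {(k, a): 0.0
              for k in range(max_k + 1)           # k = 0 is Naive Bayes
              for a in range(1, n_attributes + 1)}
    for x, y in data:                             # one extra pass through the data
        # Assumed interface: loss of held-out (x, y) under every nested (k, a) submodel.
        for key, loss in full_model.loo_losses(x, y).items():
            losses[key] += loss
    return min(losses, key=losses.get)            # best (k, number of attributes)
```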
>>: [inaudible] leave-one-out only in phase two?
>> Geoff Webb: No, we're going to do it in phase three.
>>: No, no, but how does phase one and two run?
>> Geoff Webb: Phase one and two is learning full KDB model. Phase three is
now going to be --
>>: You're using all the data?
>> Geoff Webb: Yep.
>>: So the leave-one-out is a little bit of a cheat, right?
>> Geoff Webb: No. Why?
>>: Because you've used all the data to learn the structure.
>> Geoff Webb: But when we do leave-one-out, we're going to leave it out in
order to -- or to learn the structure. So it depends what you think is being cheated.
So possibly. But we're looking at very large data. And we're not claiming to
learn exactly the same --
>>: But it's not equivalent --
>> Geoff Webb: -- as you would --
>>: -- of learning [inaudible].
>> Geoff Webb: It's not equivalent --
>>: But it's close. You're [inaudible] fudge.
>> Geoff Webb: Yeah. Well --
>>: [inaudible].
>> Geoff Webb: -- I'm not really saying that it's --
>>: [inaudible].
>> Geoff Webb: Yeah. It's not really a fudge. I'm not claiming that it's exactly
the same as if you did leave-one-out cross-validation, setting a different value of
the number of attributes and a different value of K and then compared all the
results. I'm not claiming that it's equivalent to that.
So why leave-one-out cross-validation? Because leave-one-out cross-validation is
a very low bias estimator of out-of-sample performance and because Pazzani's
trick makes it extremely efficient for Bayesian Network Classifiers.
What's Pazzani's trick? Pazzani's trick is you collect the count tables that you
need to estimate the probabilities, and then when you come to classify an example,
you simply subtract it from the count tables. So very efficiently you can perform
the leave-one-out. You don't need to learn a new model each time.
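A minimal sketch of Pazzani's trick on its own, reusing the hypothetical count-table interface from the earlier sketch (again illustrative, not the actual implementation):

```python
def leave_one_out_prediction(tables, x, y, classify):
    """Score a training example as if it had never been seen."""
    tables.update(x, y, delta=-1)     # subtract the example's own counts
    prediction = classify(tables, x)  # classify with it left out
    tables.update(x, y, delta=+1)     # restore the counts
    return prediction
```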
So because the full model subsumes all the other models, to evaluate all the
submodels actually doesn't take all that much more computation than just
evaluating the full model in this way. So it's very, very efficient.
So the resulting complexity, well, the space is exactly the same because the full
model encodes all the models that we're dealing with, so we require no more
space. At classification time, we can actually save space because we now might
have smaller values of A and K, so we can drop the unneeded parts out.
And the complexity -- the only increase is this final additional pass, which is as
previously -- so previously we had the number of examples times number of
attributes times the value of K, because we went through doing the counts.
Now we're also going to have to look at each class in the final pass to do the
evaluation. So it's a slight increase in the complexity. Often in practice one of
these two -- sometimes actually just compiling the conditional mutual information
will still dominate the training time. So this is not the dominant term.
And classification time, again, we may actually be faster because we've dropped
parts out of the model.
Okay. So this is back to our curves for KDB with k equals 0, 1, 2, 3, 4, 5. This is what
happens if we do our trick but we don't select the number
of attributes -- so we're using the full set of attributes, we're just selecting the appropriate
value of K. You can see we've done very well. We actually improve a little bit
upon the best K because sometimes a lower K is performing better, perhaps. So
this is the average over 10 runs; some runs one did better than the other, and
it's probably selected the wrong one.
Here we're overfitting and we jump to 4 earlier than we ideally would. And quite
badly overfitting -- something that we want to do something about, and something
that I'm sure has maybe given us a good pointer as to what we should do.
Here is if we just take KDB with k equals 5 and select attributes without selecting K. So you
can see we're substantially improving upon the version without attribute selection.
And here is where we do both together. So we're not doing much in the way of
attribute selection with small amounts of data, and then we have this overfitting
effect where we're jumping to -- so it happens that with the attribute selection k
equals 4 and k equals 5 both track along the same lines, so they're both fairly
equivalent.
>>: [inaudible].
>> Geoff Webb: Yeah. So these are plots of results I got last weekend. So I
haven't had time to work out the exact reason for the overfitting. I've got some
ideas about how to avoid it. But yes. I don't have a good explanation of why they
are such clear points.
So there's a sudden one here, a sudden one here, a sudden one here, and it's clearly an
extreme, extreme event. It may be something to do with the Poker Hand data,
which is about poker hands, so in some ways a fairly artificial dataset. And there
are, I think, 10 classes, which are the different types of hand, and with some of
the classes there are only very small numbers of examples. So these may be the
points at which you appear to get enough evidence to start to be able to accurately
classify another class and perhaps it's mistaken in that.
>>: [inaudible].
>> Geoff Webb: Sorry?
>>: How many attributes?
>> Geoff Webb: How many attributes? These are only a small number of
attributes, 10 attributes.
>>: [inaudible] look at this one, the two -- two algorithms for the K selection
[inaudible] prematurely jump to another --
>> Geoff Webb: Yep.
>>: -- K value, and that one [inaudible].
>> Geoff Webb: That's jumping to the wrong number -- yep. Yep.
>>: [inaudible] for K and also very large number of attributes may become
smoother.
>> Geoff Webb: May become smoother. May also become worse because it
may --
>>: [inaudible].
>> Geoff Webb: Yep. Or it may become actually -- may jump all over the place.
Okay. So now on to some experimentation. So we've taken the 16 largest
datasets for this type of attribute-value learning that we're able to
get our hands on. Most of them have numeric data, so that's been -- that's been
discretized.
And while they're large in terms of classical machine learning research, going up
to 54 million, we should note they're really quite small in terms of real-world
applications these days. But there's a fair variety in terms of the dimensionality
and the number of classes.
So we're doing the discretization as we've already discussed. So, first of all,
let's look at performance against KDB. So if we take the -- right, so for the
selective KDB, we're always using k equals 5. Right? For no very good reason.
So we can compare selective KDB against the full -- full classifier.
This is plotting root mean squared error. Below the line means selective KDB is
doing better. And you can see it's always doing better if only marginally.
Sometimes very, very substantially. This is plotted on a log scale.
And here what we've done is we've observed after the fact which value of K is
best. This is in some respects a theoretical result. So we've taken the best
performing K and plotted it against selective KDB. And we can see that we're
still sometimes performing substantially better, and this is clearly the attribute
selection that's doing this.
What sort of cost does it have in terms of training time? So here we're comparing
against out-of-core Bayesian network classifiers. So Naive Bayes, we all know;
TAN, some of you may know, which is a bit like KDB with k equals 1.
AODE is my group's extension to Naive Bayes. This is the full KDB classifier.
And selective KDB again plotted on a log scale. We can see training time,
sometimes the increase in cost is not too much. Sometimes going from KDB
there is a substantial increase in time. So varies somewhat.
Classification time, you can see we're doing quite well. We're always no worse
than KDB k equals 5 for obvious reasons. Usually substantially worse than Naive
Bayes, but sometimes not even all that much worse than Naive Bayes.
Okay. So what about things --
>>: [inaudible] the slide? Where's the performance, the accuracy of these?
>> Geoff Webb: So accuracy --
>>: [inaudible] how much better are you doing in terms of --
>> Geoff Webb: Yeah. Okay. So I've left out the performance here. We're
going to show a summary at the very end of this, showing the RMSE. So we showed
against KDB k equals 5, which actually outperforms all of these on these
larger datasets.
So what about other out-of-core alternatives. So stochastic gradient descent in
Vowpal Wabbit is perhaps state-of-the-art in out-of-core classifiers. So here
we've used all of the default settings except in that we've looked at both squared
and log loss.
We're using quadratic features, which provides the best performance. So for those
not familiar with what that is, that means we take each pairwise combination of
the base features. And we're trying a number of passes. So three passes gives
us something equivalent to what we're doing, and also ten passes turned out to not
perform any better.
>>: And what is [inaudible]?
>> Geoff Webb: And so with Vowpal Wabbit you can train through any number of
iterations over the data.
Discrete attributes have been made into binary features, which is the appropriate
way of dealing with them. And for multiclass classification, we're doing one
against all. So a nice thing about KDB is that it very nicely handles multiple classes.
So if we look with a squared loss, we don't easily get a probability value out of it.
So the fairest comparison is on 0-1 loss rather than on the accuracy of the
probability estimates. And on the squared loss selective KDB gets lower error
eight times and Vowpal Wabbit seven times, so fairly close.
You can see there's one fairly strong win on 0-1 loss, which is the U.S. Postal
Service extended dataset, which is a large amount of sparse numeric data. So
we're probably losing out in the discretization, but I think also sparseness is very,
very important. Vowpal Wabbit is able to take advantage of that in a way that we
can't.
And here we are with the logistic function which does convert appropriately into a
probability estimate. And we're doing root mean squared error. This biggest win
is again the U.S. Postal Service extended, and you can see that the win there is
much less extreme.
With the logistic function, which performs better, the observation is that it requires far
more computation, and we've not been able to complete the computation for two
of the datasets, which are shown as Xs here. And we've actually used the Vowpal
Wabbit there.
If we look at training time, again plotted on a log scale, you can see that for
selective KDB there's only that one case where we're performing poorly, which is the very
sparse data requiring more computation. Remember, this is just on three passes
through the data. And often requiring substantially less computation. And
classification time it also tends to be faster. Again, this one outlier.
If we compare against in-core state-of-the-art techniques, so I've chosen random
forest as the exemplar of in-core state of the art, because it's not parameterized,
we've just taken the Weka version and run it with the default settings.
So a problem I have whenever I compare against the state of the art is someone
always says why didn't you use this -- this setting; I've found random forest results in
the fewest arguments along those lines.
The state of the art in terms of BayesNet learners, so this is the Weka in-core hill
climbing search for the best Bayesian Network Classifier. That's random forest
again. Below the line is selective KDB, getting the lowest RMSE. Above the line
is the alternative winning. You can see that we're never doing substantially worse
than BayesNet and often actually doing far better.
Because we -- and perhaps one reason for this is because we're not doing a hill
climbing search. We're actually taking a very strong family of learners and
looking -- looking between them. So it's a little bit like doing a search from the
full classifier backwards rather than hill climbing from Naive Bayes up.
Sometimes random forest does substantially better. These Xs are where we aren't
able to complete the computation on the full dataset, so actually only learning on
a sampled dataset. So it's possible that we'd actually perform better -- we would be
able to get better performance by learning on the full dataset, which is not feasible with
random forest --
>>: [inaudible] the sample, your [inaudible]?
>> Geoff Webb: Both. So here we've compared performance on a sample for
these datasets. Because random forest can't be completed on the full dataset. We
can complete on the full dataset, so we could get better performance on that
dataset. And -- and --
>>: Isn't the [inaudible] what's the best random forest can do, if you have to
sample, you have to sample, versus the best you can do? Like that --
>> Geoff Webb: So -- so we've compared -- we've made an equivalent comparison,
right, so both are working on the same sample. So --
>>: I understand, but like in that X [inaudible] significantly --
>> Geoff Webb: Yes.
>>: If you use the full data, would you be better?
>> Geoff Webb: We would be better, yes.
>>: Okay.
>> Geoff Webb: Yes. So --
>>: Isn't that your main thesis?
>> Geoff Webb: Yeah, so -- so I'm trying to give a very clear picture of what our
performance is. I mean, random forest, if you had more compute power, you guys
would be able to run random forest on these datasets on the -- on the full data.
So the -- we're clearly doing much less computation. And the only claim I want
to make is not that we're performing at a higher level than in-core classifiers, but
I'm wanting to make very clear, incontrovertible, that we are actually performing
at a very comparable level to in-core.
So out-of-core, we can perform at a level which is very similar to what you can
achieve with random forest in-core.
All right. And this is, again, our log scale. The time comparisons. Here we are
comparing C code against a Java implementation, so you can't pay too much attention to
this. And the key thing is that we're out-of-core. These are in-core.
And we're also bearing an extra cost because we're actually having to load the data
off disk three times whereas these are loading off disk once. And classification
time, we're often faster than random forest, always faster than BayesNet.
So here's the comparison of all of them to keep Ronnie happy. So takes a little
while to get used to what this is all about. We've got color coding of each of the
datasets. We've taken the error and ranked this. So a rank of 1 means we've got
the lowest error on that dataset.
And then each of these plots goes 1, 2, 3, through 8. So we can see Naive Bayes
is almost always 7 or 8. AODE, which is an improvement on Naive Bayes but really
an improvement for small data still -- so it decreases the bias of Naive Bayes, but
it's good for thousands of examples rather than hundreds of thousands of
examples -- is next, so it's getting sort of sixes to eights. TAN is next. So TAN is
like KDB with k equals 1. BayesNet, which can learn arbitrary-complexity
networks, is next. Even though Vowpal Wabbit performance --
>>: [inaudible] on real values, do you discretize for all of them or like --
>> Geoff Webb: No. So -- so -- so this and this -- so -- so random forest and
Vowpal Wabbit, the ones that can use numeric attributes, are using the raw
numeric attributes. And I think that's why on the U.S. Postal Service they do so
much better.
Here's KDB where you magically select the best K. And selective KDB has the
lowest average rank. So it's not always best; the worst it does is a tie for
4th place.
>>: KDB was optimal, the best k.
>> Geoff Webb: This was the best k. So magically after the event it's --
>>: Right. So how could it be worse than selective KDB?
>> Geoff Webb: Because this is selecting both K and the number of entries. This
is the best k but with all attributes.
>>: Okay.
>> Geoff Webb: All right. So some observations. We're dealing very well with
high-dimensional data, perhaps because of our attribute selection. And we're
dealing very well with large quantities of data. So, probably, comparing with
these, the bigger the data quantity, the bigger the advantage we're going to have
over the other ones down there.
Random forest is performing well, relatively, when there are a small number of
attributes. And Vowpal Wabbit has an advantage for sparse numeric data because
it is designed to deal with sparsity and it can extract more information from
numeric data than we can. So that's a different way of viewing basically the same
thing. So it's the mean and variance of the ranking.
>>: What's the largest [inaudible]?
>> Geoff Webb: So it's 564 million examples. We would love to have some
more large datasets to play with if somebody was to kindly give them to us. So
for that large one what's the number of dimensions, or what's the largest
number of dimensions?
>>: The largest one.
>> Geoff Webb: The largest number of dimensions is about 700.
>>: I'm sorry.
>> Geoff Webb: So the largest dataset has -- I think it has about 200, 2- to 300.
So here is a global comparison of where we're at for training. So the simple Bayesian
classifiers are faster at training. Selective KDB -- now I'm doubting my own -- I'm
not sure why we're putting ourselves as better here. And clearly at test time we're
performing far better. AODE handles high-dimensional data very poorly, which
is why it's got such a bad --
>>: [inaudible]?
>> Geoff Webb: VW wasn't done in Weka.
>>: Oh, okay. [inaudible].
>> Geoff Webb: So we've got quadratic features. And for multivalued -- for
multivalued categorical attributes, they have to be binarized. So the
dimensionality can rise a huge amount as a result of that.
>>: It's usually pretty efficient doing [inaudible].
>> Geoff Webb: Well, most of the data isn't sparse.
>>: Once you binarize [inaudible].
>> Geoff Webb: But not all of the attributes are like that. So not all of the
data is sparse.
>> Geoff Webb: Okay. So step back. The trick that we've introduced here is
nested evaluation of a large class of count-based models. And I think we've
shown that this can be very effective. We've shown it where that class of
nested models is generated by KDB.
There are many different ways you could actually generate that class of nested
models. So we've also done the same trick with averaged n-dependence estimators,
which gives us a two-pass learner. That is also quite good but not quite as
accurate as we're able to get here. It doesn't scale to as high levels of n in that case as
you can get K here, so you can't get such high-order interactions.
What remains to be done? Numeric attributes, pretty clearly -- it's been raised a
number of times. It seems like there should be something better to do. We've
looked -- over many years now we have looked at doing all sorts of things with numeric
attributes and Bayesian network classifiers.
And a solution remains elusive. Our preliminary observations suggest that there's
some bad overfitting occurring, even with the leave-one-out cross-validation. So
I have some ideas about how to handle this, but clearly we need to do something
to avoid that overfitting.
There are many, many, many ways we could increase the space of models.
>>: [inaudible] are you saying that leave-one-out seems to overfit?
>> Geoff Webb: Yep. And so what we're doing is we're selecting the model that
has the lowest error. Now, that may be by just one example. So what happens to be
in the training data may result in one model out of a very large class of models that are
actually at pretty much the same level getting chosen. And where that's the case,
we'd actually prefer to choose the one with the fewest attributes and the lowest K.
>>: [inaudible] we've done some work on what we call incremental
cross-validation, and we've observed the same thing. And we actually went to
doing 10- or 20-fold cross-validation and found it to be more accurate.
>> Geoff Webb: Okay.
>>: And you can still do the same trick with incremental classification --
>> Geoff Webb: Yeah, yeah, yeah.
>>: -- remove things, you evaluate, you come back.
>> Geoff Webb: Yep.
>>: But we -- the key point was you have to allow the structure to change.
>> Geoff Webb: Yep.
>>: You're doing this thing on all the data where the structure [inaudible].
>> Geoff Webb: Yep. Yep.
>>: But I'll talk to you offline [inaudible].
>> Geoff Webb: Yeah, yeah, yeah. I'm actually wondering whether it wouldn't
be better to just do the leave-one-out on a small sample rather than all the data,
so as a -- but, anyway, we'll see.
So we can increase the range of alternative models. It's going to increase the
problem of overfitting, of course, so we need to have addressed that first. There are
many other ways in which you can get nested classes, particularly of Bayesian network
classifiers, and maybe alternative sorts of classifiers that are based on count models.
I think that we could pull this back to being just a two-pass learner. So the first
pass is selecting the structure, and in the second we do the training, and we just sample
a smaller set because I don't think we need the full data for the model selection.
In fact, maybe it's even harmful.
And I think there's a possibility of creating a one pass learner. So if you've got
enough data, then you can sort of possibly bootstrap up the complexity of the
model using these types of tricks.
So to summarize, I believe very strongly that large data isn't just about scaling up
existing algorithms. We really need fundamentally different types of algorithms.
We need low-bias, efficient algorithms. I think we are kind of hamstrung by just
trying to deal with it as a problem of dealing with computational complexity.
And we are working on a new generation of theoretically well-founded classifiers -- there's
very clear theory behind Bayesian Network Classifiers; you know exactly why
they do and don't work perfectly -- which are clearly very scalable and hence
capable of forming low-bias models. End of transmission. Questions?
>> Xiaodong He: Any questions? [inaudible].
[applause]
>> Geoff Webb: Thank you.