>> Dengyong Zhou: It's my pleasure to invite Jian Zhang as our speaker this
afternoon. Jian is from Purdue University, the Department of Statistics,
currently an Assistant Professor. He got his Ph.D. from CMU in computer science
in 2006. His research interests include statistical machine learning. So,
please.
>> Jian Zhang: Yeah. First I would like to thank Dengyong for inviting me
here. So it's really my pleasure to visit here and also it's very nice to see
some old friends. Okay. So my talk today is about large-scale learning by
using a technique we call data compression. As you will see, this is actually
a very simple idea. And this is joint work with Zong Yen. Okay. So here's
the outline of the talk.
And I'll first introduce -- talk about the challenge. So the massive data
challenge for large-scale machine learning. And I'll also talk about some
possible approaches that can be used to handle this. And after that I will
introduce a method and give you the motivation on why this method will work.
Then we will actually show you the analysis about the property, the statistical
properties of the method. And after that I'll present some grouping algorithms
which will be used to compress the data. And finally I'll show you some
experimental results and conclude.
So what is the challenge? I think one of the most important challenges in
machine learning nowadays is how do you learn with massive training data. And,
in particular, there could be three difficulties. The first one is about
training time complexity. When you have a dataset with thousands or even
millions of examples, how do you efficiently and effectively learn a predictor,
a classifier or regressor.
The other thing is about testing time. And often it is the case, especially in
industry, that you may need to apply the classifier you learned and make
predictions for, say, millions of Web pages. How can you do this efficiently?
And this is particularly important in the large-scale setting because we have a
lot of training examples. Typically, the more examples you have, the more
complex a model you will probably choose, going more towards the nonparametric
side. So in that sense it's no longer possible to simply store the classifier
as a simple weight vector and do efficient prediction.
So you really need to use the training data to make predictions. And the third
one is about storage. And this may not be as severe an issue as the previous
two. And also there is often a trade-off between the memory and the training
time, for example. But this could be a problem in situations, for example,
when you want to deploy the classifier into some portable devices like PDA and
you have limited storage you can use for this case.
And there are many examples that fall into this category, such as if you want
to classify Web pages retrieved by search engines, or if you want to do pattern
recognition for a DNA database, and also in cases where you want to classify or
model network data or financial stock markets, where you have tons of data.
And the last example I put here is the emerging online blogs and videos and
also the social network data. So those are all examples of the large-scale,
massive datasets we need to deal with.
And so here I would like to briefly go through several possible approaches that
can be used to handle such a situation. And actually some of them are not
originally invented for this particular problem. But the reason I list them
here is because they can be used at least as a way to handle the problem. And
of course I will mention the limitations.
So the first one is it's quite simple. It's called random down sampling. So
the basic idea is that you get a random sample of a subset of original training
data and then you run a classifier or regressor, whatever the algorithm. It's
fairly easy to use. But the problem is very obvious. You simply discard a lot
of valuable information. So you're wasting your data.
And this is actually a popular method for unbalanced classification. There you
have, for example, a very large number of negative examples but only very few
positive examples, and you have to handle the unbalanced dataset.
And the second one I list here is the -- well, active learning methods. The
goal of active learning is that you want to sequentially seek training examples
to label, asking for the labels of the examples which can most improve the
classifier or the predictor. Typically this is done by various criteria, such
as finding the example with maximum uncertainty, or, if you're Bayesian,
finding the example such that after you add it the variance of the posterior
distribution is greatly reduced.
The issue with this approach is of course that it can be very expensive to find
out which example is the most valuable for updating your classifier. So in
that sense it's more suitable for problems where the label is expensive to
obtain.
But in this case we have plenty of labeled data, so we want to make full use of
it. So this may not be that appropriate here, but it is a possible approach.
And the next one is really not, is really not a single method. It's something
I call a distributed computing. Essentially the idea is to utilize the
distributed computing environment for your learning task.
So one very good example is Map-Reduce, or multi-core machines. Map-Reduce has
been used to help solve large-scale information problems, for example. And
this can be done very efficiently. It can really be used to handle a very
large-scale dataset.
Now, the problem for this approach is the following: often it's nontrivial to
make an algorithm distributed or to parallelize it, and this may not be
applicable to many of the learning methods.
The last one I want to -- the last category I want to mention is something I
call sequential learning methods here, but essentially the well-known examples
are online learning, or training your classifier using stochastic gradient
descent, for example.
So for this type of approach, basically you update your classification rule by
going through the examples sequentially, one by one, or small batch by small
batch, instead of trying to solve the single optimization problem as a whole.
This type of method is very scalable. But they can also cost a lot of time,
because sometimes you have to go through multiple passes of the data, and you
also need to store the data while you do the training.
And for the prediction, if you are running nonparametric models you need a
rather large storage as well. I also want to mention that this method can be
computed in a distributed way. For example, [indiscernible] had some paper
recently about how to run this in a distributed computing environment. So
those are the possible approaches.
And the goal here is actually that we want to have a method which is scalable
in terms of both training time and test time as well as storage. We want to
find something that can be very scalable and efficient for a very large
training set.
So I want to first give some notation here. So assume that you have i.i.d.
observations (X1, Y1), ..., (XN, YN) drawn from a distribution PXY, which is
typically fixed but not known.
For binary classification you have binary labels, 0/1 or plus/minus 1,
whichever way you want, and for regression the label is real-valued. And we
are given a training set DN; in our situation this is typically very large.
And PXY is an unknown distribution, as we mentioned. A learning algorithm is a
procedure which takes this training set and produces a predictor. So it's a
function of the training set. And typically, when you try to find this
predictor, you're also searching over a function class, often called the
hypothesis space. So the predictor can also be written as a function of the
function class H and the training set.
And we'll mainly be focusing on the supervised learning problems, which
essentially are the classification and regression.
And in this case, in order to evaluate the classifier or your regressor, we
need to define some loss function. So here L is a loss function. This is the
function that we will be using to evaluate the goodness of the classifier, and
the risk of a classifier is defined as the expected value of this loss
function, where the expectation is taken over PXY. So this is the risk. And
this quantity is itself a random variable, because it's a function of DN, since
the classifier is estimated using DN.
One thing I want to mention is that this loss function is the one we use for
evaluation. There is another loss function, often called the surrogate loss;
that's the function you use to train the classifier. Now, in this talk we
don't differentiate those two. We simply assume they're the same.
But there are situations where they are different. For example, if you want to
evaluate using 0-1 loss, it's hard to minimize it directly, so you minimize a
surrogate function. But it is well known that there are certain relationships
between the excess risks if you use two different loss functions.
And also I want to give the definition of risk consistency. So a learning
procedure is called risk consistent if the expected risk converges to R star.
R star is typically the minimum possible risk you can achieve over all
measurable functions; this is the Bayes risk in the classification setting.
And essentially this is saying that as more and more training data is provided,
you should converge to the optimal solution, optimal in the sense that it
achieves the minimum risk.
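As a reference, here is one way to write the risk and the consistency statement just described in symbols; this is the standard formulation rather than a transcription of the slide.

```latex
% Risk of a predictor h, and the Bayes risk
R(h) = \mathbb{E}_{(X,Y)\sim P_{XY}}\bigl[L(Y, h(X))\bigr],
\qquad
R^\ast = \inf_{f\ \text{measurable}} R(f).

% Risk consistency of a learning procedure producing h_N from D_N
\mathbb{E}\bigl[R(h_N)\bigr] \to R^\ast \quad \text{as } N \to \infty.
```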
Okay. So how do we learn with compressed data? Okay. Now let's look at the
standard learning problem. You have this training set DN, and we will be
focusing on one of the most popular frameworks, regularized risk minimization.
So essentially you want to solve this optimization P1. This is saying that you
want to minimize empirical loss over the training data since you have the Y
labels you can do that. Plus some regularization term.
So this G of H is a roughness penalty, which penalizes very rough or complex
functions, and lambda is just the parameter that controls the trade-off between
the goodness of fit and the model complexity.
And here L is the surrogate loss function. As we mentioned earlier we assumed
it's the same as the loss we use for evaluation.
So this is the learning problem. For large-scale learning with a massive
dataset, even though people often use convex loss functions with a relatively
simple hypothesis space, this could be very challenging, because N is very
large. Essentially you want to minimize this objective, but for very large N,
thousands or millions of examples. How can you effectively and efficiently
solve this problem? So that's the challenge.
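For concreteness, the optimization P1 being described can be written as follows; G and lambda are the penalty and trade-off parameter from above.

```latex
% P1: regularized empirical risk minimization on the full training set
\hat{h}_N \;=\; \arg\min_{h \in \mathcal{H}}\;
\frac{1}{N}\sum_{i=1}^{N} L\bigl(Y_i, h(X_i)\bigr) \;+\; \lambda\, G(h).
```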
And so what do we propose? Well, we propose the following simple idea, which
is first you want to do a partition of the dataset. Let's say your dataset is
DN. Now you want to do a partition GN of it, such that the partition has MN
sets. The reason I have this small n subscript is because this should
typically depend on N; it's a function of N. So as N changes, MN should
change. But I'm going to suppress the N subscript in the following.
Assume you have this partition -- and of course for a partition the union of
the sets has to cover the whole dataset and the pairwise intersections are
empty.
I'm going to define the compressed example. What is a compressed example? For
each set AJ in this partition, the compressed example (X tilde J, Y tilde J) is
simply a weighted average. So X tilde J will be the weighted average of the XI
in the set, and Y tilde J will be the weighted average of the YI. W is the
weight function; it depends on which X you use and which set in the partition.
W we assume to be normalized, so if you sum the weights over all the X for a
particular set in the partition, you get 1. And this (X tilde J, Y tilde J)
can be thought of as the representer for the particular set. So for each set
you come up with one representer. A simple way to understand this, even though
this is written in a more general way, is that you can think of this weight
W(XI, AJ) as 1 over the cardinality of AJ if XI belongs to this set -- because
this is a partition -- and otherwise you just take 0. This is the simplest way
to interpret this.
If you do this, then it becomes a simple average. So it's really simple. So
you partition the data, you take the simple average. For X and for Y as well.
This is how you get this compressed, we call it compressed examples.
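A minimal sketch of this compression step in code, assuming the simple equal-weight case just described; the function and variable names here are mine, not from the talk.

```python
import numpy as np

def compress(X, y, groups):
    """Average the examples inside each set of a partition.

    X: (n, p) feature matrix, y: (n,) targets or labels,
    groups: (n,) integer set id for each example (the partition).
    Returns compressed features, compressed targets, and set sizes.
    """
    ids = np.unique(groups)
    X_tilde = np.vstack([X[groups == g].mean(axis=0) for g in ids])
    y_tilde = np.array([y[groups == g].mean() for g in ids])
    sizes = np.array([int((groups == g).sum()) for g in ids])
    return X_tilde, y_tilde, sizes
```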
And with this compressed data, how do you do the learning? Well, again it is
very simple. You don't even need to change the software. You just do the
following: you feed in the compressed data, you treat the (X tilde J, Y tilde
J) as your data, and then you fit the model.
It's essentially the same objective as before. This NJ is simply the number of
examples in the set AJ. And if you think of the sets as all the same size,
this weight cancels with the 1 over N; instead of 1 over N, you have 1 over M,
where M is the number of sets in the partition. But when the sets have
different sizes you want to weight them, because you have a different amount of
information in each set; you want to weight them properly to make the
estimation more efficient. This problem we call P2.
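Written out, the weighted problem P2 on the compressed examples looks roughly like this, with nj the size of set Aj; this is my transcription of the description above rather than the exact slide.

```latex
% P2: weighted regularized risk minimization on the compressed examples
\tilde{h}_M \;=\; \arg\min_{h \in \mathcal{H}}\;
\frac{1}{N}\sum_{j=1}^{M} n_j\, L\bigl(\tilde{Y}_j, h(\tilde{X}_j)\bigr)
\;+\; \lambda\, G(h),
\quad\text{where}\quad
\tilde{X}_j = \sum_{i \in A_j} w(X_i, A_j)\, X_i,\;\;
\tilde{Y}_j = \sum_{i \in A_j} w(X_i, A_j)\, Y_i .
```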
So this is essentially how we solve the learning problem using the compressed
example. Now, the question essentially is, okay, now if we do the compression
this way, obviously this is much more efficient to solve than the original way,
because typically M will be much smaller than N. And the natural question to
ask is, first, does this one work, if you simply average, does this one work?
And the second one is: how does it compare to the simple approach, which is
random downsampling? What if I randomly sample M examples and then run the
method on them? The other related questions are how you choose the partition
and how you choose the weight function. Those are all the questions that we
want to address.
So any questions?
>>: When you said that this is obviously more efficient because there's less
data, I think the right thing to compare would be the time it takes to
minimize plus the time to find the partition, so you need a way to find the
partition which is significantly faster than minimizing.
>> Jian Zhang: Exactly. That's a very good point. In fact, that's indeed the
case. You are right. So you need to consider both the way you find the
partition and the compressed data and how you learn this. I agree with you.
So here is a simple motivation. Okay. I always want to use simple examples to
show this. So we want to start with something simple that motivates things but
still keeps the interesting structure that you want to study. This is a simple
one. So consider the simple linear regression, and in this case we have YI
equals XI beta plus error. And to make it even simpler, let's assume the XI
are multivariate standard normal. The epsilons are independent of X and are
normal with variance sigma squared. In this case we can write down the
closed-form solution if you use the least squares estimator. Doesn't matter,
this one is simpler: you use the least squares estimator and you get a
closed-form solution. This one is unbiased, and the variance, you can
calculate, equals this quantity.
Now, if this seems different from what you've seen, it's because here X is
random. Typically you would have the variance as sigma squared times
(X transpose X) inverse. If you take another expectation over X, because X is
random, that's what we get. It's a measurement of how good it is, how
confident you are about this unbiased estimator. This is what you get for this
simple linear regression problem.
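For the record, the variance being referred to follows from the random Gaussian design: with Xi ~ N(0, Ip) and n > p + 1, the expected inverse Gram matrix is Ip/(n - p - 1), which gives the quantity on the slide.

```latex
\hat{\beta} = (X^\top X)^{-1} X^\top Y,\qquad
\mathbb{E}[\hat{\beta}] = \beta,\qquad
\mathrm{Var}(\hat{\beta})
 = \sigma^2\,\mathbb{E}\bigl[(X^\top X)^{-1}\bigr]
 = \frac{\sigma^2}{n - p - 1}\, I_p .
```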
Now, let's look at what you will get if you use this compressed example.
>>: What does P do?
>> Jian Zhang: Sorry. P is the dimension of the X. How many features. So
now let's consider the following. Suppose that we randomly construct the
compressed samples. By randomly, what I mean is you take the samples from the
standard normal and, for every K samples, where K is N divided by M, you
average them. This is how you get the X tilde and Y tilde: you take the
average. And it is easy to see, because we chose a very simple example, that
in this case the X tilde and Y tilde still satisfy the simple linear equation.
So Y tilde equals X tilde multiplied by beta plus epsilon tilde, where epsilon
tilde is the average of the noise in the previous case.
And in this case, the distribution of X is changed: it's normal with 0 mean and
the variance is reduced to 1 over K times the identity, instead of the
identity. And the error is also i.i.d., and its variance is again changed, to
sigma squared over K. So this is very
simple. And for this case again we can work out the least squares estimator.
So what will happen? If you work out this one, it's unbiased again, and you
get this quantity. Okay. Now, just looking at this one, if you compare this
quantity with the previous one, actually it's a little bit disappointing. Why?
Because if you think about it, it's sigma squared divided by M minus P minus 1.
And what does that tell you? Essentially you get the same performance as if
you randomly sample M examples, which is disappointing. You do a lot of extra
work, you get the same result, so what's the point? But actually if you pause
a little bit and really look at how you get this result, you find something
interesting. Hopefully you agree with me. Which is the following: if you
calculate this variance you get the product of those two terms. The variance
of the error is reduced to sigma squared over K instead of sigma squared, which
is good. This is fairly intuitive: if you average noise, you get better.
You get smaller variance noise. But on the other hand, even though you
improved Y through the compression of the Y, the compression of X also changes
the distribution: the variance of X is also reduced. And it turns out that
that has a bad effect on the estimator. This is also not hard to understand.
If you do regression, think about it: if you have X spread out, it's very
stable; if you push the points together, it gets very, very unstable, and you
get a bad result. Now, it happens that those two effects cancel each other and
you're back to the original random sample.
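The cancellation being described can be written out directly: averaging K random points shrinks both the noise and the design by the same factor, so the two effects offset and you end up with the variance of plain least squares on m random samples.

```latex
\tilde{X}_j \sim N\!\bigl(0,\tfrac{1}{K} I_p\bigr),\qquad
\tilde{\epsilon}_j \sim N\!\bigl(0,\tfrac{\sigma^2}{K}\bigr),\qquad
\mathrm{Var}(\tilde{\beta})
 = \frac{\sigma^2}{K}\,\mathbb{E}\bigl[(\tilde{X}^\top\tilde{X})^{-1}\bigr]
 = \frac{\sigma^2}{K}\cdot\frac{K}{m - p - 1}\, I_p
 = \frac{\sigma^2}{m - p - 1}\, I_p .
```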
>>: This is basic, how can you have a classification problem with labels
[indiscernible].
>> Jian Zhang: Yes.
>>: They're equally balanced.
>> Jian Zhang: Yes.
>>: And you take some nice blocks, the labels are going to average to 0 just
as to time.
>> Jian Zhang: Yes.
>>: So what are you learning? How does that help?
>> Jian Zhang: That's a very good -- I'll talk about it actually. This is a
very good question. I'll talk about it. In fact, it turns out the
classification we don't do compression this way. That's very good question.
But for regression, yes, we do it this way. Now, if you look at this and you
say that these two cancel and you actually didn't gain anything, then this is
pointless to do. But then you realize that it's because of those two effects.
Now the natural question to ask is: can we keep the good thing and get rid of
the bad thing? We want to keep this, because this makes our data better -- you
have more signal in your Y, because you take the average. But you make the X
worse, because it's not as spread out as it was before. Can you do something
better? Now, if you think about it, the answer is yes, you can do this.
How do you achieve this? Well, here's the idea. When you average the data,
instead of taking random samples, you only look at local neighborhoods. When
you look at local neighborhoods, you take the average of Y. You still get the
good part of reducing the variance of Y. Now, because you only take
local neighborhood of X and do the average, what will the data look like after
you take the average of each local neighborhood? They will spread almost the
same as what they were before, and that, if you have the same kind of spread,
that means the variance is almost kept the same as before. So essentially what
I'm saying is if you take local neighborhood, you almost keep the variance,
almost, and you reduce this guy and this whole thing gets improved.
That's essentially -- I hope that's an intuitive example to show why this will
work.
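Here is a small simulation sketch of that intuition, assuming the simple-average compression from earlier; the sizes n, m, p, the seed, and all names are my own choices, and "local" grouping is the one-step k-means style assignment described later in the talk. In runs of this sketch the local grouping typically recovers beta noticeably better than random grouping at the same compression level.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, sigma = 10000, 500, 5, 1.0          # n examples compressed into m groups
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))                   # standard normal design
y = X @ beta + sigma * rng.normal(size=n)

def compress(X, y, groups):
    """Simple per-group averages plus group sizes (as in the earlier sketch)."""
    ids = np.unique(groups)
    Xc = np.vstack([X[groups == g].mean(axis=0) for g in ids])
    yc = np.array([y[groups == g].mean() for g in ids])
    w = np.array([(groups == g).sum() for g in ids])
    return Xc, yc, w

def weighted_ols(Xc, yc, w):
    """Least squares weighted by group size (the P2 weighting for squared loss)."""
    sw = np.sqrt(w)
    return np.linalg.lstsq(Xc * sw[:, None], yc * sw, rcond=None)[0]

# Random grouping: arbitrary blocks of size n/m.
random_groups = rng.permutation(n) % m
# Local grouping: assign every point to the nearest of m random seed points.
seeds = X[rng.choice(n, size=m, replace=False)]
d2 = (X**2).sum(1)[:, None] - 2 * X @ seeds.T + (seeds**2).sum(1)[None, :]
local_groups = d2.argmin(axis=1)

for name, g in [("random", random_groups), ("local", local_groups)]:
    est = weighted_ols(*compress(X, y, g))
    print(f"{name:6s} grouping: ||beta_hat - beta|| = {np.linalg.norm(est - beta):.4f}")
```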
>>: Let me try this. So here this is least squares again. Least squares on
the part, the Y [indiscernible].
>> Jian Zhang: Yes.
>>: So what you've done is you've taken actual labels and averaged them and
you're predicting average labels.
>> Jian Zhang: Yes.
>>: How does this -- test phase you're going to want to predict --
>> Jian Zhang: Yes, exactly. But next I'm going to show that this one works,
in the sense that you do the risk minimization and you can also get guarantees
about the estimation, yeah. So now here comes the analysis, essentially, for
the more general framework. The previous one is a very intuitive example. But
as mentioned by Chris, there are all kinds of questions, like how do you do
classification, why do you predict the average, what if I use different loss
functions, and so on and so forth, and what if I don't use a linear model but a
reproducing kernel Hilbert space. So here comes the setting. And for
simplicity we're going to assume that when you take the average each group is
the same size; it really doesn't matter, you can make it more general, this
just simplifies the notation. So this is the risk, R of H is the risk; this is
the quantity you want to ultimately minimize. And RN is the empirical risk for
the original data. And R tilde M is the empirical risk if you were using the
compressed data. So these are the quantities we're going to use. Here are the
assumptions. Let me quickly go through the assumptions; some are standard, and
for some I'll give you some motivation about why we need them.
The first one is we assume H is a reproducing kernel Hilbert space, which is of
course a very popular choice in machine learning, with a kernel, and we need
this assumption; this is often needed if you look at how people derive things.
This says that the diagonal of the kernel, for any X, is bounded, and we define
this via the Hilbert norm squared. This is not essential; you can change this
to something else, it's not really important. And the second assumption is we
assume the loss function and the functions in H are Lipschitz continuous. This
is an assumption you often see.
And many of the loss functions satisfy this one, or they could satisfy some
restricted Lipschitz continuity, which means it's Lipschitz within a range --
least squares, essentially. That does not make much difference; you can handle
it with a little bit more technical detail.
And the third one -- okay, so this is the one about the weights. How do you
decide what kind of weights you should use? This one is saying that I'm going
to simply take the average: if XI does not belong to the set, the weight is
0 -- it doesn't contribute anything if it's not in your set -- and otherwise it
contributes equally.
Now, this is not essential, I want to say; this is just to simplify the
notation. You can make those arguments asymptotic as well; you can change
that, it does not change the result. The crucial thing is the following:
usually M goes to infinity, and M divided by N goes to 0 as N goes to infinity.
This is the typical kind of assumption you'll see if you do nonparametrics;
that's the standard. The fourth assumption is
probably the most important one, I would say. The fourth assumption is saying
that this quantity is big O of delta N. So what is this quantity? This is
saying that, again if you take W to be the average weight, this is essentially
the average distance between pairs of X inside each set. The D is some
distance measure you pick; you can use the Euclidean one, or you can also use
some others. So this is the average distance over all the X belonging to the
same set, and it is big O of delta N. Here delta N is a sequence that
converges to 0. This means that the average distance -- you can call it the
diameter -- the diameter will converge to 0 as you get more and more data. So
delta N is just a sequence of numbers converging to 0 as N goes to infinity.
And the fifth one is actually quite intuitive. We define something called
sigma H squared, which is the difference between the expected value of the loss
using Y and the loss using the expected value of Y given X. We assume this is
finite. What does this mean? The reason I call this sigma H squared is the
following: if you take the loss to be squared loss and you take H to be a
linear function, this is exactly sigma squared, the noise level. This is the
unavoidable part of the noise; you cannot do anything better than this.
This is a part you have to have.
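In symbols, the quantity being defined is roughly the following; for squared loss it reduces to the expected conditional variance of Y, which is why it is written as a sigma squared.

```latex
\sigma_H^2(h) \;=\;
\mathbb{E}\bigl[L\bigl(Y, h(X)\bigr)\bigr]
\;-\;
\mathbb{E}\bigl[L\bigl(\mathbb{E}[Y \mid X],\, h(X)\bigr)\bigr],
\qquad
L(y,a) = (y-a)^2 \;\Rightarrow\;
\sigma_H^2(h) = \mathbb{E}\bigl[\mathrm{Var}(Y \mid X)\bigr].
```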
And finally we assume this Y is bounded in this case. Now, given those
assumptions -- okay, yes?
>>: The least squares example, the reason why this works is you assume some
i.i.d. Assumption five is hiding that here in some sense, right?
>> Jian Zhang: Yes.
>>: So if we're not i.i.d., as in the previous case, you wouldn't have the
average --
>> Jian Zhang: If you average and they are all the same, if there's
correlation, if they're not properly correlated, you will get some reduction,
but not as big as if they were independent.
>>: And in this more general setting, that's what assumption five is
essentially capturing? Some notion of independence.
>> Jian Zhang: Yes, in some sense, how well this guy can do, the conditional
expected value, compared with this one. Yeah. Okay. So the first result is
about the two problems P1 and P2. Remember we have problem P1, which is the
original thing you would like to do if you were able to do the computation.
Now P2 is the thing we will do after we compress the examples.
How well will P2 do compared to P1? Now, this result essentially gives you
some closeness measure between the problems P1 and P2. So essentially this is
saying that, under the previous assumptions, and assuming the weights are
chosen independent of Y conditional on X -- this was implicitly assumed in the
previous case, the weight is a function of X, not a function of Y -- then we
have the following result. This is saying that you get a uniform bound on this
quantity, uniform over H. This term is the objective you would get if you plug
H into the original empirical loss; and remember this guy does not involve the
penalty term, it's only the empirical loss part. This term is what kind of
loss you will get if you apply H to the compressed examples.
And this sigma H is what we defined earlier; it's constant if you fix H, I
should say that. And then the difference is of this order, in this OP sense.
So this is essentially the result. And the order is related to M and delta N;
delta N is again how fast the diameter of each group converges to 0. So here
are some remarks about this result. So the first one is
we didn't make any parametric assumption about PXY. So it's fairly general.
And the second one is, of course, that it can be applied to both classification
and regression; we didn't say anywhere that Y has to be 0-1 or real-valued, so
it can be applied to both classification and regression. The third remark is
related to the question Chris asked. So there's a gap, sigma H squared,
depending on H. Essentially this is the bad news. This is saying the
following: if you are using the compressed examples to do the learning, you
won't in general get the same thing, because this objective is not the same as
that one; there is a gap. Now, if this gap is a constant, you are fine,
because in the optimization adding or subtracting a constant gives you the same
result, the same order; you expect the same thing. If this gap depends on H,
then that's bad. But there are some special cases. Actually, if you do a
little bit of study you will find, for example, that in the previous case, if
we have squared error and you take a linear form, then this becomes constant --
sigma squared, the noise. It does not depend on H. If that's the case you are
fine, you are good, you can use this result.
What if that's not the case? How do we handle general loss functions, general
convex losses? It turns out that, given this result, a simple modification can
solve it. Here's the modification. Now, this modification is really nothing
more than how you do it for classification, because for classification it does
not make sense, as Chris mentioned, to average the positives and negatives
together -- then what are you trying to learn?
This is saying that for classification you can actually do the compression
conditioned on the class label. So you look at each class and compress it
separately. Essentially, you don't mix examples from different classes.
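A tiny sketch of that per-class compression, assuming a compress() helper like the earlier one and some grouping routine; the names and the function signature are mine.

```python
import numpy as np

def compress_per_class(X, y, make_groups, compress):
    """Compress each class separately so labels are never mixed.

    make_groups(X_subset) -> integer group ids for that subset;
    compress(X, y, groups) -> (X_tilde, y_tilde, sizes), as sketched earlier.
    """
    parts = [compress(X[y == c], y[y == c], make_groups(X[y == c]))
             for c in np.unique(y)]
    X_tilde = np.vstack([p[0] for p in parts])
    y_tilde = np.concatenate([p[1] for p in parts])
    sizes = np.concatenate([p[2] for p in parts])
    return X_tilde, y_tilde, sizes
```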
>>: Can I take you back to regression for a second? Can you go back to the
theory you proved? Seems to me like you're losing a lot. So if you compare
this to you mention the stochastic approaches. So to do the compression, you
still have to touch each example at least once and the stochastic that does one
pass of the data will again touch the data just one time. So you're not really
gaining anything computationally, where we know the stochastic updates will
have a convergence much faster than 1 over square root of N. You'll have 1
over N or 1 over N squared, so you won't have to do multiple passes on data
that's very big.
>> Jian Zhang: First, I agree with you that the stochastic method is very
scalable and efficient in terms of learning. And there are certain advantages
of this one, which I was thinking of mentioning at the end. But the
thing is, first of all, when you do this stochastic, you do need to keep the
data, essentially, when you do for the prediction. If you're really doing
non-parametrics.
The second thing is, on the other hand, for this one, all you need to store,
even if you do non-parametrics, is the small compressed sample, M examples,
compared with storing the original examples. The other thing is that when you
say the stochastic gradient is efficient, typically for certain problems it is
fine to just do one pass if the goal is learning, but if you look at the
optimization objective you often need multiple passes of the data in order to
get a good result.
>>: But your algorithm is an aggressive algorithm that has this 1 over N or 1
over N squared rate. And it's huge. Why would you do that?
>> Jian Zhang: I see. Are you talking about this Nesterov [phonetic] type of
thing? When you say 1 over N, 1 over N squared, you mean the loss of the --
>>: 1 over N, log over N -- think of just stochastic gradient [indiscernible],
your rates are 1 over N, which are already very, very fast. If N is a million,
with 1 over N you're converging very, very fast.
>> Jian Zhang: When you say 1 over N, you mean really that the objective
converges to the optimum objective at the rate of 1 over N?
>>: Yes, which is what you have here, you have R over N.
>> Jian Zhang: Right. Here you're thinking of -- you plug in the solution
into the R and you get this result. Yeah. I agree. Essentially, but as I
mentioned, in that case, when you do the prediction, you actually need to
store a whole bunch of the data if you're doing nonparametrics for the
regression. In this case you don't need to. This is one of the advantages of
using this approach.
>>: Why do you struggle with it?
>> Jian Zhang: What's that, I'm sorry?
>>: Why are you storing all the data? I can see for the final reference --
for central or something.
>> Jian Zhang: If you are really doing some nonparametric, essentially, then
you need to store, let's say, if you're doing a kernel learning, you store the
data corresponding to the nonzeros after you train.
>>: [indiscernible] doesn't it.
>> Jian Zhang: I'm sorry, what's that?
>>: [indiscernible] doesn't store the training data.
>> Jian Zhang: For the triangle you already assume the function is linear,
right? I mean the W. So essentially you put a very strong parametric
assumption there.
>>: I can just enjoy that, I mean it's old-fashioned and not very good. But,
nevertheless, you use something parametric that doesn't store the data.
>> Jian Zhang: Yeah, could be. In that case, I'm not sure. If you don't
think you need to store the data, then I think the stochastic method is, I
guess, a very promising candidate for this case.
But the other thing is that, in some sense, this method is orthogonal to what
people do for online or stochastic gradient learning. So you really can
combine those two methods; there's nothing forcing you to use this one instead
of the other one, or vice versa.
Okay. So this is essentially saying that you have a gap between those two
objectives. And if you are doing classification by compressing both classes
separately, you essentially have the following result. Okay. This is saying
that if you are compressing the classification examples in this way, then for
the difference between the two objectives there is actually no gap. So the gap
becomes 0, and those two get close to each other very well, at the rate of
delta N. So this is essentially the result if you are doing classification.
And, of course, it automatically follows that the risks converge to the same
thing. They will be able to converge to the same thing, since there is a
supremum bound on the difference of the two.
So how about the classifier obtained? Because essentially when you do this
minimization, you are trying to make sure that the risk is minimized. It's not
really about the objective. So those two, if you plug in the previous result,
combining those things, it is not very hard to come up with this result, which
is saying that if you define H star to be the best in class -- the best one in
your class, the one that achieves the minimum risk -- then the excess risk of
the classifier you obtained using the compressed examples, minus the optimal in
the class, is going to be this quantity. So this is the result.
Now, it will be nice to compare this with the standard result: what do you get
using just N samples? If you just use N samples, this is the result you get
under the same assumptions. And this is something stronger -- it gives you the
tail of the probabilities. You can also get a similar thing here, but then you
need a very stringent, probably not realistic, assumption about how you group.
So I give the rate here. And if you look at the rate here, the convergence
rate for this one is N to the minus 2. The lambda here you can choose, so you
can set lambda to be N to the power of minus half, and delta N is actually the
rate of the diameters.
>>: So it's H star [indiscernible].
>> Jian Zhang: Yes.
>>: How about 0, or is that just a typo?
>> Jian Zhang: Which one?
>>: On the left-hand side of the bottom expression.
>> Jian Zhang: This one?
>>: Yeah.
>> Jian Zhang: Yeah. No, it's a supremum over all the H. Sorry, this is a
typo. Sorry. Yeah. So this is essentially the R star; that should be the R
star. Yeah. And so here is some discussion. If we take lambda M to be this
guy, N to the power minus half, then this converges at the rate of the maximum
of these two quantities. And consistency, of course, automatically follows.
And one thing we want to emphasize is that in practice there is often a limit
on M, depending on what type of computer you use, whether it's a personal
machine or a server or some small device. In that sense, if you have a limit
on M, the delta N can be well controlled. Now, if you can take at most, say,
10,000, you can in fact make delta N small by throwing away those examples that
don't fit in. In some sense you are not forced to use all the data,
essentially, if some of it doesn't apply.
And also there are some possible improvements. The rate obtained in Theorem 2
can in fact be improved for specific loss functions, at least for the hinge
loss. Now, for squared loss I can show that that's the rate you get; you
cannot improve it further. But for hinge loss you can actually improve it,
because of the particular piecewise structure of the hinge loss function.
And this is about the risk; in most cases you are interested in the risk. But
sometimes you also want to ask a further question, which is: how about the
convergence of the classifier to the H star, the best you can achieve in the
class?
With respect to a certain metric, of course -- the L1 metric, or an LP metric.
And in order to get such a result, you typically need some identifiability
conditions, because risk convergence alone does not, in general, guarantee the
convergence of the minimizer. So what makes the minimizer converge? This is a
very popular condition proposed by Tsybakov in 2004, called the low noise
condition.
So basically this is saying that the probability mass of points close to the
Bayes decision boundary is small: the probability is less than t to some power.
This is a very nice observation, a very nice condition, because I think in some
sense it points out one important difference between doing regression and
classification.
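The standard form of this low-noise (margin) condition, writing eta(x) = P(Y = 1 | X = x), is the following, for some constants C and alpha > 0; the exact parametrization on the slide is not visible in the transcript.

```latex
\Pr\bigl(\, 0 < \lvert 2\eta(X) - 1 \rvert \le t \,\bigr) \;\le\; C\, t^{\alpha}
\qquad \text{for all } t > 0 ,
```

that is, only a small probability mass sits close to the Bayes decision boundary eta(x) = 1/2.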
If you think about regression, as we know, you can use the same thing, like
least squares, for regression and classification. But when you do it for
regression, you look at the boundary: once you draw a certain boundary, the
density of points around the boundary can be high, because you have that
assumption. Now, this is not necessarily the case if you think realistically
about what the density will look like. The true case might be like this:
there's low density in the region close to the optimal decision boundary. If
this is the case you can make this assumption. And under this assumption it is
not very hard to show that you can actually get the rate of convergence of H
tilde M to the H star with respect to the L1 metric.
>>: Is it the true density or the estimator?
>> Jian Zhang: It is the true conditional density of Y given X. This is the
true one. And in this case I think it's about the rate of convergence for the
risk to the power of R divided by R over 1 minus R.
And from that, the next question we want to answer is how do we obtain those
groups? And can you find a very efficient [indiscernible] because if this
takes a whole lot of time, then it may not be worth it if the argument is about
computation. So can you efficiently obtain the local groups. Here is a simple
algorithm we use, actually. This turns out to be essentially one-step k-means.
And I'll talk about the difference between the algorithm we use for grouping
and the one for clustering on the next slide. The basic idea is you randomly
select seeds from among the examples. Say you have ten million; you randomly
select 10,000. Those will be the seeds. Then you calculate the pairwise
distances between the seeds and the examples, put each example with the closest
seed, and take the average with the weight function you specified. This is how
you do it. Essentially you do one step of k-means. It's very simple.
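A minimal sketch of that one-step grouping, assuming Euclidean distance; the function name is mine. The compressed examples are then just the per-group averages, e.g. via the compress() sketch earlier.

```python
import numpy as np

def one_step_kmeans_groups(X, m, rng=None):
    """One step of k-means: pick m random seed points, assign every example
    to its nearest seed, and return the group ids (no further iteration)."""
    rng = np.random.default_rng(rng)
    seeds = X[rng.choice(len(X), size=m, replace=False)]
    # squared Euclidean distance from every point to every seed, shape (n, m)
    d2 = (X**2).sum(1)[:, None] - 2 * X @ seeds.T + (seeds**2).sum(1)[None, :]
    return d2.argmin(axis=1)
```

For very large n and m the dense n-by-m distance matrix is exactly the M times N cost mentioned below, which is what the hierarchical variant avoids.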
And if you are using that method, with some other conditions, you can show that
the average diameter of each group does indeed converge to 0, in a
probabilistic sense. So if the support is bounded in R to the P, then this
average distance is indeed big O sub P of M to the power of minus 1 over P. So
it does get worse when you have higher dimensions, but it will converge in this
case. As I mentioned earlier, in practice this always improves over the random
down sampling. The reason is that you do not necessarily need to put
everything into a group: if you have a limit on M, you can stop whenever you
feel that the distance is getting large.
Now, the previous algorithm is a very simple one. The time complexity, if you
look at it, is simply M times N, which means that if M is large and N is not
small, then it takes a lot of time to do the distance calculation.
And here's an algorithm that can perform the computation much more efficiently.
This is simply a hierarchical version of what we used earlier. So the idea is
you randomly select some seeds at the top, the root node, and then, based on
the distance, you route the examples into, let's say, ten child nodes, and for
each node you do the same thing again.
The number of children of each inner node is M sub i, and you only need to
choose the M sub i such that their product is M. That's all that is required.
Now, if you do it this way, eventually at the end, at the leaf nodes, you get
small groups of observations, and then you can compress them and output the
compressed examples.
Now, if you compare the two, the first one, as I mentioned, has complexity M
times N, and the second one is essentially N times M to the power of 1 over D,
where D is the depth. So this can be much faster if M is also not a small
number. And in practice what we found is that even for a very large dataset
you can simply take D to be 2 or 3; that's good enough to give you an efficient
grouping algorithm.
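A rough sketch of that hierarchical variant, assuming a tuple of per-level branching factors whose product is (at most) m; the recursion structure and names are my reading of the description.

```python
import numpy as np

def hierarchical_groups(X, branching, rng=None):
    """Recursively route points to the nearest of branching[0] random seeds,
    split each child with branching[1:], and use the leaves as groups."""
    rng = np.random.default_rng(rng)
    groups = np.zeros(len(X), dtype=int)

    def split(idx, levels, next_id):
        if not levels or len(idx) <= 1:
            groups[idx] = next_id            # this subset becomes one leaf group
            return next_id + 1
        b = min(levels[0], len(idx))
        seeds = X[rng.choice(idx, size=b, replace=False)]
        Xi = X[idx]
        d2 = (Xi**2).sum(1)[:, None] - 2 * Xi @ seeds.T + (seeds**2).sum(1)[None, :]
        child = d2.argmin(axis=1)
        for c in range(b):
            next_id = split(idx[child == c], levels[1:], next_id)
        return next_id

    split(np.arange(len(X)), list(branching), 0)
    return groups

# e.g. branching=(100, 100) gives up to 10,000 leaf groups with two cheap passes.
```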
And here's some comparison to the clustering algorithm. As we mentioned, the
first comparison is simply one-step k-means. So the difference essentially
here, the difference from the clustering criterion, is this: when you do
clustering, typically you want to find partitions that group all close-by data
points in one spot.
Now, when you do this, all that is required, if you look at the assumptions, is
that the diameter of each group be small. We didn't say anything about
overlapping. So in other words, you could actually have groups, have sets,
that sit on top of each other.
Now, as long as the diameter within each group is small, you are fine. It does
not matter that you have overlapping. So that's actually -- that's one of the
reasons
which makes it possible that we can do it in one pass. And also it can be
easily computed in a distributed way, because you simply just put it into
several chunks and you do that together and that would be fine. Because for
each of that at least you guarantee that the diameters of each group is small.
There's no requirement that you have to find all the data in the neighborhood
and put it in one set.
Okay.
So here are some experiments.
Yes?
>>: I want to ask a question about your analysis again. So anywhere in the
analysis did you assume that the true function that generates the labels
belongs to the class that you're searching in, belongs to the covered space?
>> Jian Zhang:
No.
>>: Is that somehow implied in one of your assumptions? Are you totally
agnostic in your assumptions?
>> Jian Zhang: I don't think that's assumed, because here we are focusing on
the estimation error part; we don't touch the approximation. If it's not in
the class, you have a large approximation error; that's something in some sense
you can't control. You have to pick a function class which is large enough to
make the approximation error small. The analysis is mainly about the
estimation error.
>>: You can reason about the estimation error either with an agnostic
assumption or a not-agnostic assumption. The agnostic assumption would say I
can use anything, or you could use the fact that the hypothesis --
>> Jian Zhang: If you know that fact, I think you can do something better.
For example, you can use local Rademacher complexities, and if you derive
bounds you can make the bounds sharper. In the case of estimation, that's
where it can greatly help, essentially. If you know the true conditional
probability density is well conditioned, in the sense that it really has low
noise with certain alpha values, then you can get a much faster rate of
convergence for the estimation, even though you have the same rate of
convergence for the risk.
So the methods we're going to compare are the following. The "full" one is
where you basically train using all the training data you have. The "down" one
is where you subsample a subset and train the classifier on it. Now the "comp"
ones are the compression: the first one is where you construct compressed
examples but only use random grouping, so you randomly select subsets and
average them. And this one is where you do the local grouping, using our
algorithm; when you are doing classification, this is done for positives and
negatives separately. And then we also tried to show what happens if you do
the grouping for all examples together, without separating them. Although the
theory does not suggest doing that, we did it anyway.
And the datasets we used are one simulated dataset, four UCI datasets, and two
large-scale datasets. So the first one is simulated, so we know what the data
looks like. We generate data based on this logistic model; we simulate X and Y
given the true parameter. Here you can calculate the Bayes error, which is
.246, and we're trying to compare the performance of the four algorithms using
different N and M settings. So N will be the full set, and M will be how much
you compress, or how much you want to sample in the case of down sampling. We
try to keep M a small number in this case. You will see that the Bayes error
is .246. The "full" one, of course, is the best performing, if you were able
to computationally use all the data. This one actually does pretty well if you
use the local grouping; it does pretty well compared with the full one, and
also compared with the Bayes error, actually.
Even if you have just M equals 20 -- that is, you're allowed to do a lot of
compression -- you're doing very well in fact in this case. The down sampling
is not performing great, because you're losing a lot of information. And this
one, where you don't condition on the Y labels, also converges, but it does not
do as well as when you separate the Y labels.
So this is a very simple toy example I wanted to show, and indeed it performs
quite well. We also tried this on the UCI datasets. So here are some
statistics; those again are not large datasets, they're pretty small, at most
medium-sized. We use LIBSVM for this, with an RBF kernel, and do a tenfold
cross-validation. Because the datasets are not large anyway, we set D equal to
1; that can easily handle it. And we change the setting of M over N, so this
can be thought of as the compression ratio -- how much do you want to
compress -- and it goes from .1 to .5. And here are the results for the first
two datasets, E. Coli and glass. You can see the performance.
The bottom curve is the one you obtain using the full set, so you can see this
as the upper bound on the performance you can achieve using the whole set. And
in both cases here, the compression with local grouping does the best among the
other three. And even though in the linear regression case it did not show any
advantage, here the pink one, the compression with random grouping, performs
better than down sampling for those two datasets. But still, of course,
they're not as good as if you compress using the local grouping information.
And this is for the other two datasets. Now in this case, for this one -- this
is interesting -- this is, I think, the spam-base dataset. And in this case
the down sampling is performing better; it's actually pretty close to the local
grouping, except for the tail part. And it's also interesting to see that this
one actually does better in the end, when you have something like .5, which
means you compress every two examples into one; it actually does better than if
you use everything.
>>: Is it a pretty unbalanced dataset?
>> Jian Zhang: Which one?
>>: Is it a pretty unbalanced dataset?
>> Jian Zhang: This spam base? The spam base I think is two classes. I can't
remember exactly if it's very unbalanced or not.
>>: Not unbalanced, but would you want to adjust the grouping methodology
to --
>> Jian Zhang: To have a different group size?
>>: Possibly, yeah. That's one way to do it.
>> Jian Zhang: I think it's possible, in practice, at least. I think we tried
it the same, for both positive and negative.
>>: I feel like if you have a rare class, compressing the rare class is much
more costly than compressing --
>> Jian Zhang: Yeah, I agree. And this is the result. This is a result
applied to two, I would say, large datasets. So the first one is an image
dataset; it has about 75,000 images, captured from some broadcast. And the
task is actually trying to find the binary label, whether a person appeared in
the image. And the second one is the alpha dataset; this has half a million
training examples. This is the one we actually took from the machine learning
large-scale challenge; it's one of the datasets they used for the challenge.
And here we use what is suggested by the large-scale challenge: we essentially
report, rather than the classification rate, the area under the
precision-recall curve. So the Y axis will be the area under the
precision-recall curve. And this is actually the result again; here, because
the dataset is quite large, we can actually compress quite a lot. This one,
for example, goes from .001 to something like .1 or .2, and this one is from
.01 to .2. So as you vary this, you can see how those methods perform relative
to each other for this dataset. This one I think is a little bit weird; I
suspect this data is a synthetic one, probably just randomly generated, because
they don't mention where it comes from.
>>: Are the groupings, other than the class label, are they identical for the
positive/negative region? In other words, is the partition essentially I
choose this block of space and I'm going to have one group for the positive
data average out in that part of the space and one group for the negative data,
or can they just be disjointed at the positive representative that came from
this point and the negative came from this point and they may or may not
overlap.
>> Jian Zhang: I think this depends on the data distribution, in fact. It
depends on how the data looks. They can overlap.
>>: In your experiment, what did you do? It has implications in terms of what
you're able to recover.
>> Jian Zhang: Yes, it can overlap. Yes. The algorithm -- for the algorithm
it can overlap.
>>: You're actually partitioning.
>> Jian Zhang: I did not look hard at the partitioning, because it's not a low
dimensional space. For example, if you have 150 features, it's really in a
high dimensional space.
>>: How did you define the partitions, then?
>> Jian Zhang: The partitions we defined are essentially, if you think about
the one-step k-means: if you have a million examples and you want to come up
with 1,000 groups, you randomly sample 1,000, and then you group the rest based
on distance. This way you can do it in a very efficient way. Okay.
To conclude: the goal here is that we want to reduce the training and test time
and the storage for learning with massive datasets. This method is very easy
to compute, and you can also use existing packages; you don't need to change
anything, as long as you come up with the compressed examples. And you can
obtain consistency and rate-of-convergence results. And we showed, both
empirically and theoretically, that it can be much better than random sampling.
And the grouping can be done very efficiently, and in a distributed way in
fact; even though we didn't do it in a distributed way, there's nothing in our
case preventing it -- you can do it in a very distributed way for a very large
dataset.
Finally, as I mentioned earlier, this could be combined with other methods to
really handle very gigantic datasets, if you really have such datasets, because
this method is kind of orthogonal to the online versions, like online learning
or stochastic gradient, because you can compress the data and apply those
methods anyway and they don't conflict with each other. So that's all. Thank
you.
[applause]