>> Ming-Wei Chang: Hi, so it's my pleasure to introduce Kai-Wei here today. Kai-Wei is currently a graduate student at the University of Illinois Urbana-Champaign. He has done several impressive pieces of work on solving large-scale linear classification models such as SVMs, and he's also one of the main contributors to the widely used package LIBLINEAR. Today, he's going to talk about his work on training large-scale SVMs when the data cannot fit into memory. Thank you.
>> Kai-Wei Chang: Okay. Thanks, Ming-Wei, for the introduction. So today I would like to talk about how to solve large-scale linear classification when the data cannot fit in memory. Part of this work was done with [indiscernible] at National Taiwan University, and another part of this work was done with my advisor, a professor at UIUC.
So let me first give you some motivation for this work, then I will describe the approach that we use for this problem, and then I will describe another method for solving the same problem.
So here is the motivation. As we know, in recent years the size of the data has become larger and larger. Here is an example. If we look at the data mining challenge, KDD Cup, you can see that in 2004 the largest dataset was only 40 megabytes. In 2009, the dataset became 1.9 gigabytes. And in 2010, the largest dataset was even larger; it became 10 gigabytes.
Also, significantly large amounts of data appear in several domains, such as spam filtering, data stream mining, and web mining. Several internet companies report that they have large data needs. Usually, if you use more examples and more expressive features, you can get better performance, so that's why people care about large data.
Among the several methods for dealing with large data, linear classifiers are one of the popular choices. Here, by linear, we mean that we solve the model in the original feature space, and we do not map the data into a high-dimensional feature space. For certain problems, the linear model can achieve accuracy similar to a nonlinear model, but the training time and testing time are much smaller. This is especially true in the situation where the number of features is large.
So here, it shows an example where we train a linear SVM and an SVM with the RBF kernel on a number of datasets. You can see that for some datasets there is a gap in accuracy between the linear model and the nonlinear model. However, for the large datasets, the accuracy difference between the linear model and the nonlinear model is very small, while the training time for the linear model is much, much smaller than for the nonlinear model. And the time here only includes the time assuming the data is already in memory. So it is [indiscernible] what I want to talk about in this work.
>>: So can you get accuracy of 99.7?
>> Kai-Wei Chang: Which one?
>>: The final column, mnist 38, the first row. 99.7, I've never seen that result before.
>> Kai-Wei Chang: You mean the [indiscernible]?
>>: No, no. In the first row.
>> Kai-Wei Chang: Oh, the 38.
>>: To the right.
>> Kai-Wei Chang: Oh, this is binary classification only on the categories 3 and 8. I guess you are talking about the [indiscernible] situation. All of these are binary classification results.
>>: Is linear constrained to only use the current variables and not try any combinations or feature induction?
>> Kai-Wei Chang: Oh, here I am just using the original feature space. So --
>>: Do you know what the number would be with feature induction involved, if you do linear on the induced features?
>> Kai-Wei Chang: You mean using a degree-two polynomial expansion or something like that?
>>: Yeah, something.
>> Kai-Wei Chang: Some combination of the features?
>>: Whatever your favorite induction technique is.
>> Kai-Wei Chang: Yeah, actually, we have a paper that describes that. You can use a polynomial expansion, and then you can train a linear model on that induced feature space. So we have --
>>: [inaudible] biggest losers that you have right now, which looks like the covtype and webspam, did those recover quite a bit?
>> Kai-Wei Chang: Oh, yeah. So for example, if you use trigrams on webspam, then you can get about 99.3, even using a linear model on webspam with trigrams.
>>: This is using trigrams only?
>> Kai-Wei Chang: No, these are using bigrams only.
>>: So --
>> Kai-Wei Chang: Uni means unigram.
So the numbers are not using bigrams?
>> Kai-Wei Chang: Yeah, these numbers are not using bigrams. This is unigram. I think your comment is true: if you do some feature induction here, then you can get a better result, closing the gap between the linear model and the nonlinear model. And the training time can still be smaller than using a nonlinear model. That's true.
Okay. So another observation here is that if we assume the data is already in memory, the training time here is pretty small. The solvers for the linear model are already well developed: for example, this one proposed using [indiscernible], this one uses the [indiscernible] method, and we also proposed a method used in LIBLINEAR. However, if you directly apply these methods in the situation where the data cannot fit in memory, then the training time will be very, very large because of disk swapping.
So here I show you a figure. The X axis is the size of the dataset, and the Y axis is the training time. We train the model on a machine with only one gigabyte of memory, and the green line here shows the actual memory that we can use for this method. You can see that when the size of the data gets close to the memory we can use, the training time increases rapidly, and this is because of disk swapping.
>>: [inaudible] the initial loading of the data from the disk, or is it assumed that you had a warm start, or what is the setup here?
>> Kai-Wei Chang: Here we just run LIBLINEAR on the data, and we don't do anything special. So it takes care of loading the data and also the disk swapping [indiscernible].
>>: Does LIBLINEAR use threads to do additional loading of data in the background?
>> Kai-Wei Chang: No, no, no. It's just one thread; we're just using one thread. Okay. So here, we model the training time as two parts. The first part is that we need to move the data from the disk to memory, and the second part is that we need to update the model using the data in memory. In the situation where the data can be stored in memory, the previous works assume that the data is already loaded, so they only focus on the second term and ignore the first. However, we argue that in our situation the first part can dominate the training time.
So here's an example. If you run LIBLINEAR on the [indiscernible] dataset, which has about 500,000 instances and is about one gigabyte, it takes about one minute to load the data into memory but only about five seconds to solve the model. And people only report that they solve the model in five seconds. In the situation where the data cannot fit in memory, you cannot just load the data into memory once; you might need to load only a portion of the data at a time. In that situation, the cost of disk access might be even higher.
Okay. So our goal here is to construct a large-scale linear classifier that handles the situation where the data size is larger than the memory capacity. Here we only focus on training on one machine. We make assumptions about our data: we assume that the data size is larger than the memory capacity, but that the data can be stored on the disk of one machine. We also assume that for this data, using a sub-sampling technique will cause low accuracy. And we will show in the experiments that the data we are using are [indiscernible] by these two assumptions.
So here are the conditions for a viable method in my mind. First, because the data can only be stored on disk, we must avoid random access to the disk. Therefore, the method can only load a contiguous chunk of the data from disk at a time. Second, we are looking for an exact solution of the linear model, so we require that the optimization procedure converges toward the optimum. This is not about just getting an approximate solution when training the model. Third, because we know that disk access is very expensive, we should reduce the number of times we access the data from disk, so the number of [indiscernible] should be small.
And also, we are trying to find a simple solution for this situation, because we want to support several functionalities, such as multiclass classification using a one-versus-all strategy, parameter selection, and other functionalities such as incremental and decremental learning.
>>: [inaudible] does it mean like mini-batch methods? You're assuming that mini-batch methods would have no accuracy?
>> Kai-Wei Chang: No, I'm assuming that if you just randomly select samples from the whole dataset and then train the model on that random sample, you will get a lower accuracy than training on the entire dataset.
Okay. So in this talk, I focus on training the linear SVM, although our method can also apply to [indiscernible] and other similar formulations, such as the multiclass formulation by [indiscernible]. Here, we are given training data (y_i, x_i), where x_i is an n-dimensional vector and y_i is the label, which can be positive one or negative one. Here n is the number of features and l is the number of data points.
In the SVM formulation, the variable is w, so the number of variables is equal to the number of features. The first term is what people call the regularization term, the second term is the loss function, and C is the parameter that balances these two terms.
It is well known that this primal formulation is equivalent to the dual SVM formulation. In the dual, the variable is alpha, and the number of variables is equal to the number of instances; each variable alpha_i corresponds to one data point x_i. The objective function of the dual SVM is a quadratic function, and the constraints here are bound constraints. The Q matrix is given by Q_ij equal to y_i y_j times x_i transpose x_j, so the (i, j) entry of Q is related to the i-th instance and the j-th instance of the data.
At the optimum, there is a relationship between the optimal solutions of the primal and the dual: the optimal w is equal to a linear combination of the data.
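Since the slide formulas are not captured in the transcript, the following is a minimal restatement of the standard L1-loss linear SVM primal and dual that matches the description above; it is a reconstruction, not a copy of the slide.

```latex
% Primal linear SVM: one variable per feature.
\min_{w}\ \frac{1}{2}\,w^\top w \;+\; C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i\,w^\top x_i\bigr)

% Equivalent dual: one variable per training instance, with bound constraints.
\min_{\alpha}\ f(\alpha) = \frac{1}{2}\,\alpha^\top Q\,\alpha \;-\; e^\top \alpha
\qquad \text{subject to}\ 0 \le \alpha_i \le C,\ i = 1,\dots,l,

% where Q_{ij} = y_i\,y_j\,x_i^\top x_j and e = [1,\dots,1]^\top.
% At the optimum the primal and dual solutions are linked by
w = \sum_{i=1}^{l} y_i\,\alpha_i\,x_i .
```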
Okay. So let me talk about the first method we proposed. The idea here is pretty simple: because we cannot load all the data into memory, we split the data into several chunks, and at each step we load only one data chunk into memory. So our algorithm is that we first split the data into several blocks and store them in several files accordingly. Then, starting from an initial model, we loop until the model converges: at each step, we load one data block from the disk into memory, conduct some operations on this data to update the model, and repeat until the model converges. I know I haven't talked about the details yet; I will talk about the details later.
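As a rough illustration of the loop just described, here is a minimal sketch; the `load_block` and `update_model_with_block` helpers are placeholders for the storage format and sub-problem solver discussed later, not the actual LIBLINEAR code.

```python
def block_minimization(block_files, model, load_block, update_model_with_block,
                       max_outer_iters=50, converged=lambda model: False):
    """Outer loop of block minimization: the data is pre-split into blocks
    stored as files; each pass loads one block at a time and updates the
    model using only the data currently in memory."""
    for _ in range(max_outer_iters):
        for path in block_files:
            X_block, y_block = load_block(path)                  # one contiguous chunk from disk
            model = update_model_with_block(model, X_block, y_block)
        if converged(model):                                     # e.g. duality gap small enough
            break
    return model
```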
This algorithm is related to the block minimization method, which is a classical optimization method used in the data mining and machine learning areas. In block minimization, they consider a block of variables at a time, and it has been widely used to solve many formulations, such as the linear SVM. Here, in our situation, the data is larger than the memory, so unlike previous work, we cannot afford random access to the disk and cannot do [indiscernible] variable selection. So here, we just use a fixed partition of the variables.
So there are two remaining issues here. First, we need to decide the block size that we use. Second, we need to specify which operations we [indiscernible] to update the model. Let me go to the first issue first.
>>: Can you go back two slides? One more slide. Over here, for the Q -- one slide back here. I think according to what is right there, the main problem is the Q, the Q square.
>> Kai-Wei Chang: Yeah, yeah.
>>: So but that problem is already solved, you assume. So the problem is the data to memory, or the square of the data that's [inaudible]?
>> Kai-Wei Chang: No, we assume the data does not fit in memory, so the Q matrix also cannot fit in memory.
>>: But you assume [indiscernible].
>> Kai-Wei Chang: For the Q? Yeah, for the Q, there are several methods using [indiscernible], and in the linear situation we have a dual [indiscernible] that basically only considers one [indiscernible] at a time.
>>: So the Q might be solved.
>> Kai-Wei Chang: Yeah, yeah. And I will also talk about this later.
>>: Okay.
>> Kai-Wei Chang: So basically, we don't compute all the Q entries and store them in memory; we compute an entry of Q at the time we need it, yeah.
Okay, let's go to the first issue. Here, we make the assumption that each block has a similar size. And here is our conclusion: the block size cannot be too large, because each block of data needs to fit in memory, and the block size also cannot be too small; otherwise, we might need a lot of time to load the data.
And here is our [indiscernible]. Here, we use a simple approximation of the [indiscernible], so we only consider the cost of one outer iteration. The cost of one outer iteration is this formulation: it can be divided into two parts. The first part is the time to load each block into memory, and the second part is the time for the [indiscernible] operation on the data in memory, multiplied by the number of blocks that we have.
In the literature, when people analyze this, they only consider the second term. In that situation, because the operations conducted on this part of the data are usually more than linear in the block size, they reach the conclusion that using a smaller block for the update is better. That is why LIBLINEAR chooses [indiscernible] block sizes, and in [indiscernible] SVM they choose block size N.
However, here the story is different, because we also need to consider the first term, and moreover, the first term almost always dominates the second term. In this situation, loading a block of data from disk needs some initial [indiscernible] cost, plus another term related to reading the data from disk one by one, and this second term is [indiscernible] to the size of the data.
So for an outer iteration, the loading time becomes the initial cost of the access times the number of blocks, plus a term that depends on the total size of the data and is not related to the block size. Only the first term is related to the block size, and in this situation you can see that using a large block size is better.
So we reach the conclusion that when the data cannot fit in memory, we should use a block size as large as possible, and we will also show in our experiments that this is empirically true.
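A hedged reconstruction of the cost argument (the exact expression on the slide is not in the transcript): with m blocks B_1, ..., B_m of roughly equal size, l instances in total, a start-up cost t_init per disk access, and a per-instance read cost c,

```latex
T_{\text{outer}}
  = \sum_{j=1}^{m} \Bigl( T_{\text{load}}(B_j) + T_{\text{update}}(B_j) \Bigr),
\qquad
T_{\text{load}}(B_j) \approx t_{\text{init}} + c\,|B_j| ,

% and since the block sizes sum to l,
\sum_{j=1}^{m} T_{\text{load}}(B_j) \approx m\,t_{\text{init}} + c\,l .

% The c l term is fixed by the data size, so only the m t_init term depends on
% the number of blocks: fewer, larger blocks give a smaller loading cost.
```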
The second issue is that we need to specify how to update the model using a block of data, and here we propose two methods. The first one is to solve the dual SVM by using LIBLINEAR on each block, and the second one is to solve the primal SVM using stochastic gradient descent on each block.
Okay. So for the first method, if we look at the dual function, as we said, each dual variable corresponds to one data point, so there is a natural correspondence between a block of data and a block of variables.
So here, we want to solve a sub-problem in which we update those [indiscernible] variables and fix the rest. In this sub-problem, keeping the current model, we are trying to find an update [indiscernible] such that the variables corresponding to data not in memory are fixed. Also, the update needs to satisfy the bound constraints. And if you [indiscernible] into this formulation, then you get this sub-problem. So next, we want to verify that solving this sub-problem only needs the data that is already in memory.
So let's look at this sub-problem. As a reminder, the (i, j) entry of Q is related to the i-th instance and the j-th instance. Okay, so what is the first term here? The first term is basically this [indiscernible], and this whole block is a sub-matrix of Q. You can see that accessing these entries only involves the data that is already in memory, so this is good. The last term here is basically a constant, so we can also ignore it. The problem is the second term, because in the second term we have the [indiscernible] between the [indiscernible] of Q and the [indiscernible]; basically, this one requires access to all the data. However, we find that we can use a trick of maintaining a temporary vector w so that this [indiscernible] only requires the data in memory.
So here are some details. Basically, we use a [indiscernible] proposed in our paper: by maintaining this temporary vector w, you find that (Q alpha)_i is actually equal to y_i times the inner product between w and x_i.
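In symbols, the identity being described is the standard one for linear SVMs (a reconstruction using the notation defined earlier, not a copy of the slide):

```latex
% Maintain w = \sum_j y_j \alpha_j x_j while updating \alpha.
(Q\alpha)_i = \sum_{j=1}^{l} y_i\,y_j\,x_i^\top x_j\,\alpha_j
            = y_i\,x_i^\top \Bigl(\sum_{j=1}^{l} y_j\,\alpha_j\,x_j\Bigr)
            = y_i\,w^\top x_i ,

% so evaluating (Q\alpha)_i needs only x_i, and after changing a single
% \alpha_i the vector w stays consistent via
w \leftarrow w + (\alpha_i^{\text{new}} - \alpha_i^{\text{old}})\,y_i\,x_i ,

% which again touches only the instance x_i currently in memory.
```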
>>: Does that trick apply to nonlinear SVM too?
>> Kai-Wei Chang: Actually not, because we need to use the property that [indiscernible] -- yeah. So the idea is that if you just substitute this into this one, then you get this equation.
Now, if you look at this, getting the [indiscernible] actually only involves the corresponding instance, so computing [indiscernible] only involves the data that is already in memory.
>>: So you don't have any solution for nonlinear?
>> Kai-Wei Chang: Yeah, we don't have that, no. If we had it, [indiscernible] classification, yeah. Also, there is a good property here: because we only update the alphas corresponding to data already in memory, updating this w only involves data that is already in memory. So basically, after this slide, we can safely say that to solve this sub-problem, we only need the data that is already in memory.
So now we can plug this into our previous algorithm. For solving the dual SVM, after we load a block of data into memory, we approximately solve the sub-problem to obtain the update direction. Then we use this direction to update our model and also update the temporary vector w.
>>: So I think something here is that no matter how you block the data, the final result will be the same because it's [indiscernible].
>> Kai-Wei Chang: Yeah, yeah, yeah. Also, we have proved the convergence.
>>: But does it take longer to converge if you make the blocks too small? Here, you just assume [indiscernible] access time [inaudible].
>> Kai-Wei Chang: Like you --
>>: You have more blocks there, it may take longer.
>> Kai-Wei Chang: Theoretically, the convergence rate is the same. However, in practice, if you make the blocks too small, then you will have a lot of blocks and, within each block, the data are not communicating with each other, so the convergence will be slower. That's true.
Okay. So here, our sub-problem can be solved by any bound-constrained method, and we used LIBLINEAR to solve it; LIBLINEAR implements a coordinate descent method.
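For concreteness, here is a minimal sketch of one LIBLINEAR-style dual coordinate descent pass over the block in memory, using the maintained vector w; the variable names and the fixed number of passes are illustrative assumptions, not LIBLINEAR's actual implementation.

```python
import numpy as np

def dual_cd_on_block(X_block, y_block, alpha_block, w, C, passes=10):
    """Coordinate descent on the dual variables of one in-memory block (L1-loss SVM).
    X_block: (n_block, n_features), y_block: labels in {-1, +1},
    alpha_block: dual variables for this block, w: maintained primal vector."""
    Qii = np.einsum("ij,ij->i", X_block, X_block)        # diagonal entries Q_ii = x_i^T x_i
    for _ in range(passes):                               # fixed-pass stopping condition
        for i in range(X_block.shape[0]):
            if Qii[i] == 0.0:
                continue
            grad = y_block[i] * np.dot(w, X_block[i]) - 1.0               # (Q alpha)_i - 1
            new_alpha = min(max(alpha_block[i] - grad / Qii[i], 0.0), C)  # project to [0, C]
            delta = new_alpha - alpha_block[i]
            if delta != 0.0:
                w += delta * y_block[i] * X_block[i]      # keep w = sum_j y_j alpha_j x_j
                alpha_block[i] = new_alpha
    return alpha_block, w
```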
In practice, we usually only find an approximate solution of the sub-problem, so here we need to say something about the stopping condition and also the convergence.
We propose two approaches for the stopping condition when solving the inner sub-problem. The first is to use a fixed number of passes over the data in memory. For example, we go through all the data in memory five or ten times and update the model using each instance one at a time. The second is to define some gradient-based stopping condition. For example, we can use the one in LIBLINEAR, which uses the norm of the projected gradient: if the norm of the projected gradient is less than 0.1, then it stops. And, of course, you can use some combination of these two stopping conditions. We can prove that convergence holds for both approaches.
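As a sketch of what the gradient-based condition can look like for the bound-constrained dual (the 0.1 threshold is the one mentioned above; the rest is the standard projected-gradient definition, not LIBLINEAR's exact code):

```python
import numpy as np

def projected_gradient_norm(grad, alpha, C):
    """Norm of the projected gradient for min f(alpha) subject to 0 <= alpha <= C."""
    pg = np.where(alpha <= 0.0, np.minimum(grad, 0.0),
         np.where(alpha >= C, np.maximum(grad, 0.0), grad))
    return np.linalg.norm(pg)

# Stop the inner solver once the projected gradient is small, e.g.
# if projected_gradient_norm(grad, alpha_block, C) < 0.1: stop.
```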
Okay. So next I want to talk about how to solve the primal SVM in this framework. In the primal SVM, as we know, each variable corresponds to one feature, so there is no relationship between the variables and the [indiscernible] instances. However, we can use stochastic gradient descent, because stochastic gradient descent only uses a small set of data to update the model. This is very closely related to online learning methods. Here we consider Pegasos for the experiments. In Pegasos, there are [indiscernible] modes: one is that they can update the model using a block of data, and the other is that you can update the model using only one data point.
>>: This is primal?
>> Kai-Wei Chang: This is primal.
>>: Why do you want to use primal? [indiscernible] is it better to use primal?
>> Kai-Wei Chang: No.
>>: That's what I thought.
>> Kai-Wei Chang: The reason I show this slide is that people doing online learning usually think that we can do this in the primal using some online learning method. So in this comparison, we want to show that the dual method we propose is actually faster than this primal method.
>>: The dual method can also be used with online learning.
>>: You mentioned it can be used in online learning. [inaudible].
>> Kai-Wei Chang: Yes, this can be used online, yes, that's true. But I mean, it can have an online version, but the -- yeah. That's right.
>>: This is just for comparison.
>> Kai-Wei Chang: Yeah, it's for comparison, yeah.
>>: [inaudible].
>> Kai-Wei Chang: That's true.
>>: Although there are some papers I have read where there's a great advantage to using the primal, maybe in the kernel, nonlinear version. [inaudible] the second chapter, they talk about how the primal method may be better.
>> Kai-Wei Chang: I think for different datasets it might be different; you might get different conclusions. For example, we have an observation that if the number of data points is very large, then you will have a lot of variables in the dual form. In that situation, sometimes using the primal method will be better.
>>: Although you might be --
>> Kai-Wei Chang: Oh, yeah. Yeah, it's only [indiscernible] consideration.
>>: Using the primal method, actually you don't need the blocks. You can stream the data from the file and [indiscernible]. If you use [indiscernible], then actually your [indiscernible] will be the speed of the disk.
>> Kai-Wei Chang: Yeah, that's true. Yeah.
>>: Can you talk about that?
>> Kai-Wei Chang: Yeah, I haven't shown you the result using multithreading. But the [indiscernible] is the disk. So we are arguing that in the primal method, if you use those online methods, they only consider one data point at a time. Then you load the data from disk once, update the model only once with it, and then you need to throw away this sample. So in this situation, you need multiple data accesses from disk.
>>: [inaudible].
>> Kai-Wei Chang: It's not a theoretical result. I mean, in practice this will be slower because you need multiple accesses from disk.
>>: So you need to go through the data more than once?
>> Kai-Wei Chang: Yeah, yeah. Actually, at the end of this talk, I will show a result where we go through the data only once, and the method I propose gets almost the same accuracy as going through the data many times, but the primal method gets a suboptimal solution.
>>: Using the typical effects like [indiscernible], wouldn't you converge fairly fast just because the learning rate will become small enough that you don't need another pass?
>> Kai-Wei Chang: No, depending on --
>>: [inaudible].
>>: Memory, right?
>>: Yeah, the memory --
>>: This is like 30 gigs plus, right? So this is the --
>>: It's supposed to have a learning rate of [indiscernible]. So it depends whether you're running out of memory because of the number of features or the number of examples. If you're running out because of the number of features --
>>: But if it's sparse, then you likely --
>> Kai-Wei Chang: Of course, it also depends on the data.
>>: [inaudible].
>>: The [indiscernible] of Pegasos is the [inaudible]. So if you have a giant disk where it's [inaudible], with a learning rate of [indiscernible] you can't get too far from [indiscernible]. So the only case is where your dataset is very large because every example takes a lot of memory. Say every example takes one meg; then even if you have 1G, it's just [indiscernible]. At that point, you really, really want to use every bit of information that you have. But if this is not the case, I think the theory suggests that it would be very, very close to [indiscernible].
>> Kai-Wei Chang: Yeah, and also, you know, as we say in our paper, we use the dual [indiscernible] method, and we also compare the convergence rate between our method and Pegasos. The difference is that if you use the dual [indiscernible] method, because it is a [indiscernible] method, the convergence rate will depend on the size of the data, while in Pegasos the convergence rate does not depend on the size of the data. But if you consider the convergence rate in terms of the accuracy of the solution, I mean the [indiscernible] between the current model and the optimal model in [indiscernible] value, then the convergence of the [indiscernible] method is faster than Pegasos.
>>: How did you find the distance from the optimal objective? Did you do a [indiscernible]?
>> Kai-Wei Chang: Oh, yeah. That is a theoretical analysis. What do you mean by --
>>: [inaudible] trying to find the distance from the [indiscernible], how did you choose your optimum? How did you know what the optimum is?
>> Kai-Wei Chang: I didn't get that.
>>: I mean, when you find the absolute distance of your [indiscernible], how do you calculate the [indiscernible]?
>> Kai-Wei Chang: You mean in practice or in theory?
>>: When you actually performed the experiments.
>> Kai-Wei Chang: Oh, we basically run the algorithm on the data several times until the duality [indiscernible] is small enough.
Okay. So we are arguing that for an algorithm like Pegasos, when you load a block of data into memory, you can only perform one update using one data point from the data in memory; you cannot perform several updates on those data. That is because those methods really assume that the data is uniformly [indiscernible] from the whole dataset. So if you update on the data in memory many times until it converges, then you will converge to a model that is only trained on this part of the data. But in the dual method, because the data points correspond to the variables, you can exactly solve the sub-problem. So that is the main difference here.
Okay. So here are some implementation issues. The first one is that we find that if you compress the data and store the compressed data on disk, then the loading time can be smaller. This has been shown in several packages; in [indiscernible] they also do this kind of thing.
>>: These are in binary or in plain text?
>> Kai-Wei Chang: In binary, because we want to save space. The second one is that if we just split the data, then for some datasets the [indiscernible] data is ordered by label. So if you just split the data sequentially, you will get blocks with only one label, and that will kill the [indiscernible]. So we say a random split in the initial phase is needed.
So we need an algorithm to do the [indiscernible] split and data compression in the situation where the data is larger than memory; we show one such method in our paper. We also show in the paper that our method is simple enough that you can support several functionalities, such as cross validation, multiclass classification, and incremental and decremental learning.
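As an illustration of these two implementation points (a random shuffle before splitting, and compressed binary blocks), here is a small sketch; the pickle/zlib choice is just one possible realization, not what the released code uses.

```python
import pickle
import random
import zlib

def split_into_blocks(instances, num_blocks, prefix="block"):
    """Shuffle instances so labels are mixed, split into roughly equal blocks,
    and write each block to disk in compressed binary form."""
    random.shuffle(instances)                    # avoid blocks containing a single label
    block_files = []
    for b in range(num_blocks):
        block = instances[b::num_blocks]
        path = f"{prefix}_{b}.bin"
        with open(path, "wb") as f:
            f.write(zlib.compress(pickle.dumps(block)))   # smaller files, cheaper to load
        block_files.append(path)
    return block_files

def load_block(path):
    with open(path, "rb") as f:
        return pickle.loads(zlib.decompress(f.read()))
```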
So let me show you some experimental results. These are the several datasets that we experiment on. The largest dataset here is webspam, which is about 16 gigabytes, and we use it on a machine with one gigabyte of memory, so the largest dataset is about 16 times larger than the memory. Although some people will say that now we have much larger machines, we are trying to [indiscernible] this situation on a machine with only one gigabyte. So [indiscernible] data. Here we show that on these three datasets, if you just do some random subsampling, then you will get suboptimal results. The X axis is the percentage of the whole dataset we used, and the Y axis is the accuracy difference to the best accuracy on this dataset. And you can see that --
>>: I have seen many algorithms that allow you to select data optimally.
>> Kai-Wei Chang: Yeah, yeah. That's true.
>>: So that's what we use to compare rather than --
>> Kai-Wei Chang: Actually, the second part of my talk will relate to this. Yeah. With this one, I just want to show you that if you just take a subset of the data that can fit into memory, then you will get a suboptimal result.
And we compared these four methods. The first one is the block minimization method solving the dual using LIBLINEAR, where we use LIBLINEAR to go through all the data in memory ten rounds as the stopping condition. The second one is Pegasos, where at each step Pegasos picks one data point in memory and does an update. LIBLINEAR is the one where we just use standard LIBLINEAR to train the model. And we also compared to VW, which is a well-known online learning package; we used version 5.1.
>>: And when you go through all the data points, do you randomly choose?
>> Kai-Wei Chang: We go through them one data point at a time, yeah.
>>: [inaudible] Pegasos?
>> Kai-Wei Chang: Yeah, yeah, yeah.
>>: Does it help?
>> Kai-Wei Chang: Yeah, okay. I don't know. Sorry. I just use the package.
>>: It costs a lot and the option does not help.
>> Kai-Wei Chang: It costs a lot?
>>: It's expensive and it's known to not actually help much in many applications. It costs a lot in terms of cycles, I'm saying.
>> Kai-Wei Chang: Oh, okay.
>>: Once you get a decent [indiscernible], it doesn't matter.
>> Kai-Wei Chang: Yeah, yeah. Disk is very -- that's true. Okay. So here is the result. In this figure, we show the function value reduction: the X axis is the training time and the Y axis is the relative function value difference to the optimal solution. The reference model is obtained by solving on the entire data several times until it converges, so then we have the optimal function value. Both the X axis and the Y axis here are [indiscernible] scale. You can see that if you just use LIBLINEAR to solve this model, it becomes very, very slow because of disk swapping. For the blue line here, we include all the time, including the initial split and compression of the data, to make the comparison fair.
The black line here is the block minimization method with Pegasos, and the blue line is the block minimization method with LIBLINEAR. So we are trying to argue that solving in the dual can be faster.
>>: [inaudible] did you get to the disk speed area?
>> Kai-Wei Chang: Disk speed? Sorry?
>>: Yeah, did you get to the disk speed area? So, you know, it's a matter also of implementation. Once you compare time, it's also a matter of implementation. You can take [indiscernible]. So here, there is a clear metric for that, which is the speed at which you read the data from the disk.
>> Kai-Wei Chang: Oh, yeah, yeah.
>>: For example in Pegasos, was the disk always at 100 percent, or was it 50 percent?
>> Kai-Wei Chang: Yeah, I forget the exact number for the disk speed, but we show it in the paper. Yeah. And also, again, the machine we used is pretty old, from 2009. So now [indiscernible], we can have a machine with a faster disk.
>>: You used the same machine?
>> Kai-Wei Chang: Yeah, we used the same machine for all of these, yeah.
>>: Your variable is the implementation?
>> Kai-Wei Chang: Yeah, yeah.
>>: This is kind of --
>> Kai-Wei Chang: Yeah. So here we also compare the result to the online learning package VW. VW has a very good implementation of the stochastic gradient descent kind of method, the online learning method. We also show that in this situation, our model gets to the final accuracy faster than VW. And here are some experiments that for --
>>: Was this vanilla [indiscernible], or was it VW with all the VW tricks like the adaptive learning rates and so on?
>> Kai-Wei Chang: Yeah, we used VW with, as you say, adaptive learning rates and those things, yeah. We also played with it a little bit, and also [indiscernible] to make the experiment more [indiscernible].
So here we show some other experimental results to confirm our theory. In this one, we show that if you accidentally get one [indiscernible] of data with [indiscernible], you will basically kill the convergence; you will converge very, very slowly. In the last one, we show the difference between block sizes. Here, the X axis is the training time and the Y axis is the [indiscernible] difference to the optimal solution. The red line is where we separate the data into a thousand blocks, the green line is where we separate the data into 400, this is 200, and this is 40. So this confirms that using a smaller number of blocks, that is, a larger data block, gives a better result.
So in conclusion, we have proposed a method that can handle data 16 times larger than memory. This paper won the best paper award at KDD 2010, and we have a more complete version in TKDD. We then found that this framework can be further extended, and that is why we propose another algorithm for the same task.
So after doing this work -- oh, sorry.
>>: [inaudible]. So is it possible to extend it to [indiscernible], so if you want to load the different blocks on different machines?
>> Kai-Wei Chang: Yeah, so I finished a paper this year arguing that if you have data larger than memory, then you should put it onto several machines, and then you can do some communication; this can avoid accessing the data from disk several times, because each machine can store approximately [indiscernible] in memory. But yeah, that's true, for some situations that might be better. We are still arguing that sometimes the communication cost might also be high and also --
>>: [inaudible] do you think you can extend it in some way to allow it to work on a [indiscernible]?
>> Kai-Wei Chang: Yeah, that's possible. Actually, I am thinking about that, but I don't have a solution yet. You can also think that if you have very big data but only a limited number of machines, then maybe the data on each machine still cannot fit into memory.
>>: [inaudible].
>> Kai-Wei Chang: Sorry?
>>: Would asynchronous updates based on W work with this method? Because you could --
>> Kai-Wei Chang: Yeah, you could do that in each situation, basically.
>>: I mean [indiscernible].
>> Kai-Wei Chang: Yeah, yeah. Yeah.
>>: Do you have [inaudible]?
>>: Likely, yes. If you're very sparse in asynchronous [indiscernible], yeah.
>> Kai-Wei Chang: So we are still thinking about how to reduce this disk access time. One way is to apply some compression to lessen the loading time, which we did before, and you can maybe also consider some compression like feature hashing to reduce the data size. But another way is that we find that if you can better utilize the memory during learning, then you can possibly reduce the number of iterations needed to get an accurate model, and then you use less time to access the data. Our algorithm mainly focuses on this second one. The idea is that if a sample is likely to be important, then we want to cache it in memory so that we can spend more time and more effort on those samples. If you're familiar with large margin methods, the intuition is that we want to cache the support vectors in memory, so that ideally you just need to do one pass of updates and you are done.
Based on this intuition, we propose a selective block minimization method, which we abbreviate as SBM. We can show that SBM has several good properties. The first one is that SBM can save disk access by using a smaller number of iterations. For example, if we use SBM on the webspam dataset, you only need to load the data from disk once and do the updates, and you can get nearly as accurate a result as solving the model exactly. As a result, the method is very efficient in this scenario.
The second property is that although our method caches data in memory, so the samples are selected non-uniformly, we can still prove that the method converges to the optimum.
Okay. So again, we use the linear SVM as an example, and again we can only use this method to solve the dual SVM, because we need the relationship between the dual variables and the data.
So what were we doing wrong in the block minimization method? Because we do a random split in the initial phase, the important data are spread across several blocks, and each block has only a small portion of important data and a large portion of unimportant data.
So we waste time and memory on the unimportant data. If we can collect all the important data into a cache, then we can do better updates. However, the problem is that we don't know which samples are important before we solve the model. So we need a way to find and cache those samples during the training process. Our solution is that we split the memory into two parts: in one part, we load a new data block, and the other part we use to cache samples. Then we can train our model on both parts and get a better model.
Here, I use an animation to show this algorithm. Suppose we are given a large set of training data, and we are trying to find a separator between the circles and the triangles. Here, we assume that the data points close to the final separator [indiscernible] are more important, and we [indiscernible].
On the left-hand side, we show the memory usage of the block minimization method, and on the right-hand side, we show the memory usage of the selective block minimization method. In the first step, both methods load a block of data into memory. For a [indiscernible] comparison, we assume the data block in SBM is smaller than in BM, because you need to reserve some memory for the cache. Then both methods train a binary classifier, and for SBM, you take those samples close to the current margin [indiscernible] and cache them in memory.
Then we do this again and again, so you cache more and more samples, and you also remove from the cache those samples that the model no longer thinks are important. After a few iterations, most of the important samples are already in the cache.
>>: [inaudible].
>> Kai-Wei Chang: Yeah, yeah, basically, that's true. Yeah. At this point, when you train the model using the data in memory, SBM can update the model using more important data, so it converges much faster. So this is the key algorithm here.
Here, I highlight the difference between the SBM and the block minimization
method.
>>: [indiscernible].
>> Kai-Wei Chang: Yeah, we --
>>: [inaudible].
>> Kai-Wei Chang: Yeah, that's true.
>>: Yeah, but in learning nonlinear SVM, right?
>> Kai-Wei Chang: Yes. Yeah, in learning nonlinear SVM, they already have a similar technique called [indiscernible], and yeah, it's kind of different, but these two methods are very related, yes.
>>: So the stuff that is in the cached set, does it ever go out of the cached set, or does it remain there?
>> Kai-Wei Chang: No, it will go out. You mean if the -- we will remove that sample from the cached set. Yeah, we will remove it.
>>: [inaudible].
>> Kai-Wei Chang: Yeah, yeah. So basically, after each iteration, we use our model to choose the samples from both parts. Yeah. So we are given an initial model, and at the beginning the cache is empty; then we loop until [indiscernible] convergence: at each step we load a block of data, train the model on both the block of data and the cache, and then we update our model and update our cache based on the current model.
So again, the sub-problem can be solved by any bound-constrained method, and here we use LIBLINEAR, which implements a coordinate descent method. And we can prove that convergence holds no matter how you select those cached samples; that is, you can even do a random selection of the cached samples and the method still converges.
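A minimal sketch of the SBM loop just described, assuming part of the memory budget holds the newly loaded block and the rest holds cached samples; the `solve_subproblem` and `score_samples` helpers are placeholders for the inner solver and the importance measure.

```python
import numpy as np

def selective_block_minimization(block_files, load_block, solve_subproblem,
                                 score_samples, cache_size, model,
                                 max_outer_iters=20):
    """SBM sketch: train on (new block + cached samples), then keep the
    samples with the smallest scores (e.g. closest to the margin) as the cache."""
    cache_X, cache_y = None, None
    for _ in range(max_outer_iters):
        for path in block_files:
            X_new, y_new = load_block(path)                    # load one block from disk
            if cache_X is not None:
                X = np.vstack([X_new, cache_X])
                y = np.concatenate([y_new, cache_y])
            else:
                X, y = X_new, y_new
            model = solve_subproblem(model, X, y)              # update model on block + cache
            scores = score_samples(model, X, y)                # smaller score = more important
            keep = np.argsort(scores)[:cache_size]             # refresh the cache
            cache_X, cache_y = X[keep], y[keep]
    return model
```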
>>: [inaudible] can't you drag the [indiscernible]?
>> Kai-Wei Chang: Sorry?
>>: If you choose the cache in an adversarial way, can you prove the convergence if the cache is chosen by [indiscernible]?
>> Kai-Wei Chang: Yeah, we can prove that, because you can consider that the cache is -- I mean, it's kind of similar to the variable selection you do in the nonlinear SVM.
>>: Can't you [indiscernible] those samples by margin? Would it affect the current weights? Is that --
>> Kai-Wei Chang: It doesn't matter, actually.
>>: That seems --
>>: Yeah, that seems like --
>> Kai-Wei Chang: No, sorry. That's because when you choose the data, it's kind of like you're choosing the variables to update. Because this is a coordinate descent method, the choice of variables is just [indiscernible], so you can choose different variables to update. But in the end, as long as you go through all the variables to do the updates, you will converge.
>>: It actually changes the distribution that you see. So I can decide to take some outliers and just put them in the cache and freeze the cache. Then [indiscernible] you always want to see these outliers, and therefore the distribution that you see will be skewed.
>> Kai-Wei Chang: No, because you update. So in that situation, the [indiscernible] will be -- if you just put it in again and again, then the [indiscernible] will get fixed and not updated. So then you will --
>>: [inaudible].
>>: So I see the main confusion is that this is not an online algorithm. It will go over the data many times. Because they keep the [indiscernible], eventually they will figure out that those examples are not important, so they will assign them zero weight. So basically, the distribution doesn't change if they have zero weight.
>>: [inaudible].
>>: Because the best [indiscernible] associated to [indiscernible]. So you will figure out that alpha is close to zero. So basically, even though they are there, you would not update your weights.
>>: [indiscernible].
>> Kai-Wei Chang: You won't update your weight, but the update will --
>>: Sorry, but the update will be zero.
>> Kai-Wei Chang: So regarding the cache, you can actually define this function and just keep the samples with the higher alpha. And here we define the caching function using the distance between the sample and the margin.
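As one concrete example of such a caching function (an assumption about the exact rule, which the paper defines precisely), the distance of each sample to the margin can be computed as follows, with smaller values meaning the sample is closer to the margin and more worth caching:

```python
import numpy as np

def margin_distance(w, X, y):
    """Distance of each sample from the SVM margin y_i * w^T x_i = 1."""
    return np.abs(y * (X @ w) - 1.0)
```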
So here are some implementation issues. One is that because we want to remove some unimportant samples from the cache, this could be done by copying the samples to another piece of memory and copying them back, but then you would need some extra memory. In the paper, we propose a way so that you don't need any extra memory to perform these operations. Another issue is that if the number of instances is too large, then the alpha vector will also be very large, and you cannot afford to store all the alphas in memory. In this situation, we can use a sparse implementation of alpha and store the alphas that correspond to data not in memory in some hash table or on disk, so that we can deal with larger datasets.
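A sketch of the sparse-alpha idea: keep only the nonzero dual variables in a hash table and treat missing entries as zero; this is an illustration, not the released implementation.

```python
class SparseAlpha:
    """Store only nonzero dual variables; entries default to zero."""
    def __init__(self):
        self._alpha = {}                 # instance index -> alpha value

    def get(self, i):
        return self._alpha.get(i, 0.0)

    def set(self, i, value):
        if value == 0.0:
            self._alpha.pop(i, None)     # drop zeros to keep the table small
        else:
            self._alpha[i] = value
```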
Okay. So the method can also be extended to solve several other linear models. Let me briefly discuss the relationship between this method and other methods. Here is a selective sampling algorithm proposed in the early 2000s. The way it works is that they first select a subset of data and train the model on that subset. Once they get a model, they go through the entire dataset, choose the data that is close to the current margin, put that into memory, and then use this data to update the model again, and they iterate this process. This method is very related to our method. However, in SBM, we load the data into memory not only to select the data, but also to use it for training, so we can avoid the overhead of selecting samples. Also, that method is a heuristic, but our method can be proved to converge.
Another relationship is between this method and online learning methods. Online learning is a popular way to deal with large data, but online learning algorithms usually only perform one update on a single instance. Because the update is simple, they may need a large number of iterations. Here are some online learning algorithms related to SBM. For example, Pegasos can also update on blocks of data, but as I said before, Pegasos can only do one update on that block; otherwise, the convergence will not be guaranteed.
There are also some methods that try to use this caching heuristic. For example, in this method, when they try to solve the [indiscernible] perceptron, they also do some caching. Again, they don't have a convergence guarantee.
So let me show you some experimental results. In our paper, we show experiments on these two big binary classification datasets and another multiclass classification dataset, and here we use a machine with two gigabytes of memory.
We compared these five methods. The first one is the proposed SBM method. The second one, just for interest, uses a random cache; we do not select the cache there. The third one is block minimization solving the dual form, which we showed in the first part of the talk. We also compared to another two methods that only use one sample at a time: one is the block minimization method with Pegasos, and the other is VW.
Again, here we show the convergence to the optimum: the Y axis is the relative function value difference to the optimum, and the X axis is the time. You can see that even if you do a random cache, our method converges faster than the previous block minimization method, and if you select the important samples, the convergence is much faster.
In terms of accuracy, the Y axis shows the difference to the best accuracy we can get on this dataset. So this is VW and this is Pegasos; those methods only update on one sample at a time. If you update on a block of samples, you can get an accurate model faster. And if you use our method, you can even get almost the best accuracy in the first iteration.
Here is a result where we analyze the effect of the cache size and the number of inner iterations on the convergence. The Y axis is again the relative function value difference, and the X axis is the time; the different lines show the results with different cache sizes. This one uses no cache, so it is basically the block minimization method. And these use a very small cache of only ten megabytes, then 100 megabytes, one gigabyte, and two gigabytes.
You can observe that using a bigger cache usually gives better results. But if you use a cache that is too large, then the method starts to cache some unimportant data, so the result is a little bit slower. And here is a result using three inner iterations, and this one uses [indiscernible] inner iterations. You can see that if you [indiscernible] too tight, the convergence also becomes a little bit slower.
>>: Does it correspond to, like, one gig? How many examples is that?
>> Kai-Wei Chang: Yeah, there are about 280,000 examples. And also, in this data, we show that one gig is better because in this dataset the size of the support vectors is about one gigabyte. So if you choose the size of the cache to be close to the size of the support vectors, then you will get a better result. But, of course, you cannot know that before you train your model.
Yeah, and we also used our method in a streaming situation. For some datasets, if the data is too large, it is difficult to process it several times, so people treat it as a stream and only do one pass over the data. Existing methods, like online learning methods, update on one sample point at a time, and we argue that in this situation a single pass with single-sample updates is typically not enough.
However, if we apply SBM here, we can get a better result. The advantage is that, first, SBM can specify the size of the cache and also the time spent solving the inner iterations. So depending on your needs, you can define the amount of memory and the learning time that you want to spend on this method, so that we can fully utilize the available resources. And second, we argue that the performance of this method is more stable, because you are updating not only on one sample but on a set of samples.
Here, we show the time when we only go through the dataset once, and we assume the data is stored in ASCII format. The I/O time is the time to load the data from the ASCII file, the learning time is the time to learn the model using the data in memory, and the total time is basically just the sum of the first two. Because VW has a very good implementation where they load the data and learn at the same time, there is no way to separate the I/O time and the learning time, so we only show the total time for it. We get a reference model by running LIBLINEAR on the data several times until it converges; the final accuracy is about 99.55.
These two methods only take one update on one instance at a time, so the learning time is very small; however, the I/O time is very large. So we are arguing that you spend the time to do the I/O, but you do not use that time to do the learning, and they get a suboptimal result. With SBM, we take more time to do the learning, so we can get a better result, even though it only loads the data from disk once.
Okay. So in conclusion, we propose an SBM method for training large-scale linear classification. This method can be extended to several other formulations, such as the Crammer and Singer formulation. We release the experiment code here, and we also implement this method in a branch of LIBLINEAR, which you can download here.
Let me also talk a little bit about my current work at UIUC. Here I have been looking at distributed training strategies for binary and structural SVMs, and I am also doing an application on coreference resolution. This is the coreference resolution problem: given an article, we try to find the noun phrases that refer to the same entity. So here, 'the president Bill Clinton', 'his', and 'the president' are co-referring. We are trying to use structured learning and latent structured learning methods to solve these tasks. So this is my talk. Thank you very much.
>>: You talked mostly about the SBM. Regression problems, [inaudible]?
>> Kai-Wei Chang: So recently, my previous professor proposed a method to use the dual [indiscernible] method to solve [indiscernible] regression. I haven't done this experiment, but I guess this method can also apply to --
>>: In the previous case, you won't be able to [inaudible].
>> Kai-Wei Chang: Yeah.
>>: It's not always hard to do that. [inaudible].
>> Kai-Wei Chang: Yeah, it's not.
>>: It's not constant [inaudible].
>> Kai-Wei Chang: Yeah.
>> Kai-Wei Chang: In regression, if you use [indiscernible], there is also some notion of a support vector. Yeah. You also have support vectors there.
>>: [inaudible].
>> Kai-Wei Chang: Oh, yeah, yeah. I don't have the concept on.
>>: [inaudible].
>>: If I understand correctly, you are basically only using [indiscernible] to solve a dual problem. So basically, there is an outer iteration, which is over a bigger [indiscernible] set, and an inner iteration, which is a smaller coordinate descent. You say you can prove convergence. Do you assume any number of inner iterations in order to guarantee the convergence, or how accurately do you need to solve the inner problem in order to guarantee the convergence?
>> Kai-Wei Chang: Yeah, so as I said before, one thing is that the inner solver needs to stop at some point. If you can guarantee that the inner [indiscernible] stops within some number of iterations, then basically you can prove the convergence. The idea is that you cannot [indiscernible] one sub-problem forever; otherwise, you will not see the samples in the other blocks. So if you can guarantee that, for a given maximum number of iterations, the inner [indiscernible] stops, then you can prove the convergence.
>>: So if you have a maximal number of iterations for the inner problems, can you solve the inner problem exactly within that number of iterations?
>> Kai-Wei Chang: No, no, I'm not claiming that.
>>: So you need to get an approximate?
>>: You need to get an approximate [indiscernible] and you say, I stop here, because I need to look at the other samples.
>>: As long as you can solve the inner problem accurately enough, you can guarantee convergence?
>> Kai-Wei Chang: Yeah, yeah.
>>: So it's just a guarantee of convergence. You don't know anything about the convergence rate?
>> Kai-Wei Chang: Oh, no. The convergence rate will also be linear. The asymptotic convergence rate --
>>: Linear convergence?
>> Kai-Wei Chang: Yeah, linear convergence.
>>: For the linear convergence rate, though, you need to solve the sub-problems exactly, right?
>> Kai-Wei Chang: No, you don't need to do that. Okay, so you can consider that if the data is all in memory, then you can use the [indiscernible] method, and this method has been proven to converge linearly to the optimum.
>>: Which paper?
>> Kai-Wei Chang: The ICML paper we had in 2008.
>>: [indiscernible] rate, right? You use the [indiscernible] analysis, basically?
>> Kai-Wei Chang: Huh?
>>: You've seen, after a certain number of iterations, a landing site of all of them?
>> Kai-Wei Chang: Yeah, it is an asymptotic convergence rate, not a [indiscernible], yeah. In our ICML 2008 paper.
>>: [inaudible].
>> Kai-Wei Chang: No, I mean, they say that --
>>: [inaudible] optimization standard. That's the standard.
>>: Here, the blocks sort of act together, since you have the cache?
>> Kai-Wei Chang: Yeah.
>>: The data and this cache will be updated next time. So it's different from the traditional block setting?
>>: Basically, the blocks do not overlap.
>> Kai-Wei Chang: So, for instance [indiscernible], you do the [indiscernible]. So you can select [indiscernible] to update, and in that situation you may also choose the same variable several times during your updates, and you can become [indiscernible]. So here you just consider a [indiscernible] for a variable [indiscernible] to select a variable from this part of the data, so the convergence can occur. That is the big idea. If you want to see the details, you can see our paper.
>>: From shrinking, the idea is you end up with the shrink set. And once you have the shrink set, you go on --
>> Kai-Wei Chang: Yeah.
>>: So do you have something similar for your cache? From what I understood, you continuously load the second part as well?
>> Kai-Wei Chang: Yeah, yeah.
>>: Do you stop doing that at some point in time and just work on the cache, or do you continue, so on every iteration you will load some block from disk into the second part and then you'll update your cache and go on with it?
>> Kai-Wei Chang: Yes, it's the second one, yes.
>>: So it's not exactly [indiscernible].
>> Kai-Wei Chang: Yeah. So the thing is that, actually, at first I wanted to do shrinking in this scenario, but it's very hard to do that. And I came up with this [indiscernible]; basically it's doing the shrinking another way. Shrinking basically removes data from consideration; here, we add data into consideration. Yeah.
>>: So I had a question about one of the graphs that you had back earlier, showing the progress of the SVM. It looks like, if you were to look at time, the distance to your optimal solution is monotonically decreasing. Is that true? And if so, can you explain why it monotonically decreases? Because if you have, like, what they were talking about earlier with -- this slide. It looks like it monotonically decreases.
>> Kai-Wei Chang: Uh-huh.
>>: What if you select bad examples in your cache at first?
>> Kai-Wei Chang: So here, I'm showing the dual [indiscernible] function value, and it is proven that this will monotonically decrease. At each step we are solving a sub-problem, and the sub-problem is a minimization problem in which keeping the current values, that is, a zero update, is one feasible solution; since we find the minimum of this sub-problem, we always reduce the dual [indiscernible] function value. In the case that you put [indiscernible] examples in the cache, if they are selected, then their alphas will become fixed numbers, so they will not affect the objective function any more after a few iterations.
>> Ming-Wei Chang:
Thank you.