>> Dengyong Zhou: Hi. It's my great pleasure to have Professor Le Song. Le is currently an
assistant professor at the Department of Computer Science and Engineering, School of
Computing, Georgia Institute of Technology. Le has a PhD in Computer Science from the
University of Sydney in 2008 and then he conducted his postdoc research at School of
Computer Science, Carnegie Mellon University from 2008 to 2011. His basic interests include
nonparametric kernel methods, probabilistic graphical models and dynamics of networked
processes and also its applications. He received an NSF CAREER Award in 2014. Please.
>> Le Song: Thanks everyone for coming. I'm going to talk about kernel methods and how to
make them scalable to millions of data points, or even tens of millions of data points. The
motivation for this work is that we try to catch up to deep [indiscernible] with kernel methods. The
motivating problem is this gigantic image classification problem. You have millions of
labeled examples, and the number of classes is also very big, so 1000 classes is just a subset
of the classes; there are even more. What we want to do is take an image and
predict its label. The label can be just "mushroom", things like this, and what is shown
here is the label predicted by some algorithm. You might predict a total of five labels, and then
some label has the highest probability. The prediction shown here is actually produced by a neural net,
and it's doing a pretty good job. Sometimes the correct label is not in the top place but
it might be in second place, for instance like this.
So the best performing algorithm, as I mentioned, is deep neural nets. It's a very
complicated model. It has multiple layers, but the layers are structured in such a way that you
see patterns. The first five layers are these convolution-pooling layers, highly structured processing
units that I'm going to explain a little bit more. The next two layers are
fully connected layers, which try to perform some kind of nonlinear transformation. And
the last layer is a multiclass logistic regression.
So what is in this convolution max pooling layer? It performs [indiscernible] very simple
operations. Essentially the convolution layer just takes some convolution kernel, which is
different from the [indiscernible] I'm going to talk about later; it's also called a kernel. You take
this small template, which has numbers in it, perform a weighted sum over
your original image, and move this window across the entire image to produce another image.
Very often in neural nets you also apply this max operation after you finish the
convolution. That's called a rectified linear unit; you also perform a nonlinear threshold type of
thing.
So essentially the first five layers perform these types of
operations: you convolve the image with some templates, you perform that nonlinear
operation, and you also do max pooling, which I'm going to explain. The max pooling is something even
simpler. It essentially reduces the resolution of the image: you just take a small patch, look
at the values in that small patch, and return the maximum value in that small patch.
That's the operation. That operation is typically used to reduce [indiscernible] the image.
You interleave between this convolution and max pooling for many layers, five layers, and that's
the best performing model. After that you have two of these fully connected
layers. Each one just takes the input image as a vector, performs a weighted combination of
everything in that vector, and produces an output. You also apply these max
[indiscernible] linear units, this operation. You do it twice in a hierarchical fashion, in a
highly mysterious way. And then afterwards you put a logistic regression on top of it. Essentially
you try to learn this weight W: you do a weighted combination of the input vector and then
you push it through an exponential function, which gives you something related to the
probability of predicting label Y. So this is the model.
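To make the operations concrete, here is a minimal sketch in Python/NumPy of the building blocks just described: convolution with a small template, the max (rectified linear) nonlinearity, max pooling, and a multiclass logistic regression on top. The shapes, strides, and filter values are made up for illustration; this is not the speaker's actual implementation.

    import numpy as np

    def conv2d(image, template, stride=1):
        """Slide a small template over the image and take weighted sums (valid convolution)."""
        H, W = image.shape
        k = template.shape[0]
        out = np.zeros(((H - k) // stride + 1, (W - k) // stride + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
                out[i, j] = np.sum(patch * template)
        return out

    def relu(x):
        """Rectified linear unit: the max-with-zero nonlinearity applied after convolution."""
        return np.maximum(x, 0)

    def max_pool(image, size=3, stride=2):
        """Return the maximum value in each small patch, reducing the resolution."""
        H, W = image.shape
        out = np.zeros(((H - size) // stride + 1, (W - size) // stride + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = image[i * stride:i * stride + size, j * stride:j * stride + size].max()
        return out

    def softmax_probs(features, W):
        """Multiclass logistic regression: weighted combination pushed through the exponential."""
        scores = features @ W
        exp = np.exp(scores - scores.max())
        return exp / exp.sum()

    # Toy usage: one channel, one filter, one pooling step.
    img = np.random.rand(16, 16)
    template = np.random.randn(3, 3)
    feat = max_pool(relu(conv2d(img, template)))
    probs = softmax_probs(feat.ravel(), np.random.randn(feat.size, 10))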
>>: So can you explain to us again this max pooling operation?
>> Le Song: The max pooling operation just takes a small patch in your original image and then
returns the maximum value in that small patch as a new pixel.
>>: And so you are saying that I'm doing convolution and then max pooling?
>> Le Song: Yeah. You interleave between them.
>>: So when you say interleave you mean that one layer is doing-
>> Le Song: So one layer is the convolution and then [indiscernible] follows that with max pooling. Sometimes they may skip this
step, by model selection or whatever; I don't know why. So it's such a highly structured model,
doing lots of max operations, and in the end you push it through this multiclass logistic regression. It's highly
structured; different layers have different functions, but the goal of this model is to do this gigantic
nonlinear classification task. This is just a detail about the model, the parameters. When you
have a convolution you need to specify these filters, the kernels; this is the way they sort of specify
those parameters using convolution and max pooling, things like this.
You have the original image, which is 224 by 224 with three color channels. The kernel
size is going to be 7 by 7. You slide it across the image, and rather than sliding one pixel at a time
you jump by 2, and you use 96 of these filters, so you get an image of 110 by 110 by 96
after the first layer of convolution. Then you do max pooling, reduce the spatial resolution, and
get something like 55 by 55 by pooling in a 3 by 3 window. So that's how they specified this
model. Then you do convolution again, a 5 by 5 window with [indiscernible] 2, and you use
256 of them and you get a smaller image. You do max pooling, convolution again, convolution,
convolution, and here you skip the max pooling operation, and here is another convolution and
max pooling, which pushes through a fully connected layer here and another fully connected layer and then
the multiclass logistic regression.
So if you look at this very complicated model, the convolution actually consists of five layers, but
the total number of parameters in the convolution layers is only 3.7 million. The
majority of the parameters actually lie in the last three layers, the fully connected layers and the
multiclass logistic regression, which have 58 million parameters. So essentially, given this gigantic
image data set, you want to learn all these 60 million parameters. That's the problem.
>>: Do you happen to know how they came up with this particular set of numbers?
>> Le Song: So actually they have a nice package you can just specify the configuration for
these layers.
>>: How do they come up with 7 by 7 and not-
>> Le Song: I think model selection over 10 years or maybe 20 years. That's my impression.
You can experiment with the configuration by specifying a small configuration file, so they can
just search over architectures. This convolution makes sense in some sense [indiscernible]
image features. I don't quite understand the fully connected layers. I understand the
multiclass logistic regression. So we all try to understand: when we put all these things
together, what's going to happen? Why is it doing so well? So we try to understand. Yeah.
>>: [inaudible] 96 that’s the number of filters you use?
>> Le Song: Number of filters. 96 filters. That's why you get an image with something like 96 channels.
>>: Basically you reduce some image but you increase the number of filters?
>> Le Song: That's right. So when you do convolution, for instance in this layer with a 5 by 5 filter, it
can [indiscernible] across different color channels. So some [indiscernible] look across 96
different channels. It's something like this. So they managed to get this working so
nicely.
When you try to learn these parameters, the algorithm that has been used is very simple. The
driving force behind it is really stochastic gradient descent. You take a small batch of
data, compute a gradient, and then you use a [indiscernible] derivative and get an update
for all these parameters. You have a gigantic nonconvex optimization problem; you just take
the gradient of the objective with respect to the parameters and use the chain rule to update them.
But when you train it you also generate something called virtual images. You're not
just using the original data set of 1.3 million data points; you actually generate more images
by randomly cropping the image, mirroring it, things like that. You do some transformation of the
image and generate something like the original data set but slightly different.
So along the way the original data set is 1.3 million, but you actually generate up to 100 million
data points, and it takes a week on a GPU to train this model. You get this performance: if you
take the top one predicted label and compare it to the actual label, then
the error is like 42 percent. That's the top-one classification error.
>>: [inaudible]?
>> Le Song: So I should explain that at this point many adjustments [indiscernible]. The way they do it
is they set the learning rate to some constant, and then after some point they see the
test error [indiscernible] flatten, then they drop the learning rate and you get a sudden
decrease in the error; after some point it flattens again and you decrease, until after three of these
adjustments you don't see any improvement anymore. So this is something that if you don't try
it you won't realize how they actually train.
>>: How do you manually adjust?
>> Le Song: So you just watch this [indiscernible]. You can basically estimate the change, I guess,
if you want to do it automatically.
>>: What's the training error?
>> Le Song: Training error is going to be like 30 or 20. There's a huge over[indiscernible] there.
>>: And training error for [inaudible]?
>> Le Song: Top five. Nowadays the best-performing model can be below 10 percent. It's very
accurate; it's close to human judgment. So this is the state of this model, and actually this
model raises lots of unsolved questions as well. Of course it performs very well in many
applications, speech and image; the question is, why this special kind of architecture? Maybe
there's some special characteristic of the data which is particularly suitable for this type of
architecture. It's not clear what kind of assumption you can make about the data such that this
[indiscernible] is really good. I haven't seen any principled theoretical work on this question.
Also, instead of five layers of convolution and max pooling, can we use other architectures or operations to
extract these features? It's not clear whether there is any alternative, simpler operation which
you could also use. Then of course, why three layers of fully connected nonlinear units? What you
need is a nonlinear transformation. Why not four? Is there a way to do it in a shallower
fashion? Can you use an alternative nonlinear classifier if you just want nonlinear
classification?
So in this talk I'm going to essentially explore these last two questions. I'm also very interested in
the top two, but for now I don't have anything concrete on them. I'm going to use kernel
methods to explore these two questions a little bit. I'm going to try to replace these
three layers, the fully connected nonlinear units and the multiclass logistic regression, by a kernel
method and see whether I'm able to achieve comparable results. If I'm able to do that, maybe this part
really is not necessary; you could just replace it by a traditional nonlinear nonparametric method,
and what's maybe really useful is the convolution max pooling layers.
So essentially the kernel method uses this type of kernel, a positive semidefinite kernel.
It's essentially a function taking two arguments. If you have a data set of fixed size M, you can
form this kernel matrix, and the matrix has to be positive semidefinite. It's this special type of kernel.
It's different from the smoothing kernel in statistics; it's different from the convolution
kernel. It's this type of function, a kernel function. So essentially I'm going to estimate some
nonlinear classification functions; I'm trying to do multiclass nonlinear classification. I'm going
to restrict my classification function to be in this space of functions, the space of functions
spanned by these kernels: I fix one argument and trace out a class of functions using these
data points, from a particular space.
When you have this positive semidefinite kernel, the nice property is that the kernel function itself
lies in that space, and the inner product between two kernel functions
gives you the kernel value. And if you have a function in that space, then you can evaluate
that function by just performing an inner product; that is a nice property of the space. For instance,
some familiar kernel functions: polynomial kernels, so this kernel is definitely not in the
smoothing kernel family; and the Gaussian kernel, which is also a smoothing kernel in statistics, is
richer. So essentially, for this particular kernel function, if you have a function in that space which
is a linear combination of these kernels, you would get a function of this shape. For instance, you
can represent highly nonlinear functions just by a linear combination of kernels sitting on each
individual data point.
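For reference, the standard RKHS facts being described here can be written out as follows (this is the usual notation, not copied from the slides):

    \langle k(x,\cdot),\, k(x',\cdot) \rangle_{\mathcal{H}} = k(x, x'), \qquad
    f(\cdot) = \sum_{i=1}^{M} \alpha_i\, k(x_i, \cdot), \qquad
    f(x) = \langle f,\, k(x,\cdot) \rangle_{\mathcal{H}} .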
So this is just a brief introduction to this type of kernel function and the space of functions spanned
by these kernels. Intuitively, you can think about the kernel method as transforming the
data. For instance, here is a binary classification problem: you have a negative class and a
positive class and you want a nonlinear classifier. The kernel function is going to transform
the data to a new space, potentially a three-dimensional space, and in that space you try to find a
linear relationship.
Many kernel methods have this intuition behind them, and many kernel methods can
actually be formulated as this optimization problem: you try to find some function F in that space that
minimizes some kind of expected loss. L is the loss function; you have data X and Y, where Y is
the label, generated from some distribution. You try to minimize this expected loss
subject to a constraint that the [indiscernible] of the function is bounded. An
equivalent formulation of this optimization problem is to move the constraint into the
objective function, and then this nu is something like a regularization parameter; the beta and
nu are related. By choosing different loss functions you get different algorithms: you get
kernel logistic regression if you choose this loss function, and you get [indiscernible] with this one.
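A schematic of the two formulations being described, in the notation used above (the exact constants on the slide may differ; this is just the standard form):

    \min_{f \in \mathcal{H}} \; \mathbb{E}_{(x,y)}\!\left[ \ell\big(f(x), y\big) \right]
    \quad \text{s.t.} \quad \|f\|_{\mathcal{H}}^2 \le \beta
    \qquad \Longleftrightarrow \qquad
    \min_{f \in \mathcal{H}} \; \mathbb{E}_{(x,y)}\!\left[ \ell\big(f(x), y\big) \right]
    + \frac{\nu}{2}\, \|f\|_{\mathcal{H}}^2 .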
So typically for kernel methods you solve it in the [inaudible]. Essentially you use the so-called
representer theorem. You have M data points; you replace the expected loss by
the empirical loss, and then it turns out the solution of the optimization problem is going to
have a form like this: it's just a weighted combination of the kernel function applied on the
training data points. Originally you are optimizing over some function in a function space that could be
infinite dimensional, but given the representer theorem you can substitute
that form into the original optimization problem and then just optimize over this alpha
instead. Alpha is in R^M; it's not F, it's just a finite number of alphas.
So you solve this optimization problem, but the problem is, if you look carefully, you have
this kind of double sum in the objective function. That means for each pair of data points you have to
evaluate the kernel function, and that creates a lot of problems. Think about M as 1 million.
If you want to evaluate the pairwise kernel function you essentially need to fill in the
entries of this big matrix, and one million by one million costs memory on the order of ten to the
twelfth, and the computation is generally M squared times D, where D is the dimension of the original data.
So it's huge. You have to come up with a way to scale this kernel method up.
There has been a lot of effort already, for instance, approaches based on low-rank
decomposition of this kernel matrix. You take the kernel matrix, you don't compute every
entry, and you have a way to incrementally approximate the kernel matrix by some low-rank
factors. So this method, [indiscernible] method [indiscernible], usually has
computation linear in the number of data points, quadratic in the rank T you
choose, and linear in D, the dimension of the data. And the storage is just M times T. But if you look at the
generalization ability after you do this low-rank approximation, plug it into your optimization
algorithm, and get your function, the best you can
prove without any further assumptions is a bound comparing the expected loss
of the best function in the family to that of the function produced by this low-rank
decomposition, and the difference is going to be one over
square root of T, the rank, plus one over square root of the number of data points. If you want to get the best
out of your data you have to match these two terms, and that means that T has to be on
the order of M. So you get M cubed kind of computation here, and then you get M squared
memory consumption again.
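Written out, the matching argument just described looks roughly like this (schematic, with constants and the dependence on D suppressed):

    \text{gen.\ error} \;\lesssim\; \frac{1}{\sqrt{T}} + \frac{1}{\sqrt{M}}
    \;\;\Rightarrow\;\;
    T = \Theta(M)
    \;\;\Rightarrow\;\;
    \text{computation } O(M T^{2}) = O(M^{3}), \qquad \text{memory } O(M T) = O(M^{2}).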
In practice you also find this is a problem. If you do some low-rank
decomposition fixed at some small rank, you find that the classifier always loses some
performance compared to the case where you optimize directly with respect to the full kernel
matrix. That's what you observe, because in theory you need to match these two things
up.
Recently people have been looking more into this so-called random feature approximation of the
kernel function. Essentially there's an interesting relationship between a
positive semidefinite kernel function and some random processes. If you
have this kind of function you can always express it in an integral form: you have
some random variable omega, which follows a distribution P(omega), and then the kernel function can be
written as a random function indexed by this omega applied on your X, times the same
random function applied on X prime, integrated with respect to this
omega. So you can also write it in this expectation form.
For some kernel functions you can find this distribution P(omega) in closed form and you
can find this phi_omega in closed form, but you can also go from the other direction:
you can pick whatever nonlinear function phi_omega you want, pick some
distribution you want, and just define the kernel this way. It's going to give you a [indiscernible]
positive semidefinite kernel. So you can go in both directions, but people have worked out
closed-form [indiscernible] expressions for some well-known kernel functions.
For instance, for the Gaussian [indiscernible] kernel the form is like this. Delta is the difference
between X and X prime. The random function phi_omega is just going to be something like cosine of
omega transpose X plus tau, where omega follows a Gaussian distribution and tau follows a
uniform distribution, actually a uniform distribution between, I think, zero and two pi, something like
that. And if you have a Laplacian kernel, or any kernel in this family of translation-invariant kernels, you
have some nice solution. For the Laplacian kernel the corresponding distribution of omega is the Cauchy
distribution, and if you have a Cauchy kernel then the [indiscernible] distribution is the Laplacian distribution.
There's a nice relationship between them.
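Here is a minimal sketch of the random Fourier feature construction just described, for the Gaussian (RBF) kernel. The bandwidth, dimensions, and number of features are arbitrary choices for illustration, not values from the talk.

    import numpy as np

    def random_fourier_features(X, num_features, bandwidth, seed=0):
        """Map X (n x d) to features whose inner products approximate the Gaussian kernel
        k(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2))."""
        rng = np.random.RandomState(seed)
        n, d = X.shape
        # omega ~ Gaussian, tau ~ Uniform[0, 2*pi], as described in the talk.
        omega = rng.randn(d, num_features) / bandwidth
        tau = rng.uniform(0, 2 * np.pi, size=num_features)
        return np.sqrt(2.0 / num_features) * np.cos(X @ omega + tau)

    # Sanity check: feature inner products approximate the exact kernel.
    rng = np.random.RandomState(1)
    X = rng.randn(5, 3)
    Z = random_fourier_features(X, num_features=5000, bandwidth=1.0)
    approx = Z @ Z.T
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    exact = np.exp(-sq_dists / 2.0)
    print(np.abs(approx - exact).max())  # small for large num_features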
Comparing this type of random feature approximation to the low-rank matrix
approximation approach, there is already some advantage. Essentially what you do is draw
some random feature parameters omega, and then you
approximate the kernel function by the average of these random features: instead of the
expectation you just draw the omegas randomly from P(omega) and approximate it by a simple
average. The computation is simpler. Choosing [indiscernible] for the moment, suppose you
have D there. Essentially you just need to apply each one of these random feature functions to
each one of your data points. So if you have T random features it's going to be T times D for
each data point, and you have M data points; that's the operation. The log D is because I'm
going to [indiscernible] some tricks for efficient matrix-vector multiplication and get log D
scaling. The memory is still order M times T. So essentially you can also form this matrix:
instead of using a low-rank matrix factorization you just compute it directly. You apply the random
functions to a data point and you get a number for each one of these random features.
So again, once you have this low-rank factor A you solve this T-dimensional problem. And again,
if you want to prove something you will find that the generalization ability is going to be 1 over
square root of T plus 1 over square root of M. Again, you want to balance the two things.
>>: I'm missing something on the computation part. Even if I just want to write down this
approximate matrix I'm not getting the true [inaudible] including the formation. Since it is
size M by M then it's on the order of M squared.
>> Le Song: Yeah. That's right. So you don't explicitly instantiate this matrix. You
just keep this low-rank factor A and work with it as if your data came from a
T-dimensional space. So you're going to work with this matrix.
>>: But my algorithm is going to access all the entries you need.
>> Le Song: So your algorithm is-
>>: So in some sense the optimization algorithm is going to stay M squared regardless of-
>> Le Song: So before you run the optimizer you do this preprocessing, and that preprocessing
actually doesn't need to access the entries of that matrix. It works directly with the data points.
The same thing for this [indiscernible]; you don't actually go through every entry in this matrix.
That's how you get the M T squared kind of computation; otherwise you cannot avoid M squared operations. You just
use a few entries in this matrix to come up with an approximation. So in this case you directly
apply the random features to your data points to get this [indiscernible]. So again you need to
match the two.
>>: So how tight is this [inaudible] bound? Imagine your data lie in a lower dimension-
>> Le Song: In that case you're lucky. The low-rank approximation will work really nicely,
and if you incorporate that low-rank knowledge you might get better bounds. Here
I'm not making any assumption on the data. If you want to do something fully nonparametric,
that's the kind of bound you will get. With a low-rank assumption you will possibly get better
results, better theoretical guarantees. That's the problem.
And then what I'm going to do is look into a scalable algorithm which has been
applied in many other places, stochastic gradient descent, but I'm going to add another
layer of randomness to this algorithm to make it scalable for the kernel case. So first I'm going
to show you why traditional stochastic gradient descent is not good enough for kernel methods. I'm
going to directly optimize this function in the [indiscernible]. I'm going to use something called
the functional gradient instead of a gradient over some finite dimensional vector; it's just a
generalization of that.
Essentially you perturb the function by some epsilon in the direction of G, and you look at
the change; you can express that change as an inner product, and the thing in front of this G is going to
be your functional gradient. For instance, if your function is F evaluated at X, as here, you take the gradient with
respect to the function; then using the reproducing property in this space you can express the
evaluation like this, and taking the gradient of this linear functional you get this term. If you
have the squared norm in the [indiscernible] you get a functional gradient like this, two times F. If you think
about the function as a vector then everything seems very natural.
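For reference, the two functional gradients just mentioned, written in the RKHS notation from before (these are standard identities, not copied from the slide):

    \nabla_f \, f(x) = \nabla_f \, \langle f, k(x,\cdot) \rangle_{\mathcal{H}} = k(x,\cdot),
    \qquad
    \nabla_f \, \|f\|_{\mathcal{H}}^2 = 2 f .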
So essentially, for this expected loss, when you take the gradient you apply the chain rule once: you take
the gradient of the loss function with respect to F(X), then take the gradient of F(X) with respect to F,
and you get this additional term here, which comes from the squared norm. The expectation
can be exchanged with the operation of taking the gradient, so that's what you get. So essentially,
for many of these kernel methods formulated as convex optimization you can take the gradient, it
has a form like this, and then what you can do is take a subset of data points, one
data point in the simplest case, but you can take a mini-batch, and then update your function using the
stochastic gradient computed from individual data points.
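Putting those pieces together, the stochastic functional gradient update being described has roughly this form, with (x_i, y_i) a randomly sampled data point (the step-size notation is mine):

    \hat{\xi}(\cdot) = \ell'\!\big(f(x_i), y_i\big)\, k(x_i, \cdot) + \nu f(\cdot),
    \qquad
    f_{t+1} = f_t - \gamma_t\, \hat{\xi} .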
So in the end you'll find that your function is a weighted combination of the data
points you have seen so far, and the number of terms in this summation can be equal to the
number of data points. If you apply standard analysis for stochastic gradient descent or
mirror descent you will get this type of rate, 1 over square root of T. In this case the
number of iterations T matches up with the number of data points you see, so you get this rate. But
the problem with this approach is that you need to remember all these training points. If you want to
evaluate the function at a new test point you have to keep all the training points: you plug the
new test point into the kernel with each training point and then do a weighted sum. You
cannot throw away those training points in general.
>>: So is this T the same as the T before the [inaudible] points?
>> Le Song: I tried to make them the same. So this T you can think about as the rank. It's not
the same parameter but they're comparable in some sense. The T iterations here play the
same role as the rank in the previous slides.
>>: So I don't understand the comment you need to remember all points. You just need to
remember T points.
>> Le Song: That's right. You need to remember T points. So suppose you want to get one
over square root M generalization ability then you need to essentially remember M points. And
you need to remember the points you have seen so far.
>>: Where is function [inaudible] gradient which is [inaudible]?
>> Le Song: It's more like-
>>: Where do you use [inaudible]?
>> Le Song: So essentially I can plug this into the [indiscernible] equation and then
derive this result. For instance, you plug this F here into this R, then F plus epsilon G,
then you just look at the difference, divide it by epsilon, and you will
get this. So you can think about it just as a vector. It's a vector.
>>: Where is G here? So for this one-
>> Le Song: G is just here, but your directional derivative is just the part in front of G. It doesn't
involve the G.
>>: So for here that's the part-
>> Le Song: Here it just gives you the final result. I didn't go through that step; I just gave you the
final result of taking this directional derivative.
>>: [inaudible]?
>> Le Song: Some G [indiscernible] but it doesn't matter. Usually for this simple function that
direction doesn't matter.
>>: I'm just trying to understand the [inaudible] the next slide. So you're saying I need to
choose T to be on the order of M for the generalization I want, but then the number of points
that I need to remember is just the support. It's not going to be everything.
>> Le Song: For instance, for [indiscernible] you might get a sparse solution where some of the alphas are zero
and then you can-
>>: If you choose the kernel correctly most of them would be zero.
>> Le Song: For [indiscernible] that might happen, and then hopefully the number
of support vectors is orders of magnitude smaller than the actual data. But for kernel ridge regression
the solution is dense, and for many of these loss functions the solution may be dense; and
also for support vector machines there's generally no guarantee that the number of support
vectors is orders of magnitude smaller than the number of training points. Especially for this highly
nonlinear case you'll find that you actually need lots of support vectors. That happens typically
in practice.
>>: The number of support usually grows linearly.
>> Le Song: If it's growing linearly then it's the same order. Maybe the constant factor is small, but
it's still linear. So this is the key idea: how do we deal with these [indiscernible], these
training points? You just make a second stochastic approximation. Your stochastic
gradient already approximates the gradient using data points; now, since I know this duality between the
kernel function and random features, I'm going to sample some random features. In this
particular case I just sample one, but you can sample a mini-batch as well. You
sample these random features and approximate the kernel function by this product of random
functions.
The advantage of doing this is, of course, that after you get this doubly stochastic gradient you just
move your function along it, and what you
will find is that the final function you get is a weighted combination of these random functions.
You know exactly what form this function has, so you only need to plug in your
test point X, evaluate the random functions at the test X, and do a weighted
combination; you will get your prediction. You don't need to remember all these points
anymore. The reason is that you just need to incorporate the evaluation of the random functions on
the training points into this weight alpha.
So in the end the training point might be 1 million dimensional, but you just need to
evaluate it on this random feature, and if that's a single number, it gets incorporated into a single
number. So if you go through 1 million data points you just need to remember 1 million numbers,
the alphas. And for the random functions you know the form and you know which distribution
you're sampling from; typically we sample, for instance, from some pseudorandom
number generator. If I know the seed I can re-instantiate the random number. I don't
need to actually store this omega that I drew from the distribution; I just remember the seed. The next
time I need to use that particular random feature I redraw it using the same
seed.
So the algorithm is actually very simple; it just keeps updating: draw some data points, draw some
random features, and keep updating this alpha. The algorithm can be summarized in one
slide. In the end the form of the function is going to be a weighted combination of the random
functions you have drawn. So essentially you sample some data points and sample some random
features using a particular seed corresponding to the iteration; then you can re-instantiate the
random features very easily at test time.
At the current iteration you have already got a bunch of alphas and a bunch of random
features. If you want to evaluate this function F at a new test point you need to pair the random
features with the alphas you have, so the evaluation of this function re-instantiates each
random feature, applies the random feature to the new test point, weights it by the
alpha you learned before, and accumulates this.
And because of the [indiscernible] and the regularization parameter you see some modification of all
the other alpha_j for j less than or equal to i; you try to forget a little bit about the alphas you had before.
For the current alpha you're going to update, you just use this doubly stochastic gradient.
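Here is a minimal sketch of the doubly stochastic gradient procedure just described, for squared loss and a Gaussian kernel via random Fourier features. The step-size schedule, loss, and batch sizes are illustrative choices, not the exact ones from the talk; the key points it mirrors are the seeded re-instantiation of random features and the fact that only the alphas are stored.

    import numpy as np

    def feature(x, seed, d):
        """Re-instantiate the random feature for a given iteration from its seed
        (omega ~ Gaussian, tau ~ Uniform[0, 2*pi]), then evaluate it at x."""
        rng = np.random.RandomState(seed)
        omega = rng.randn(d)
        tau = rng.uniform(0, 2 * np.pi)
        return np.sqrt(2.0) * np.cos(x @ omega + tau)

    def predict(x, alphas, d):
        """f(x) is a weighted combination of the random functions drawn so far."""
        return sum(a * feature(x, seed=i, d=d) for i, a in enumerate(alphas))

    def train(X, y, num_iters, reg=1e-4, step=lambda t: 1.0 / (1.0 + t)):
        n, d = X.shape
        alphas = []                      # one number per iteration; nothing else is stored
        for t in range(num_iters):
            i = np.random.randint(n)     # sample a data point
            f_xi = predict(X[i], alphas, d)
            grad = f_xi - y[i]           # derivative of squared loss 0.5*(f(x)-y)^2
            # Regularization shrinks all previous alphas a little ("forget a little bit").
            alphas = [a * (1.0 - step(t) * reg) for a in alphas]
            # New alpha from the doubly stochastic gradient, using seed t for the feature.
            alphas.append(-step(t) * grad * feature(X[i], seed=t, d=d))
        return alphas

    # Toy usage on a synthetic regression problem.
    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.sin(X[:, 0]) + 0.1 * rng.randn(200)
    alphas = train(X, y, num_iters=300)
    print(predict(X[0], alphas, d=2), y[0])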
>>: So that reminds me of summation generation process and signal processing [inaudible].
>> Le Song: So here it's doing something more naive [indiscernible]. You're not doing anything smart in terms of
sampling the random features; you're just drawing from some fixed distribution. If you choose
the omegas more smartly you might get better convergence, but I don't
know how to do that.
>>: But then if you [inaudible] alpha I is very [inaudible] zero, meaning that the random
direction is orthogonal to the training function, then maybe you can ignore that I. Is that
correct?
>> Le Song: It's possible. We tried to explore the simplest version, so this way we don't do
any big bookkeeping; you just keep averaging all these random functions. But you can think
about extensions of this where you draw the random features more intelligently and do the
bookkeeping more intelligently, and potentially you can still prove convergence of the algorithm.
>>: I'm sorry. I'm not following, but are you describing to us the original technique of the
random [inaudible] or is this a new thing?
>> Le Song: This is the new thing.
>>: It doesn't make sense to me because it's exactly what they said.
>> Le Song: What they did is they first generated the random features and then they optimized
in the finite dimensional space.
>>: You can do that optimization stochastically, which is what people will do in practice. Is that
what you're saying?
>> Le Song: People don't do that in practice. They generate maybe just 1000 random
features; you're optimizing in just a 1000-dimensional space, and that's how people make it scalable
for these kernel methods. Then, if they want to try more random features, what they do
is generate another maybe 10,000 features and re-optimize. This approach is
essentially generating the random features on the fly.
>>: So you’re saying yours is more flexible?
>> Le Song: In some sense very flexible.
>>: How do you make sure you get the same omega if you generate [inaudible] on the fly-
>> Le Song: The way you generate the sample is with a pseudorandom number generator,
and then you can always supply the seed to the random generator. That's why I need to
keep track of the seed, to make sure that every time I sample exactly the same thing. The question is,
if you do something like this, is it going to converge? You have analysis for stochastic gradient
descent; now you added this second randomness to your stochastic gradient. The question is
whether it converges, and whether it converges at the same rate or not. So we also have some
analysis of whether this converges, and essentially if you do this doubly stochastic gradient
you'll find that the rate of convergence is the same under certain conditions.
Next: do you actually need to remember these random things? What you actually remember in the
algorithm is just the alphas. If you have 1 million data points you just remember 1 million numbers.
That's it.
>>: So I get confused. You can apply this technique to a linear kernel, right? So if I apply this
technique to a linear kernel [inaudible] features. So the process is [inaudible]. In that case
maybe it would be zero, because if I have a sparse feature [inaudible] zero, then the chance I get an
update would be very low, right?
>> Le Song: There's no advantage to doing this for linear features. The mapping of this
[indiscernible] in the linear case is exactly that you sample the dimensions uniformly at random,
because for a linear kernel you just do an inner product: it's a sum over the product of one
dimension, then the second dimension, and so on [inaudible].
>>: I think the problem for the linear case is that you don't need this
[inaudible] trick, because if you update any feature, the partial derivative for any feature,
you need to evaluate the [inaudible] anyway. [inaudible] every feature is correct.
>>: [inaudible] better technique [inaudible]. In some cases maybe you need to sample a lot of
times [inaudible].
>> Le Song: For linear kernel I don't suggest using this approach. This is for nonlinear case.
>>: So at each iteration you pick a new feature and you take a single data point. So
if you generate all new features, basically you'll observe only certain points.
>> Le Song: You can make these separate. You can take a mini-batch of data points, and the mini-batch
of data points can be a different size from the mini-batch of random features.
>>: So you just mentioned that you don't recommend this for linear and you use it for nonlinear,
but [inaudible] not really that different, so for your original data you can just do like a polynomial
[inaudible].
>> Le Song: Exactly. A polynomial [indiscernible] is already nonlinear. Actually a random
feature approximation for the polynomial kernel appeared recently [inaudible]. Exactly: you do some kind of
random hashing, and here you can think of this omega as performing some random
hashing of your data. Again, if you want to get the generalization bound, the number of
random hashes will have to match the number of data points you see.
Otherwise you're losing some performance, without any assumption on the distribution of the data. So the
algorithm is just that simple.
I mentioned this random feature evaluation gets log D instead of D because, for
instance, for a translation-invariant kernel you have this random feature where essentially you draw
some omega from a Gaussian distribution and do an inner product between omega and X. If
you have a bunch of random features, you have many, many of these omegas drawn from the
distribution, and to evaluate them you put these omegas in a matrix. You
essentially want to perform the product between this omega matrix and each one of your data
points. So omega is a random matrix whose columns are random vectors drawn
from some distribution, for instance a Gaussian distribution, and you want to perform this
product.
There's a way to perform this multiplication efficiently, a technique called Fastfood.
You can approximate this matrix omega by a product of several
highly structured matrices: a product H G Pi H B. The B matrix is
a diagonal matrix with entries minus 1 and 1 on the diagonal, uniformly distributed,
so the probabilities of minus 1 and 1 are the same. And then H is actually the Hadamard
matrix; it's very structured. If your D is 2 it is this, if 4 it is like this. You don't
actually need to compute this matrix, but the nice thing about it is that it allows you to do
very fast matrix-vector products. Pi is a random permutation matrix, and G is
a diagonal matrix of Gaussians. And the other H is the Hadamard matrix again.
So essentially you don't actually need to draw D times T random numbers from the
distribution. What you can do, supposing T equals D, is just draw a diagonal of
Gaussians and use these very structured matrices to mimic a huge random Gaussian matrix.
You can do that, and this structure allows you to do fast matrix-vector multiplication, and you can
actually even provably guarantee that after you perform this approximation the matrix-vector
products give you pretty much the same results. This is a result about random
matrices, and you can use it here to speed up the evaluation of these
random features on the data points. So this is just a speed-up you can do.
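As a rough illustration of the structured product being described (H G Pi H B times a data point, with H applied via a fast Walsh-Hadamard transform), here is a sketch. It omits the extra scaling that the full construction uses to match the Gaussian row norms exactly, so treat it as a schematic of the structure rather than a faithful implementation.

    import numpy as np

    def fwht(x):
        """Fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2.
        Equivalent to multiplying by the Hadamard matrix H without ever forming H."""
        x = x.copy()
        h = 1
        while h < len(x):
            for i in range(0, len(x), 2 * h):
                for j in range(i, i + h):
                    a, b = x[j], x[j + h]
                    x[j], x[j + h] = a + b, a - b
            h *= 2
        return x

    def structured_product(x, B, perm, G):
        """Compute (H G Pi H B) x using two fast Hadamard transforms,
        a sign flip (B), a permutation (Pi), and a Gaussian diagonal scaling (G)."""
        v = fwht(B * x)        # H B x
        v = v[perm]            # Pi H B x
        v = fwht(G * v)        # H G Pi H B x
        return v

    # Toy usage: d must be a power of 2 (pad the input otherwise).
    d = 8
    rng = np.random.RandomState(0)
    B = rng.choice([-1.0, 1.0], size=d)     # random signs on the diagonal
    perm = rng.permutation(d)               # random permutation
    G = rng.randn(d)                        # diagonal of Gaussians
    x = rng.randn(d)
    z = structured_product(x, B, perm, G)   # plays the role of omega @ x for d features
    print(z)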
>>: So this is only for the [inaudible]?
>> Le Song: This is for translation-invariant kernels. For some other kernels, for instance
rotation-invariant kernels, you can also derive a similar thing. It's not for every kernel. It
has to be a kernel where you draw this omega from some distribution, maybe a
Gaussian, multiply it by the input data point, and apply some nonlinear function
afterwards; then you have these types of results.
In terms of convergence we also provide some analysis of the
algorithm. Essentially what we try to do is find some function in the reproducing kernel
Hilbert space. Without the doubly stochastic approximation, the function we
get is going to be something like this: a weighted sum of the kernels applied at each
training point. But we make the random feature approximation, and then we get something like this:
essentially, if you use a translation-invariant kernel, you're going to get a
weighted sum of sine and cosine functions.
These functions themselves may not actually be in the RKHS; you're using functions in
another space to approximate a function in the RKHS, and you try to show the convergence. Here we just
show that if you take the function F_{t+1}, the function you obtain at the (t+1)-th
iteration, evaluate it at some point X, and compare to this F star, which is the
best optimal function in the [indiscernible], evaluated at X, the difference is going to be small. So
essentially you can decompose this difference into two terms, one by introducing a singly
stochastic kernel machine that only makes the stochastic approximation with the gradient coming from the data.
So essentially we can decompose the error into these terms, but this decomposition is only in
expectation, by the way.
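Schematically, the decomposition being described looks like this, where h_{t+1} denotes the intermediate, singly stochastic iterate that stays in the RKHS (notation mine, constants suppressed):

    \mathbb{E}\big[\,|f_{t+1}(x) - f^\star(x)|^2\,\big]
    \;\lesssim\;
    \underbrace{\mathbb{E}\big[\,|f_{t+1}(x) - h_{t+1}(x)|^2\,\big]}_{\text{error from random features}}
    \;+\;
    \underbrace{\mathbb{E}\big[\,|h_{t+1}(x) - f^\star(x)|^2\,\big]}_{\text{error from random data}} .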
One source of error is due to the random features and the second source of error is due to
the random data, so in the analysis we analyze these two terms
separately. For the second term you can just use standard analysis from
mirror descent, generalized to functions in the RKHS, and you get 1 over T kind of
convergence. For the first term we did something special for this problem: we used
a concentration inequality for martingale differences. Essentially, if you compare the
function in the RKHS H and the function outside the RKHS, the difference is
the sum of a bunch of [indiscernible] from the functions you obtain at each iteration. This
sequence forms a martingale difference sequence and you just apply a
concentration inequality. You also get a 1 over T rate, so together you get this kind of rate, a 1 over
T rate for the squared difference of the function values, and then you can also get a
generalization ability of 1 over square root of T.
So essentially that's how the analysis works, and essentially you get the best possible
generalization ability for this nonparametric estimation problem. The algorithm is very
simple; the question is whether it works or not. So we tried it-
>>: Could you go back one minute? So H is what you would get-
>> Le Song: Without applying this random feature.
>>: That is what the [inaudible] would do?
>> Le Song: Actually it's not. It's this guy, where you don't apply any approximation to
the kernel.
>>: [inaudible] the actual kernel itself?
>> Le Song: Yeah, the actual kernel itself. [indiscernible] is applying stochastic gradients using
random data and also the random features simultaneously. You never revisit those random
features; you just keep generating new ones.
>>: So the martingale is just doing the one by one?
>> Le Song: Yeah. The difference is [indiscernible] zero if you look at the difference between
the two. So V_i is the difference: using one random feature you're trying to approximate the
kernel. Conditioned on the previous randomness, the expectation of this guy compared to that
guy is zero, because the expectation of the random feature approximation
is equal to the kernel function. So conditioned on the previous randomness, because we
have the previous function here, this expectation is zero, and essentially if you look at the
difference it's the sum of a bunch of these.
>>: So why is everything squared there? Why is the difference between F,T plus-
>> Le Song: It's the squared difference, and then we take the expectation of it. This
decomposition is in terms of expectation. We just look at the squared difference between the
function values; it's easier to analyze. You could look at the absolute difference as well, but we just
look at the squared difference. And the second part is just mirror descent. Then, if you have
some properties of the loss function, you get this kind of rate for the risk, the expected loss.
>>: And what is this [inaudible]? What is that thing over, this constant? [inaudible] Up on the
top?
>> Le Song: The thing on top is primarily related to the kernel function. It's like an upper bound on the
kernel function. For instance, if you have an RBF kernel, then there, using the reproducing
property, you can express the evaluation of a function in the RKHS as an inner product between H and k(x, ·),
and then you use [indiscernible]. That's how you get the [indiscernible].
So we're going to apply this algorithm to the [indiscernible] training data set. You need to go to
maybe 10 million or even more data points in order to get state-of-the-art performance. So
we compared three models. One model, which we call the jointly trained neural net, is the original neural net;
we use stochastic gradient descent and manually adjust the learning rates three
times to get the best performance. Then we have a second model called the fixed
neural net. It's basically a [indiscernible] check: we take the convolution layers
learned by the first model and fix them, and just retrain the fully
connected nonlinear layers and the classification. In some sense this model is the closest
to our model. We are also going to reuse exactly the convolution layers learned by the neural
net, but we're going to replace the top part by this kernel machine. It seems that this
convolution-pooling is really doing something amazing, so we're going to keep it, replace the top part,
and use this doubly stochastic kernel machine.
>>: [inaudible] use a joint-
>> Le Song: It's not a convex problem anymore if you want to also optimize-
>>: It's not like the first one is a convex problem either.
>> Le Song: That's right. Actually I will show you that result, but at the moment I'm just focusing
on replacing this part by the kernel machine, for which you have a very nice guarantee, and seeing
whether it performs well in practice or not. So this is the figure you have seen before, that's
for training the deep neural net. If you train the fixed neural net, where you have already
learned the convolution layers and you just try to optimize the fully connected layers on top,
somehow you don't get the best performance compared to-
>>: Using all the same tricks of [inaudible]?
>> Le Song: This one is not. That's right, this one you have adjusted. Somehow at this point,
even when you adjust it, it's not decreasing anymore. Somehow there's some co-adaptation of
the convolution layers and the fully connected layers. When you train jointly you
get a better model, but if you fix one part and train the other one it seems that you're not
getting as good a result. Here you are still trying to optimize some kind of nonconvex objective
[indiscernible]. We don't understand why this is the case. But if you use this doubly stochastic
kernel machine, that's what you get: you have faster convergence, and you just use 1 over T
type learning rates. That's what you get.
>>: You have also this graph in terms of time?
>> Le Song: This is one week, and the time is comparable to the neural nets.
>>: So the iteration time is comparable.
>> Le Song: We also implemented our algorithm on GPUs, and it's comparable. Actually most of
the time is spent loading the data and performing the transformations in the convolution.
The convolution only has 3.7 million parameters but it actually performs most of
the computation; the [indiscernible] parameters later only take a fraction of
the time.
>>: But you only have to go through it once?
>> Le Song: No, not once. We can also go through a data point many times, because
every time you don't see exactly the same sample: you actually take a random
crop of the image, a random mirroring, some transformations. In some
sense you never see the same data point.
>>: So what is the red line [inaudible]?
>> Le Song: So it's not improving anymore.
>>: How do you know? Did you go to the-
>> Le Song: So we ran it somewhat longer and it doesn't.
>>: So it stops at 10 times less data than the green line.
>> Le Song: Yeah.
>>: So if the green line takes a week and you are at the same speed-
>> Le Song: At this point it's a week. The student is not patient enough to wait longer.
does just your training>> Le Song: A few days. It takes a few days as well. So I don't remember the exact time but it
takes a few days. So most of the time is actually loading the data and doing all this convolution
because our method is something new the student is not sure whether this method will work or
not. He tried many different things, some constant learning rate, and then he didn't go for one
week with each one of these experiments.
>>: What’s the dimension of the inputs [inaudible]?
>> Le Song: The dimension of the input to this layer is like 6000.
>>: [inaudible] then maybe if you just use traditional kernel learning to train it, it could
also be done probably in several days.
>> Le Song: Possibly. Hopefully someone can try it; that would mean that we really haven't tried
hard enough with kernels. Actually in this case we need 10 million or more data points
[indiscernible]. We need to go to that kind of scale in order to get state-of-the-art performance.
>>: [inaudible]?
>> Le Song: The original data set is 1.3 million images, but as I mentioned, you generate these
virtual images, right? You take the original image, crop it, flip it, do some
transformation, and you generate a slight variation of the original image.
>>: With so many samples you cannot do the traditional-
>> Le Song: It's harder. Maybe you can wait longer or use a larger machine, but for a typical
desktop it's difficult. We also tried some other data sets and the story is pretty much the
same. For [indiscernible] you also generate virtual images; that's why the data points can go up to
10 million. This kernel machine converges much faster, and then you get about the same
performance in this case. The data is simple, it's only 10 classes, so we actually get the same
performance as the neural net. You can also look at even easier data sets.
>>: So here you get a different picture from before, which is that the [inaudible] neural net
actually performed better than the [inaudible]. So could it be that, is the difference statistically
significant? Can you analyze something like that?
>> Le Song: It takes quite a long time. We just look at the test error, and if we ran it several
times we could probably average the curves; it's maybe not significantly different. The data set is
simpler and what I can say is they pretty much get the same performance. In this case I believe
we didn't adjust the learning rates.
>>: What is the loss function used with the doubly stochastic gradient?
>> Le Song: It's multiclass logistic regression, the same as for the neural nets. And
[indiscernible] is even simpler; you can also make some small transformations of
the images and generate up to a million data points. In this case our method gets a very small error,
below one percent. It's slightly faster, but I cannot say statistically whether that is
significant or not. This one is faster because the images are smaller and the neural net
is only two layers; it's not our model with these [indiscernible] layers, that is for
ImageNet. For the simpler data sets you use fewer layers, two or three layers, and
the images are also much smaller, so this one takes [indiscernible] a day, very fast.
You get similar types of results. We have also tried a regression problem, not just
classification problems. In this case we tried something like taking the three-dimensional
structure of a molecule and predicting some property of it, for instance
how efficiently the molecule converts sunlight to energy, things like this. In order to do that
you first have to represent this three-dimensional structure with some feature
representation. People have already come up with an invariant representation, because
in this case it's a three-dimensional structure: if you rotate the coordinates you get different
coordinates but it's the same molecule, so you need some way to represent this. They came
up with something called the Coulomb matrix, looking at the pairwise charges divided by
the distances, and you use this matrix, suitably permuted, as your data point. It's a regression problem;
the output is in the range of 0 to 12. You don't use
convolution-pooling anymore; in this case it's really just a fully connected neural net,
three layers, and then we just do this kind of kernel [indiscernible] regression. So this is just
comparing the fully connected network to the kernel machine.
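For concreteness, here is a minimal sketch of a Coulomb-matrix-style representation as just described (pairwise charges divided by distances). The diagonal term uses a common 0.5 * Z^2.4 convention, which is an assumption on my part rather than something stated in the talk, and sorting the rows is one simple way to handle the permutation of atoms.

    import numpy as np

    def coulomb_matrix(charges, positions):
        """Build a Coulomb-matrix-style descriptor: Z_i * Z_j / ||R_i - R_j|| off the
        diagonal, 0.5 * Z_i^2.4 on the diagonal (a common convention, assumed here)."""
        Z = np.asarray(charges, dtype=float)
        R = np.asarray(positions, dtype=float)
        n = len(Z)
        C = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    C[i, j] = 0.5 * Z[i] ** 2.4
                else:
                    C[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
        # Sort rows/columns by row norm so the atom ordering (a permutation) doesn't matter.
        order = np.argsort(-np.linalg.norm(C, axis=1))
        return C[order][:, order]

    # Toy usage: a water-like molecule (charges 8, 1, 1) with made-up coordinates.
    C = coulomb_matrix([8, 1, 1], [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
    print(C)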
So in this case we go through 2.3 million examples and look at the absolute error on a
test set; in this case you see fewer points, actually. The reason this curve looks this way is that we
have fewer points for the evaluation, and then the method [indiscernible] is smaller.
>>: So have you tried to [inaudible] method to the traditional-
>> Le Song: In the paper, yes, we have. We compared the two on data sets where we can
actually run the competitor. In that case, essentially what we show is the convergence rate is about
the same; neither is clearly better than the other in terms of rate of convergence or the final
classification accuracy. It's just that our method is more scalable, allowing us to try out this type
of data set. So what we find is that this fully
connected layer in the neural net is really not that essential: you can replace it by a kernel
method, a nonparametric method, and you get about the same performance.
What seems to be really useful is the convolution and pooling layers. So what we tried is exactly that:
we learn two things together, the kernel method classifier and also,
at the same time, the convolution layers, just by [indiscernible] derivatives. And
what you get is that you can also get close to the performance of the neural net. We just tried this recently.
We get 46; for neural nets you get 42. There's still a gap, but this is much better than random
[indiscernible]; random [indiscernible] gives 99 point something percent error.
>>: So is this better than doing this separate [inaudible]?
>> Le Song: It's about the same. In the separate case you have already learned these filters; that's
a much easier problem, you just learn a classifier. Here we [indiscernible] learn the filters too; we
don't have any guarantee for this, but somehow it works. You don't need to use the fully
connected [indiscernible] linear units: you put the kernel machine here, you adjust this, and it
somehow works. We didn't tune it extensively; we could possibly get even better results.
>>: So here you still have the convolution layers, you just learn them, you do the [inaudible]
from the kernel method.
>> Le Song: Right.
>>: So it's a fixed convolution [inaudible]?
>> Le Song: So it's like 44. We just tried it in the last few months and that's the result; it can
improve. Essentially, with this I'd like to summarize a little bit. The method that we used to
scale this kernel machine is really just this one slide: in addition to using a random batch of
training data you also use a random batch of random features simultaneously to approximate
the gradient, and you use this doubly stochastic gradient to update your function. In the end your
function is going to be a weighted sum of this randomness. That's the same representation as
the random kitchen sinks, but the way we use the random features is different: we generate the
random features online and we never revisit them. In practice you can revisit them; you can
blend between the two. The advantage of this approach is that you can actually use it to handle
streaming data. Suppose your data just keeps streaming in; you can just keep adding
random features to accommodate the increasing complexity of the data.
Of course, there is also an interesting problem: if there are data or classes you will never see in the
future, you want to forget about them. How do you do that nicely, such that you keep the overall
[indiscernible] still manageable? The approach can be applied to these convex objective functions, but
we also recently tried it for kernel PCA, Principal Component Analysis. There you solve some kind of
nonconvex problem; you try to maximize a convex function instead of minimizing it. You
can also use stochastic gradient to do that, and it doesn't prevent you from using the
doubly stochastic gradient. You can actually also provide a guarantee for the convergence if
your initialization is close enough to the span of the [indiscernible]. For Gaussian processes, you want
to estimate the predictive mean and covariance, and it can be set up as a convex optimization
problem; you can also use this approach there.
Memory-wise we just need to remember the alphas, one for each data point, so it's O(T)
if you run T iterations. In terms of estimating the function we get this rate, and in terms of
the generalization ability we get 1 over square root of T. That's the best possible for this kind of
nonparametric method with [indiscernible] assumptions. In practice it also works nicely;
that's the most important thing in some sense.
There are many other questions that haven't been resolved. At the moment I'm also looking
at these two questions at the top. What are the max pooling and convolution actually
doing? Why can't we use other architectures and get about the same performance? Some
people have gone to the extreme where you don't just have five layers of this; you might have
15 layers or maybe 20 layers with very small convolution kernels, and you get an even better
model. So it's amazing you can do this and get a model approaching human judgment.
And if this architecture works, there must be some characteristic of the data which is suitable
for that type of operation. If you apply this architecture to another domain, maybe language, it's
not going to work. There must be some characteristic of the data; what kind of characteristic
makes this architecture work? It's not clear. We are studying this. Hopefully we can
get some results in the future; if I come back I will give that talk. That's everything. Any further
questions?
>>: From the previous slide [inaudible] function [inaudible] perspective you get the same
[inaudible] as the single [inaudible]. So the question is [inaudible] do you observe more
variance?
>> Le Song: Yes, you have more variance. Essentially the variances add up in the bound.
>>: And also, even from the theoretical result you got the same rate in terms of T, but the
constant is much larger.
>> Le Song: Yeah, it's going to be larger. You have two sources of variance: one source is from the
data and the other source is from the random features. But it is convex optimization, so in practice you
will try to grab as large a mini-batch as possible, as large a batch of random features as possible,
as much as your memory can hold, and do this. You don't actually just use one data point. It's
actually interesting that it's different for neural nets. Somehow, in order to make these neural
nets work you have to use a mini-batch that is small enough, not too small or too large, maybe a
few hundred points or something, to see the best convergence for neural nets. It's a
nonconvex problem; somehow there's a best level of stochasticity that allows you to get a
better result. But here it's convex optimization. You really want to have as little randomness
as possible, so we will grab a large mini-batch. Any other questions? I will be here for a
few days. If you guys are free I would love to talk to you guys more.