>> Dengyong Zhou: It's time to start. So we are very fortunate to have Doctor Rong Jin
as our speaker today. Rong is a professor in the Department of Computer Science and
Engineering at Michigan State University. He currently leads the Seattle lab of
Alibaba. His research is focused on statistical machine learning and its application to big
data analysis. He has published over 200 technical papers. He received his Ph.D. in
Computer Science from Carnegie Mellon University in 2003, an NSF CAREER Award in 2006,
and the best student paper award at COLT in 2012. Today he will talk about
recovering the optimal solution by dual random projection. Actually this technique has
been successfully applied to Alibaba advertising systems.
>> Rong Jin: All right, so thanks. Thanks for the introduction. This is actually our work
from last year, and in fact we, to our surprise, actually implemented it as part of
Alibaba's display ad systems, and it seems to be working reasonably well. Okay, so just a
slight clarification: I used to be at Michigan State, and I have now actually joined Alibaba
full-time as the leader of the Seattle lab. Okay, so the outline is first of all I really want to particularly
highlight some of the challenges in solving optimization problems particularly involving
high-dimensional data. And then I also want to look at the property of random
projection that becomes the key component of our theory. And then we look at our
particular algorithm, which we call the dual random projection algorithm, and how it can
be utilized as a means of solving the high-dimensional optimization problem. And then
I will present our analysis, looking at different aspects of the data as well as the
solution to see what recovery guarantees can be obtained.
And we also have an extended version of the algorithm which eventually allows almost
perfect recovery instead of just recovery with a certain error bound. And I'm going to look
at experimental results based on UCI data sets as well as some of the highlights of
our study utilizing this algorithm in the online advertisement system which is now
deployed in Alibaba. Then I'll conclude the talk. So I think almost everybody
knows the idea of random projection. This is an old and very effective idea. Basically
the idea is that I have a very high-dimensional vector, say x, and I can map this high-dimensional
vector into a relatively low dimension by multiplying with a random matrix. In this
case we are really focusing on the Gaussian random matrix. To a large degree you can
generalize to other classes of random matrices.
So this is a very simple idea but it's really a very powerful idea, because the main
supporting theoretical evidence for utilizing random projection is the JL lemma. So
basically the JL lemma says that if you have a finite set of data points, x1 to xn,
then you can utilize a random projection whose size is on the order of log n over epsilon squared.
With that many random projections you actually can well preserve the
distance between any two data points up to an error of epsilon. I think most
people are familiar with it so I don't need to further elaborate on the JL lemma.
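For reference, here is one standard way the JL lemma is stated; this is a sketch with constants omitted, and the exact form on the slide may differ:

```latex
% For points x_1,\dots,x_n \in \mathbb{R}^d, any 0 < \epsilon < 1, and a suitably
% scaled Gaussian random matrix A \in \mathbb{R}^{d \times m} with
% m = O(\epsilon^{-2}\log n), with high probability, for all pairs i, j:
(1-\epsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|A^\top x_i - A^\top x_j\|_2^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|_2^2 .
```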
So the idea of JL lemma actually has been widely used in various tasks like classification
and clustering, regression and manifold learning. So this actually becomes the
foundation for utilizing or justifying random projection in many learning tasks.
Okay, so now let's very much focus on classification, or high-dimensional classification
problems. This has actually become the theme of this talk. So consider you have a lot of
training examples – xi, yi – and each xi is a data point in a d-dimensional space.
And let's assume that these are very large numbers. Your goal is to learn a classifier, f,
which essentially maps any vector in the d-dimensional space to a discrete value,
plus 1 or minus 1. The straightforward way would be, okay, let's just learn the
classifier in the original space; therefore, you specify your loss function – here it is l – and then
you basically solve the regularized empirical loss, and that usually gives a solution
with a reasonable generalization error bound. So w star is the solution that
minimizes the regularized empirical loss. And then you can build the classifier just based
on the sign of x transposed w star.
Now the only shortcoming with this simple approach is that as the dimension of w gets to
be very high, the effort of solving this optimization problem is going to be very
challenging. Okay, now one way to address the challenges that arise from high-dimensional
data would be to try to reduce the dimensionality of the data. And the
simplest way to do that dimension reduction would be through random projection. So
here is the idea that actually has been utilized by many people. So what you can do
is first generate this random matrix, a random Gaussian matrix in our context. So A is
a random Gaussian matrix. This random Gaussian matrix allows you to map the
d-dimensional vector xi into xi hat, which is only of m dimensions. So the assumption would be
that m is much smaller than d; therefore, by this transform you eventually map
very high-dimensional data points into very low-dimensional data points.
Now with this mapping, you have all the data points living in m-dimensional
space instead of d-dimensional space. Now you can learn the classifier in this low-dimensional
space; let's call it z. And this is almost an identical optimization problem; the
only difference is that now you only deal with data points of m dimensions instead of d
dimensions. As a result, you can substantially reduce your computational cost in terms of
solving the optimization problem. Now let's say z star is essentially
the solution you get from this low-dimensional optimization problem. Then your final
classifier would essentially just be based on z star. So for any data point x, you're
going to map it to the low-dimensional space using the random matrix A. And
then after this transform you can just directly take the dot product with z star, and the sign of this
dot product becomes your classification result.
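A minimal NumPy sketch of this baseline pipeline, assuming a Gaussian projection and, purely for illustration, a ridge-style learner in the projected space (the talk does not fix a particular learner):

```python
import numpy as np

def naive_rp_classifier(X, y, m=50, lam=0.1, seed=0):
    """Project the data with a random Gaussian matrix, learn z* in R^m,
    then classify any new point by sign((A^T x)^T z*) = sign(x^T (A z*))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))   # random Gaussian projection
    X_hat = X @ A                                          # n x m projected data

    # Learn z* in the low-dimensional space; a ridge-regression surrogate is used
    # here only to keep the sketch short -- any regularized learner would do.
    z = np.linalg.solve(X_hat.T @ X_hat / n + lam * np.eye(m), X_hat.T @ y / n)

    w_hat = A @ z                                          # pull the solution back to R^d
    return w_hat, lambda X_new: np.sign(X_new @ w_hat)
```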
In other words you can also redefine your w hat as the product of A and z star, and
this w hat essentially becomes your classifier in the d-dimensional space. So this is a very
commonly used methodology to deal with high-dimensional data. Now the question is
how good is this w hat? Can we have some way to characterize the performance of w
hat? And that has actually become the starting point of this work. The very first question
you probably want to ask is, "When I learn this w hat from this random projection
solution, can we characterize its classification
performance?" Now this actually has been done by many people; I'll just cite one of the
recent papers. So in general the conclusion is that if the original data can be linearly
separable with a margin then you really have a good chance to get a good classifier by
using the random projection. Actually the statement is this: if the data set can be linearly
separated with a normalized margin gamma, then if you have a sufficiently large number of
random projections, with high probability all the data points can still be linearly
separated with a slightly reduced margin.
The claim would be that if you can find a very good classifier based on the original data
points, then with high
probability you can also find a good classifier with respect to the reduced dimensionality. Now
apparently this statement is slightly [inaudible] because this only tells you the results with
respect to the training data, but there's another statement that allows you to
extend that to generalization error bounds. But again everything was under this
big assumption: the guarantee that w hat will have good performance holds only under the
assumption that the data is linearly separable with a decent margin. So in the case that
the data is not linearly separable, or if the data can only be separated with a very small margin,
then everything they promised over here is going to fall apart. Therefore, it comes
to the question: what happens if the data is not linearly separable? What happens if the
data is separable but only with a very small or even zero margin? What kind of
performance are we going to expect if we apply random projection to the data and learn
the solution based on the randomly projected data?
To my big surprise, there are few answers in that territory, even though it
appears to me that the answer already exists in the old literature of geometric functional
analysis. So this is actually the question we ask. In the case that the data
cannot be linearly separated, or can only be separated with a small
margin, what you're really looking at is how well the solution w hat can
somehow approximate the ideal solution that you learn from the original data points
without any random projection. So the fundamental question would be: is there any
chance this w hat could be a reasonable approximation of the w star that is learned from the
original data points? Yes?
>>: [Inaudible] approximation here do you mean similar classification accuracy or do
you mean...
>> Rong Jin: No, it's actually – it has to be a similar vector, because in the case where w star
either makes a lot of mistakes or has a lot of data points that lie exactly on
the boundary, you really want to approximate w star itself rather than – You see what I'm
saying, right?
So it turns out the answer is absolutely no. In other words no matter how you design the
learning of the z star, there is just no chance you can have a w hat that eventually
becomes a good approximation of w star. And this actually comes from primary
results in geometric functional analysis. So the statement is as follows: consider a
subspace E in the space of d dimensions. So E is a subspace of your d-dimensional
space with a dimensionality of m. Let's
consider any random subspace of dimension m within the space of dimension d. So
what you can say is that if you fix any point x in this d-dimensional space, and then you
look at how far away this point x is from the random subspace E, the claim is that
in general this fixed data point x will be very far away from any random subspace E as
long as the dimensionality of E is small.
So a more accurate statement would be that the probability that the distance between the
fixed data point x and a random subspace E is smaller than epsilon times a quantity – this part
is roughly of order 1 if m is substantially smaller than d – is very, very
small; it's epsilon to the power of d minus m. For instance, if we actually
fix epsilon to be one-half, then you can roughly claim that with very large
probability the difference between w hat and w star, measured by the
Euclidean distance, would actually be larger than or equal to some constant of order
1 multiplied by the norm of w star.
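A quick numeric illustration of this phenomenon (my own sketch, not from the talk): a fixed unit vector in R^d stays far from a random m-dimensional subspace whenever m is much smaller than d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 1000, 20, 200

x = rng.normal(size=d)
x /= np.linalg.norm(x)                             # fixed point of unit length

dists = []
for _ in range(trials):
    Q, _ = np.linalg.qr(rng.normal(size=(d, m)))   # orthonormal basis of a random subspace E
    proj = Q @ (Q.T @ x)                           # projection of x onto E
    dists.append(np.linalg.norm(x - proj))         # distance from x to E

# E[dist^2] = (d - m) / d, so the distance concentrates near 1 when m << d.
print(min(dists), sum(dists) / trials)
```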
>>: [Inaudible].
>> Rong Jin: Right.
>>: So eventually you care about the [inaudible]. [Inaudible] to w star.
>> Rong Jin: Right. So maybe I should elaborate on this point further. As I said, the
original statement is all based on the fact that you have a classifier and all the data
points are essentially far away from this boundary. That's how, actually, you can get
some guarantee. And the problem comes when the data points are actually very close to
the boundary; that's where the performance gets very bad. So as a result what
happens is if you have this – Whoops. Okay. So here's my w star. If actually
you have the w hat, it turns out the angle between these two is substantially large. Then
any points that are almost perpendicular to the w star would actually get very unclear
results. They could get a completely different prediction compared to the w star. I don't
know if I'm clear?
My point is that if you consider any x, right, if this quantity is very small, then if you
rotate w star by just a small angle you can very likely get a very different sign.
As a result all the good performance you can promise for w star would
fall apart if you use w hat. I don't know if I make my point clear. And,
therefore, I really want to be sure that the difference between the w star and the w hat is
going to be small enough in a way that I can handle it even for the data points that
essentially have a small dot product with w star. Yes?
>>: Just trying to understand your statement. So...
>> Rong Jin: I'm sorry, which statement? This statement?
>>: Yeah, yeah. This entire discussion.
>> Rong Jin: Sure, sure, sure.
>>: So say that the data is in a low-dimensional [inaudible] or low-dimensional
subspace. So your distribution I kind of – it's not using the entire space. In this case you
can think of many w stars that would be very different from each other that would have
the same [inaudible].
>> Rong Jin: Right.
>>: So it's not kind of – The statement that you are making is correct in the sense that if
I can find w hat which is close to w star then I'm guaranteed to have [inaudible].
>> Rong Jin: Right, right.
>>: But it's not necessarily the other way around.
>> Rong Jin: I agree. I agree. So I think this is a good clarification. The point I tried to
raise is that previous studies essentially look at the scenario where all the x
roughly have a large dot product with w star. As a result even if I have some angle – so
even if w hat is at some angle from the w star, you still do not suffer from a large
[inaudible]. So all I tried to argue is that if we're looking at more challenging cases, that is,
many of the x essentially have a small dot product with w star, then those statements fall
apart. Apparently there are other ways of addressing this question, but one more
ambitious way to address the question would be: if I can come up with a way to make a
good approximation of w star by utilizing random projection, then I almost can avoid this
kind of mistake. So I do actually take your point. I think this is a legitimate
point from you. I hope this settles the issue here? Okay?
So, nevertheless, what I try to come back to over here is that essentially random
projection, from the viewpoint of recovery, really does a very, very poor job. There is
almost nothing you can do about it. This is because of the geometry of the high-dimensional
space. In a high-dimensional space, a low-dimensional subspace only occupies a small corner.
We can almost guarantee that most of the points in
this high-dimensional space will be far away from this random subspace, and there's
nothing you can do about it. And, therefore, from that viewpoint a random projection-based
approach is not going to work out. And actually this is what we can see in
practice. So in the case that the data is really, really easy, random projection gives a very
desirable outcome; however, as soon as you come to the territory of typical classification
problems, random projection always falls apart if you just use this simple idea.
So I hope that message will be...
>>: Can you just clarify the constants here? C, d and m are – d is the dimension, m is -?
>> Rong Jin: All right, so d is the original dimension, a very high dimension. E
is the subspace with dimensionality m. And this is a random subspace. So you pick
a random subspace from the high-dimensional space of dimension d, and the dimensionality of this
random subspace is m. So the point is that in general any fixed point will be far away
from most random subspaces. Is that clear? Okay. All right, so that actually
motivated us to ask the following question, that is: is there any way I can design an
algorithm such that the resulting solution would actually be a good approximation
of the w star that you learn from the original data? Basically our target is this w star
that you learn from the original high-dimensional
space. What's convenient from the viewpoint of computation is to learn z
star from a very low-dimensional space. Is there any procedure we can create such that I can
construct a solution w hat, based on the z star, that essentially gives us a
good approximation of w star? And that's the whole point of this talk.
The idea actually is very simple. It's really based on a few simple observations. Just for
the sake of completeness let me briefly talk about the primal problem, which I think most
people are completely familiar with. Basically you have this original optimization problem
which lives in the primal space, and then you essentially can create a new
version of the original problem by, for instance, writing the loss function in the form of its
convex conjugate; that eventually will turn this entire minimization problem in the primal
space into a maximization problem in the dual space. So here the variable alpha is the
dual variable with respect to w, which is the primal variable. So I think there's pretty
much no need to elaborate on those points.
Now again the relationship between the primal and the dual is very clear under the
L2 norm regularization. So basically if I know the dual solution alpha star then
you can easily write down the primal solution w star. On the other hand if you know
w star then you can easily write down the solution for the dual by taking the derivative of
the loss function. So here, actually, I assume the loss function is differentiable,
just to make life easy. So let's look at the problem we have. For the original problem we
can connect the primal w with its dual. Well, we have the
same connection for the problem after you have applied the random projection.
So for random projection you would again have the same idea. You have a primal
solution now, z. This is in the subspace R^m. And then, also based on z, you can
derive its dual variable, which is alpha hat. So similarly to before: if I know alpha hat
star then I will be able to derive z, and vice versa. So those are standard facts.
Based on these two observations, the next most important observation we have is that it
turns out the two dual solutions, alpha star and alpha star hat, are well
connected. So D1 is the dual problem for the original dimensionality. And D2 is the dual
problem after we perform the random projection. So you can see the only difference
here essentially is that in between X transposed and X you're getting this intermediate matrix A A
transposed.
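As a sketch of the relations being described, assuming the L2-regularized primal above with a differentiable loss l and its convex conjugate l*, and a Gaussian matrix A scaled so that E[A A^T] = I (the constants may differ from the slides):

```latex
\text{(D1)}\quad
\max_{\alpha}\; -\frac{1}{n}\sum_{i=1}^{n}\ell^*(\alpha_i)
  \;-\; \frac{1}{2\lambda n^2}\sum_{i,j}\alpha_i\alpha_j\, y_i y_j\, x_i^\top x_j ,
\qquad
w_* = -\frac{1}{\lambda n}\sum_{i}\alpha_i^*\, y_i\, x_i ,
\quad
\alpha_i^* = \ell'\!\big(y_i\, x_i^\top w_*\big).

\text{(D2)}\quad
\text{the same problem with } x_i^\top x_j \text{ replaced by } x_i^\top A A^\top x_j ,
\qquad
z_* = -\frac{1}{\lambda n}\sum_{i}\hat{\alpha}_i^*\, y_i\, A^\top x_i ,
\quad
\mathbb{E}\big[A A^\top\big] = I .
```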
So the connection between these two is based on this very simple observation. That is, if
you take the expectation of X transposed times A times A transposed times X, it turns
out this expectation gives you exactly X transposed times X. In other words, the
D2 version of the dual problem is more or less an approximate version of
the original dual problem. So, therefore, there's a good chance that we can come up with
some kind of analysis that allows us to say that under certain conditions the optimal dual
solution for the randomly projected data points would actually be a good approximation of the
dual solution for the original data points. And that actually becomes the key of this
algorithm.
So here's the key observation. What we can do is we actually can construct the w star
based on the dual solution alpha star. And we also can compute a z star, and that route
is through the alpha star hat. And we also claim that alpha star and alpha
star hat have a good approximate relationship. So if we piece this together, it
actually gives a simple way to construct an approximation of w based on the solution for the
randomly projected data points. Yes?
>>: Can you explain [inaudible] why is it easier to get the approximation of alpha than it
is to get the approximation of the original [inaudible]?
>> Rong Jin: So here I'm just basically stating the intuition, so...
>>: [Inaudible]
>> Rong Jin: The analysis is later. All I tried to say is that, just by looking over here,
because A A transposed essentially gives you a good – it's essentially an approximation of the
identity operator in expectation. Based on this observation we somehow can claim that D2 is an
approximate version of D1. And as a result we should expect the solution of D2 to
be a good approximation of the solution of D1, and that actually becomes our foundation.
I don't know if I make my point clear here. Okay, so the algorithm becomes – at least
the highlighted structure becomes very simple. So what we do is we first compute z star.
This is the optimal solution for the randomly projected data points. And then we can compute
alpha star hat. This is the dual variable based on z star. And as we claim,
this alpha star hat will be a good approximation of the alpha star. So I can
essentially substitute alpha star hat in place of alpha star, and then I actually can construct
an approximation of w star. So that's the logic we have: starting from z star, derive the dual variable and use the
dual variable as an approximation to construct w, which is the solution in the
primal space.
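A minimal NumPy sketch of these steps, under the same assumptions as the formulas above and using logistic loss purely for concreteness (my illustration, not the exact implementation from the paper):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def dual_random_projection(X, y, lam=0.1, m=50, n_iter=500, seed=0):
    """X: (n, d) data matrix, y: (n,) labels in {-1, +1}.
    Returns w_tilde, an approximation of the full-dimensional solution w*."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Step 1: random Gaussian projection, scaled so that E[A A^T] = I.
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
    X_hat = X @ A                                       # (n, m) projected data

    # Step 2: solve the low-dimensional primal for z* by gradient descent on
    # lam/2 ||z||^2 + (1/n) sum_i log(1 + exp(-y_i <x_hat_i, z>)).
    L_smooth = lam + np.linalg.norm(X_hat, 2) ** 2 / (4 * n)   # gradient Lipschitz constant
    z = np.zeros(m)
    for _ in range(n_iter):
        margins = y * (X_hat @ z)
        grad = lam * z - X_hat.T @ (y * sigmoid(-margins)) / n
        z -= grad / L_smooth

    # Step 3: dual variables from z*: alpha_i = loss'(y_i <x_hat_i, z*>).
    alpha = -sigmoid(-(y * (X_hat @ z)))

    # Step 4: reconstruct in the ORIGINAL space using the full data matrix.
    w_tilde = -(X.T @ (alpha * y)) / (lam * n)
    return w_tilde
```

The point to notice is the last step: the dual variables come from z star, but the reconstruction multiplies them with the original, un-projected data matrix X.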
Okay? I guess this is just the algorithm with its highlights over here. Essentially it's just
more details and an elaboration of the key steps I pointed out before. So let me just give
some of the analysis results we have with respect to this simple algorithm. First of all,
the first case assumes that X, the data matrix including all the
data points you have, is actually a low-rank matrix; then you
can have a very strong guarantee. So the claim is – r here is the rank of X. If m is
actually large enough, roughly on the order of r log r, then with high
probability the solution we construct based on the dual random projection algorithm –
call it w tilde, which is based on the random projection – and w star, the
difference between these two would be small, in the sense that it would be smaller than roughly epsilon
times the norm of w star. And epsilon is on the order of 1 over the square root of m. Okay?
Roughly speaking, what we said is if the number of random projections is larger than r log
r, then you can claim that the solution computed by dual
random projection, w tilde, satisfies: the difference w tilde minus w star should be roughly less than or equal to the
order of the square root of r over m times the norm of w star. Yes?
>>: So what would the [inaudible]?
>> Rong Jin: So there could be two possibilities: one is lots of features are redundant;
basically they are linear combinations of the others. Second would be there are a lot
of data points that eventually are linearly dependent. Either way, I think. But this is
a very, very strong, restrictive condition; on the other hand it is actually the easiest scenario to
[inaudible]. In the case that the data matrix is of low rank, you actually can generate a
small number of random projections and you can guarantee that the solution discovered
by dual random projection will be very close to w star. And the closeness essentially
is measured by one over the square root of m. Now, second of all, let's look at a slightly more
challenging case. What happens if X is not a low-rank matrix? What if X is a full-rank
matrix? In this case we actually have to make an additional assumption, because
without any assumption apparently there's no way you can claim the random projection
algorithm [inaudible]. In this case we actually introduce a quantity called rho. This rho
essentially measures how well, or how concentrated, w star is in the subspace of X.
So the measure is as follows. Here U r bar is the matrix that
includes the smallest d minus r left singular vectors. By multiplying U r
bar with w star, essentially I'm going to project the w star
vector into the subspace spanned by the lowest singular vectors of the data matrix X. So
the claim is that this quantity tells you how small that projected vector is
compared to the original length of the vector w star. A small rho implies that
w star is more concentrated on the top eigen-subspace of X. So that's the quantity we
introduce. This is kind of [inaudible]; I think it's better to just look at a simplified
version of the results.
So the result states that – let me go here. Basically it says
that if the (r+1)-th singular value of the data matrix X is small enough, and also if the
number of random projections is large enough, then roughly the difference
between the solution constructed by dual random projection, w tilde, and the
optimal solution, w star, will eventually be smaller than this quantity.
Okay? So as you can see this quantity has two components. One is essentially just
the square root of r over m; this is identical to the previous case when we assumed the
matrix is of low rank. The second is actually a function of rho. So the more concentrated w
star is in the top eigen-space of X, the smaller this quantity will be. So if the rho is
small enough then we still can claim that the difference between the recovered
vector and the optimal one would be smaller than 1 over the square root of m [inaudible] over
[inaudible]. Okay? So this is the second result.
Even in the case that the data matrix is of full rank, as long as the solution is
reasonably concentrated on the subspace spanned by the top eigenvectors of the data matrix
X, you still have a good guarantee by running this procedure. Okay? So our last result
is related to sparsity. In the case that the optimal solution is sparse, you can still get
some promise out of this analysis. Again, here we have to make an additional
assumption in addition to sparsity. So here we introduce – sorry, this is an error; there
shouldn't be a bar here. Let me use S to specify the support set of w star. As I said, we
assume w star is a sparse vector supported on a small number of components. So S is
the support set of w star. What we introduce is the quantity eta, which measures
how concentrated the data matrix X is on those features that are captured by w
star.
This eta is based on the difference between X transposed times X and X_S transposed times X_S. This X_S
is the data matrix that only includes those features that appear in your optimal solution.
So this quantity essentially measures how well the data matrix is concentrated on those
features that appear in the final optimal solution w star. And the expectation would
be that the smaller this eta is, the higher the concentration of the data
matrix on those useful features, and as a result you get a stronger guarantee on the
results. So here are the results: in the case that this eta is small enough, and if the
number of random projections is large enough – by the way, small s is the size of the
support set capital S – then you can still
claim that the difference between w tilde and w star is going to be smaller than the
square root of small s over m.
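In symbols, the sparse-case guarantee as stated, reading the bound as relative to the norm of w star in line with the previous results (s = |S|, the size of the support of w star):

```latex
\eta \text{ small enough, } m \text{ large enough}
\quad\Longrightarrow\quad
\|\tilde{w} - w_*\|_2 \;\lesssim\; \sqrt{s/m}\;\|w_*\|_2
\;\;\text{with high probability.}
```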
So this roughly summarizes most of the analysis we have regarding this dual random
projection. I don't know if there are any questions about...?
>>: So back to your systems, what is this constant value?
>> Rong Jin: We have no idea, so we can only do trial and error, unless we measure
the entire eigen-space. Unless we know, for instance, the
w star, I cannot tell you up front what the eta would be.
>>: I have a question. Sorry.
>> Rong Jin: No, please go ahead.
>>: Again, we started out by looking at the kind of naïve random projection and you
argued that this will return a result which is far away from w star?
>> Rong Jin: Right.
>>: But at the end of the day we care only about the angle between them, and I'm just
trying to understand the difference between this technique and the naïve technique.
>> Rong Jin: Naïve in what sense?
>>: The one that doesn't work. The one which does a projection in the primal.
>> Rong Jin: Right.
>>: So you are given that the distance will be large. The large distance could be
attributed to a big angle or it could be just a different length.
>> Rong Jin: Right, right.
>>: So what we care about is the angle.
>> Rong Jin: Yes – I think this is a subtle point. In terms of classification
we do only care about the angle, but if we're looking at a general loss
function then the magnitude of w does – so if somebody gave you a general loss function then
the magnitude of w does actually affect the outcome. I don't know if...
>>: [Inaudible] proved that indeed for the naïve primal result w star and w you get are
very, very far apart but the angle between them is tiny. So it's only...
>> Rong Jin: Oh, oh, I see. Okay...
>>: [Inaudible] the argument [inaudible]. I don't think this to be true; [inaudible].
>> Rong Jin: Okay. Yes, yes, so let's first of all be clear – maybe I should make the
statement more precise. So actually I particularly said that x is a vector of unit
length. So actually I completely wipe out the size of the vector x; in terms of distance, in
this case the distance really is captured by the angle. Therefore, here the statement
indeed relates to the angle you are concerned about. So as a result the original
solution is [inaudible] – The first statement I really want to emphasize is that by utilizing
the random projection you are guaranteed to get a solution which makes a very large
angle with the desired solution you are looking for.
>>: So w hat here is not the solution to the [inaudible] problem; it's a normalized version
of that?
>> Rong Jin: Yes, yes. In this case it's really the angle. So we are looking at the fixed
points with unit length and then looking at what would be the best
approximation in your subspace with respect to that. Yes?
>>: Is it possible – So you showed from your analysis [inaudible] using this dual method
you can minimize – you can get a much smaller distance between w hat and w star. Can
we tie that back to the error?
>> Rong Jin: So you mean generalization error?
>>: Yeah.
>> Rong Jin: Yes, actually thank you very much for pointing that out. Our [inaudible]
theory did actually include generalization error, but somehow here I keep on emphasizing
the recovery, so I did not put that over here. I mean, my title starts with the recovery
error, so the generalization error is somehow not my main focus here. Yes?
>>: In the algorithm w tilde is a deterministic function of z star.
>> Rong Jin: W tilde is the deterministic function of z star.
>>: So in terms of, like, classification error, the classifiers of z star and w
tilde are equivalent.
>> Rong Jin: Right. Sorry, say it again.
>>: So because w tilde is the deterministic function...
>> Rong Jin: I think [inaudible] then I really don't want...
>>: ...of z star...
>> Rong Jin: ...to jump into – Yeah, go ahead.
>>: Yeah, because w tilde is determined by z star.
>> Rong Jin: Right.
>>: And if you only care about the classification error then you compare those two.
>> Rong Jin: Okay, okay. So I think the statement here is – I see the point. I agree
with you [inaudible]. We have a z star and then we are getting w tilde, and going
from z star to w tilde is a deterministic function – I'm slightly concerned with this
statement because it is under the assumption that you fix the random projection,
because I do actually have to utilize the random projection as part of the reconstruction of w
tilde. Maybe I should really elaborate on that.
>>: Yeah, we can fix the random projection and then the relation is deterministic. And
the classifier defined by z star and the classifier defined by w tilde are equivalent.
>> Rong Jin: What do you mean by equivalent? That's the statement I don't know. So
what I want to say is, basically we build a transform: take the z star and transform it into w tilde.
And all I'm trying to say is that this transform depends on the random projection you
have. That's the reason I'm really concerned about the statement you
make, deterministic, because when you say deterministic you somehow hint that the transform
is independent of the random projection [inaudible] you make. So all I'm trying to say
is that the transform I build depends on the random projection that's introduced. And as
a result I'm not sure what exactly you mean by equivalent.
>>: By equivalence I mean if you take any input x and you put it into the classifier
defined by a random projector and z star or a classifier defined by w tilde, do they
always output the same...
>> Rong Jin: I think this cannot be true; otherwise, why do we have deep learning? Why
do we have kernel learning? They're all based on the same set of input vectors. If your
statement is right then any algorithm would be equivalent as long as it sees the
same input.
>>: But he's only pointing to the dual parameter spaces you [inaudible].
>> Rong Jin: But z star is in a very small parameter space, only of m dimensions.
[Inaudible] is a very high-dimensional space. Think about deep learning, right? They take
this very low-dimensional space – or kernel learning – and they eventually generate a much
larger number of features, and that capacity enables them to improve the classification
accuracy. So I'm not sure I can accept the equivalence.
>>: So z star is obtained by projecting your [inaudible] data into a low-dimensional
space.
>> Rong Jin: Right.
>>: So it only contains the information of the low-dimensional space. And because w
tilde is a function of z star...
>> Rong Jin: I totally understand. But the dual variable they utilize depends on the
original data points. So after I get z star I actually generate the dual and the dual
depends on the original data points. Oh, sorry. That's not what I'm saying. The
reconstruction depends on the original data points.
>>: Okay.
>> Rong Jin: So that's the difference. Everything over here, I'm really working in the
subspace. But when I reconstruct w tilde, I actually take – basically I assume that this
alpha hat star is a good approximation of the dual variable. And I take that and I
basically just plug in the relationship between the primal and the dual. But here now I'm
looking at the entire data matrix.
>>: I see.
>> Rong Jin: So, sorry. I think I shouldn't skip that part but I looked at the time and it's
really limited. But this is an excellent question. I hope that clarification makes sense. So
indeed we do actually look at that part, the original data. And that makes a difference. So
if I work only within the random subspace then I should actually suffer from the same
guarantee, or the pessimistic results, that have been stated in geometric functional
analysis. But we did actually go one step further by utilizing the original data matrix as
part of the reconstruction procedure. Yes, please?
>>: How do you choose m? [Inaudible]
>> Rong Jin: Right. Good question. I actually don't have any good answer. I think all I'll
try to say – I agree with you. One thing we don't know is that we don't know really the
property of the data, the property of w star. If we know the property of w star probably I
already know w star. So...
>>: [Inaudible] small m [inaudible].
>> Rong Jin: Right, I agree. I think there's another step, which is how to make it practical
or more effective rather than just scanning through the entire range of m. Maybe there is some
way that you can quickly narrow down the number of random projections, which I really
don't know.
>>: But you use it in application?
>> Rong Jin: Right. We actually do the cross-validation.
>>: Okay.
>> Rong Jin: But it's in a naïve way so [inaudible]. It's not the most ideal way. I don't
know if there're any other questions. Yes?
>>: Did you try some fast Johnson-Lindenstrauss transforms? I mean, all the
results on your slides are based on the Gaussian random projection.
>> Rong Jin: Right, right. Excellent, excellent. Yes.
>>: So some [inaudible]...
>> Rong Jin: This is actually an excellent question, because in fact, unfortunately, I
cannot elaborate in more detail on the work I've done with [inaudible]. In those cases
doing the random projection actually becomes the heaviest computational part; the
rest would be a small story. So we did actually go further – I think recently we further
proved that – one way to do that is by using the faster...
>>: Faster [inaudible].
>> Rong Jin: Right. But also there's another line of research which actually is coming
out of the [inaudible]. They're actually using sparse matrices for [inaudible]. We actually
explored that theory and we can show that in the case that the solution is sparse – now it's
really looking at the sparse case – then even with the sparse random projection things
work out. In practice that seems to be the case. But those are excellent questions.
>>: And I have another one, sorry. But, otherwise, for your results for the exactly sparse
w star, I'm wondering, for that model did you consider L1 regularization or just L2?
>> Rong Jin: Again, excellent question. In fact with L2 you really cannot get an
exactly sparse solution. We do actually have results for approximately sparse solutions, but I
thought it got too dense, so I actually wiped it out. So we do have very similar
results. It's pretty much like this idea. We actually get an additional [inaudible]
coming out of the approximate sparsity [inaudible]. In the case that the solution is very
well approximated by a sparse one, I can claim that roughly the same guarantee
holds. I don't know if there are any other questions? Okay, so I know that I'm
pretty much running out of time. We did one more version. All the previous
results only give you an approximate solution, with an approximation error on the order of 1
over the square root of m. And we actually further designed an algorithm which allows you to
reduce the error in a dramatic way.
Basically the logic is very simple. If I run this procedure for one step then roughly I
can reduce the error by a factor of epsilon. So it's very appealing if I can actually run
this procedure multiple times and each time get an epsilon reduction; then I can
have an exponential reduction in error. Indeed that's the case, right? So consider: in
the second step, instead of targeting the w star, I'm going to target the
difference between the w star and the w tilde, and that becomes our target for
approximation. Now if I apply a similar algorithm to approximate delta w then intuitively I
should get the same guarantee: delta w tilde, the approximate version of
delta w, would differ from it by an order of epsilon times delta w.
Now if you construct the new solution by adding w tilde to the delta w tilde then you
can easily see that this new solution will actually have epsilon squared error.
Therefore, it's very easy to run this procedure multiple times, getting a geometric
reduction in terms of the approximation error. Yes?
>>: Something about [inaudible]. So in order to do this [inaudible] random projection you
[inaudible] but you use data...
>> Rong Jin: To recover.
>>: To recover.
>> Rong Jin: Right.
>>: But you don't really know the [inaudible] between w star...
>> Rong Jin: Right, we don't know. We have no idea; w star. That's right.
>>: So how can you do this procedure?
>> Rong Jin: So all I'm saying is that if I can come up with a procedure in the way that in
the second step instead of targeting or approximating w star, what I want to target
[inaudible] is the delta w which is the difference between w star and w tilde. I know w
tilde, right?
>>: Right.
>> Rong Jin: So, therefore, I could actually construct an optimization problem in my
second round by somehow wiping out the w tilde. And then eventually the optimal
solution should be delta w. Right?
>>: So the second iteration?
>> Rong Jin: Right.
>>: You're talking about if your target would be the difference between, say, the original
prediction.
>> Rong Jin: The original solution.
>>: ...data set and the approximation using w tilde.
>> Rong Jin: That's right. So I construct the optimization in a way that makes delta w
the optimum. And then if I apply the same procedure I should get a guarantee that
the solution delta w tilde has only epsilon relative error compared to delta w.
Now if I combine the w tilde with the delta w tilde, which gives this new solution, then it's
easy to argue that the [inaudible] error you get is going to be of order epsilon squared. And
that's the logic we have. Do I make sense? Okay. The only
thing I really want to say here is that – I don't want to jump into the slide in detail
because I'm really running out of time. One of the key things I want to say here is that
even though I'm running this procedure for multiple iterations, I only do the random projection
once. So it's not the case that for every iteration I have to restart with a new random projection. As
I claimed before, random projection is one of the most expensive components in terms of
computation. So what you can claim is that if you run multiple iterations – call the result
w tilde t – then the difference between w tilde t and w star would be essentially epsilon to the power
of t multiplied by the norm of w star.
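A structural sketch of this extension (my illustration; `solve_residual` is a hypothetical helper that applies one round of dual random projection to the residual problem whose optimum is delta w = w star minus w tilde):

```python
import numpy as np

def iterative_dual_rp(X, y, lam, m, rounds, solve_residual, seed=0):
    """Iterative refinement: the random projection A is drawn ONCE and reused in
    every round; each round approximates the residual delta_w = w* - w_tilde and
    adds it back, so the recovery error shrinks by roughly a factor epsilon per
    round (epsilon^t * ||w*|| after t rounds)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))    # drawn once, never redrawn
    X_hat = X @ A
    w_tilde = np.zeros(d)
    for _ in range(rounds):
        # hypothetical helper: one dual-random-projection round targeting delta_w
        delta_w_tilde = solve_residual(X, X_hat, y, w_tilde, lam)
        w_tilde = w_tilde + delta_w_tilde
    return w_tilde
```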
So you do actually get geometric convergence by running this procedure. And you
only do the random projection once. So let me very quickly go through – basically we first use
a synthetic data set. In this case we have basically constructed the data matrix X by
multiplying A and B together, both being low-rank Gaussian random matrices. So here
is the comparison of the dual random projection algorithm to the naïve algorithm, which is
the first one that I described. As you can see, roughly, the gap between them is very large.
So we see a very quick reduction in the recovery error of dual random projection, while you don't see
much reduction in terms of the naïve [inaudible] recovery error. Running time – okay,
so maybe that's not – this is not terribly interesting. Then we actually
ran on a text classification problem using the rcv1 binary data set, so the
dimension is roughly forty-thousand-some features. And we used half for training, half for
testing. The horizontal axis here is the number of random projections; the vertical axis
is the accuracy. So we have three lines here. The top dotted line is the one using the
original data, the blue line is the naïve approach based on z star
directly, and the red line is the dual random projection. So as you can see, dual
random projection immediately gets much better results compared with the
simple random projection algorithm.
This again confirms that not only are we getting a good recovery error, we do
actually have better results in terms of the [inaudible] error. And this is with
respect to the testing data rather than the training data, just to be clear. Okay, so let me
just very briefly describe the work I've done with Alibaba. Interestingly enough, this is actually
not related to classification; it is related to an interesting problem about
display advertising. So in the scenario of displaying ads, basically somebody is
visiting the website; essentially the website is published by
Alibaba. For each page view the system will have a way to build a profile of
the user, and based on that profile it will try to identify the subset of ads that best
match the interests of the user. And that becomes the choice. As you can see this
is a very, very greedy algorithm. Anytime you come and visit, I always find a subset
of ads that best match the interests of the user. This is a very, very common
practice. Now this sounds like a very intuitive idea. There's only one caveat, somehow
overlooked by many people; that is, the budget.
So each advertiser actually has a limited budget in terms of presenting their
ads. As a result they can only present their ads to users a relatively small number
of times. If you take this greedy algorithm, one of the potential problems that can
result is that there will be a very big mismatch between supply and demand. So
when I say supply I mean the number of people that visit with a certain
interest. When I say demand I mean the type of interests that are specified by the
advertisers. So here's the example I gave out here. The horizontal axis here
is each different interest group you can think of. You may not
see it very clearly, but for each interest group
we have two bars. One bar is about the supply; that is, how many visiting people we
have with that particular interest. So this is, let's say, the red bar. And then we also have a
blue bar. The blue bar is the number of potential displays that advertisers
have budgeted for that particular interest group. Because they
only have a limited budget, they are only allowed a certain
number of displays.
So as you can see we actually have a very large gap between supply and demand.
For an interest or topic that a lot of people have, well, unfortunately
there's not much demand if we take the greedy algorithm, and vice versa. Okay? I don't
know if I make my point clear. So all I'm trying to say is that essentially each advertiser
specifies what kind of people they'd like to target. But because of the budget, they
cannot show the ads forever. They only have a small number of times to show the ads.
Now if you always choose the best match to advertise then you really will run into this
big problem: you have a huge mismatch between supply and
demand. Yes? I see the hand over here.
>>: [Inaudible] in multiple...
>> Rong Jin: Yes, yes. So in most cases you like shoes, you like outdoor things – yeah,
each person will actually have multiple interests. At the same time each advertiser could also target a
different set of people. It's not just one single label. But if we take this greedy
algorithm, these are the things we see. We see there's a very big mismatch between the
supply, which is essentially the number of people with the same interest, and the
demand, which essentially is the number of advertisers that would like to target those
people. And that's the problem they have. So I don't know if I – unfortunately I cannot
give very accurate descriptions because somehow I was not allowed to give a very
detailed discussion. But I just wanted to give you this high-level problem that I hope
people can appreciate.
Okay, so the way that we actually target this problem is, instead of taking this
simple greedy algorithm which is very convenient in computation, can we actually solve a
much bigger optimization problem? What we do is we really want to minimize the gap
between the supply and demand by actually pre-computing, for each user,
what would be the best subset of ads that we can display to him. Now
in this case I'm not just looking at whether the ads best fit
his interest. We also want to take into account that there's a budget constraint: basically
each advertiser only has a relatively small number of chances to display to the user.
So when we do this optimization, we really want to take into account not only the
interest of the user but also the gap, or the potential difference, between supply
and demand. Therefore, that becomes essentially an optimization problem.
We'd like to estimate a matrix A, which is actually a user-ad assignment
matrix. The number of users we handle is around 10 to the power of 8. The number of ads, m, we
handle is on the order of one million. And we'd like to compute it every day – at midnight
we'd like to compute this matrix A, and A will decide for each individual user which ads
he's going to see tomorrow. And that information – the A – will essentially become the
guidance for the next day's targeted advertising. Again, as I said, I cannot give the
details, so therefore I state everything at a very high level, unfortunately. Basically you
have this optimization problem, so you have some [inaudible] function f(A) and then you
have this regularizer. And you'd like to solve this optimization problem.
Now the only issue here is that the matrix is really, really large. And the disaster is that
we are actually only allowed one hour to compute the matrix. Yes?
>>: So there's no details but there has to be some kind of separable format?
>> Rong Jin: Yes. So there's a lot of summation here. This is what I'm saying. I was
talking to people and they clearly indicate I cannot give any details. So I can only put f
here.
>>: In order to apply [inaudible].
>> Rong Jin: Right, exactly. You're absolutely right. This actually has even more than
millions of summation terms, but I cannot give anything more than just f. So I hope that you
can bear with that constraint. But, nevertheless, I think everybody can understand we
really have a big variable to optimize. This A is a 10 to the power of 8 by 10 to the
power of 6 matrix. Even though this is a sparse matrix for sure, it's still a gigantic
monster we have to handle, and even worse, we only have one hour to actually
complete the optimization. And that's why we applied the dual random projection to it.
We roughly applied the same idea, except that now you really have two sides to handle:
not only the user side but also the advertisement side. Therefore, the
theory is slightly more complicated over here. Nevertheless, the same logic is actually
applied over here.
And it seems like it works very decently. Now it's actually online as a part of the
advertisement system. Okay, so I think I'm pretty much done with my talk. Sorry?
>>: So maybe just...
>> Rong Jin: Sure. Go ahead, please.
>>: That problem [inaudible].
>> Rong Jin: Oh, sorry. n is the number of users and m is the number of
advertisements. So there are on the order of a million advertisements.
>>: I mean the projected dimension. So with this A you do that to probably vectorize it.
This is a huge vector.
>> Rong Jin: No, no actually we're not vectorizing. We still maintain it's a matrix.
>>: Okay.
>> Rong Jin: So basically we're just taking this larger matrix and projecting it to a small
matrix.
>>: And then, what's the size of the small matrix?
>> Rong Jin: I'm not sure I'm allowed to tell you that detail. So for this reason I keep
everything on a very high-level. So I cannot give you the details. But nevertheless, this is
real. This is actually working now in the system.
>>: Are you only assigning one [inaudible]...
>> Rong Jin: No, multiple. Multiple.
>>: Multiple?
>> Rong Jin: Yeah.
>>: And when the user arrives there's the...
>> Rong Jin: You look at the tables.
>>: ...[inaudible] of deciding.
>> Rong Jin: That's right. That's right. Again, let's keep all the details away. I'll just tell
you at least computation-wise one fundamental issue is how to compute A. Apparently in
practice there is the more practical issue of how to convert this real number matrix into
the [inaudible]. So that's another very practical issue. Yes, I agree.
>>: Sorry, I have one comment.
>> Rong Jin: Sure.
>>: It seems that the method you use is [inaudible] to select a random projection instead
of computing random you compute some kind of weight, smart weight [inaudible]...
>> Rong Jin: Okay, so let me be clear: the random projection is indeed random. The
matrix is generated independently of the problem instance we handle. These two are
completely independent.
>>: The w then you constructed actually depends on the data.
>> Rong Jin: That's right. Indeed, the w constructed depends on it. There's no way you
stay away from the data to get something good enough.
>>: So when do you random projection do you consider – I mean I understand A is a
huge...
>> Rong Jin: Right.
>>: So do you consider using an attribute of user?
>> Rong Jin: Oh, okay. Good question. I think there are a couple of reasons [inaudible]. I
think it would help if we could use some meaningful features as the way to reduce
the dimension [inaudible]. Right? Right, right. Very good point. Unfortunately in our case
we probably have 40 percent of people that have a very small amount of visiting
information [inaudible]. There's like 50 percent of people, or around that ballpark, who have regular
visits to this site. But there's a large amount of people who do not have enough data
for you to really figure out their information, like demographic information, etcetera.
So you probably have to handle them – I think your strategy, which I haven't tried out,
may be able to handle those regular visitors. But for those [inaudible] you'd probably
have some trouble categorizing them well.
In our case we'd probably just do a [inaudible] optimization and forget about anything
else.
>>: Have you done any work extending this to non-linear methods?
>> Rong Jin: No. So we did actually try to look at the kernel case using the connection
with random Fourier features, and somehow – but I don't think it's something
dramatic. I mean, if you're following [inaudible], I think you can get the [inaudible] error
bound in some way, but it wasn't a particularly exciting way to do that. So we didn't think
this was a big enough piece of work to really explore.
>>: It seems interesting to potentially do a random Fourier projection to enlarge the
space and then kind of use that as the space...
>> Rong Jin: We do – I think this might still improve the bound, but I thought it was kind
of a trivial extension, something fairly naïve and not enough on its own, so we didn't
seriously pursue that.
>>: What about kind of empirical, whether that could be a strong empirical method
versus current [inaudible]?
>> Rong Jin: Actually I don't know. So we just don't feel too excited about it [inaudible].
Maybe it's my bias so, therefore, we don't...
>>: From a theoretical standpoint?
>> Rong Jin: Yeah, and therefore we don't even...
>>: Or from an empirical standpoint?
>> Rong Jin: We don't even bother to check out the empirical. But maybe that's the right
thing to do at this point. But we didn't do anything around that. Yes?
>>: So with such high dimension, I imagine that your random projection matrix will be
impractically big. So...
>> Rong Jin: Right. The question was asked before, yeah, so go ahead.
>>: It's kind of like [inaudible] whether you have tried some of those, like, [inaudible],
random projection [inaudible].
>> Rong Jin: Right, so actually our approach is not – as I said, we're actually looking
at sparse random projection. There's another line of research looking at sparse
random projection based on expander graph theory.
>>: I have one question.
>> Rong Jin: Please?
>>: About the formulation [inaudible].
>> Rong Jin: Okay. Formulation?
>>: I don't know [inaudible]...
>> Rong Jin: It pretty much means nothing. I mean as long as they forbid me to give it...
>>: So why do A – My problem is here.
>> Rong Jin: Okay.
>>: If you have to solve A, that is a given. Your user and your ads you have to
compute...
>> Rong Jin: Right. Right, good point.
>>: [Inaudible].
>> Rong Jin: So – Let me be sure I only give you the information I'm allowed to give
you. So what I can say...
>>: Because I don't know the exact properties; that's why I'm asking about it.
>> Rong Jin: So I can give you this rough idea. Basically there are two types of users.
One is [inaudible] old user. The other one is a new user. So those new users do not
have any activities in [inaudible], so this is the first time appearing in [inaudible]. I think
apparently if we do this – so the computation of A only accounts for "old users." For any
new user, there's no way: I don't know if he's going to come to [inaudible]; therefore, how
would I figure out what kind of ads to show? But fortunately, in terms of [inaudible], the new user
count is a very, very small amount.
>>: What about new ads?
>> Rong Jin: So this is a good question. In general for most advertisers, they actually
specify their budget ahead of time. So basically they will tell you how much money
they're going to put in. They're allowed to change it, but the chance is not very large. So
most of them will tell you how much they want to spend on advertising one day
before the ads are actually displayed. So you do actually have enough data to tell you
what kind of budget they have, or at least we can have a very reliable estimate of
it. So I agree: if things are completely open online, there's nothing you can do; we
could only do an online version of this. Yeah, another question over here.
>>: Have you done any work on multi-class classification?
>> Rong Jin: No. I think we only do the binary. So those are good open questions. I
really hope to see some of the results [inaudible].
>>: So most of the things that you talk about are on analysis from high dimension to low
dimension.
>> Rong Jin: Exactly.
>>: Now I have seen many applications where you have low-dimensional or dense
features and you can use random projection to go back into high dimension. I have seen
some applications of that [inaudible]. Then there is this style of learning, like a
kernel method [inaudible] project...
>> Rong Jin: I see. I see.
>>: ...random projection. Do you know any kind of analysis on this [inaudible]?
>> Rong Jin: Okay, so I think if you take a low-dimensional vector – and let's assume
everything stays linear; if it's non-linear then it's not a meaningful analysis
related to our work. So if you have low-dimensional data and you do this transform
mapping it to high dimension, if you're only doing everything on the linear side I'm
not sure how much you can get. In general I don't feel you can get anything out of it. So
for instance, for kernels, the mappings are not exactly random; they actually map by a
deterministic function, which is essentially the basis function of the kernel [inaudible]
space.
>>: That's not a linear point.
>> Rong Jin: Right. Right. So in terms of linear I don't see any motivation of mapping to
high dimension.
>>: And I'll also mention I do have non-linear but you can do the linear mapping first and
then you apply...
>> Rong Jin: Oh, I see.
>>: ...the [inaudible] function and from there you can do features in high dimension. And
people actually use that to do some useful things.
>> Rong Jin: Yeah, that sounds very interesting, except I really don't have any input
[inaudible]. Right, but I really don't know anything about that. But it sounds very appealing,
and I hope to see a good empirical [inaudible]. Okay, so any other questions? Oh,
[inaudible]?
>>: Hashing methods?
>> Rong Jin: Hashing. So there are two things...
>>: [Inaudible].
>> Rong Jin: Right, right. So, good question. I think the difference would be two things:
one is that, typically speaking, you're actually looking at a binary coding of the vector.
And second, hashing really is looking to preserve the geometry of the data
points you have. So it does not have anything to do with the learning procedure, in my
opinion. I'm not sure there are any very good results, for instance, if you're hashing a
low-dimensional vector into a binary discrete space – what is the resulting performance?
That part, I don't see people do. So our emphasis is really related to learning.
Basically you take the data and do a transform mapping it to a different space. Now how
would you guarantee your solution is almost as good as the solution you get from the
original data? So it's slightly different. Okay. All right, I hope that answered everybody's
questions. [Inaudible]...
[Audience clapping]