>> Dengyong Zhou: It's time to start. So we are very fortunate to have Doctor Rong Jin as our speaker today. Rong is a professor in the Department of Computer Science and Engineering at Michigan State University. He currently leads the Seattle lab of Alibaba. His research is focused on statistical machine learning and its application to big data analysis. He has published over 200 technical papers. He received his Ph.D. in Computer Science from Carnegie Mellon University in 2003, an NSF Career Award in 2006 and the best student paper award from COLT in 2012. And today he will talk about recovering the optimal solution by dual random projection. Actually this technique has been successfully applied to Alibaba's advertising systems. >> Rong Jin: All right, so thanks. Thanks for the introduction. This is actually our COLT work from last year, and in fact, somewhat to our surprise, we actually implemented it as part of Alibaba's display ad system and it seems to be working reasonably well. Okay, so just a slight clarification: I used to be at Michigan State; I have now actually joined Alibaba full-time as the leader of the Seattle lab. Okay, so the outline is: first of all I really want to highlight some of the challenges in solving optimization problems, particularly those involving high-dimensional data. Then I also want to look at the properties of random projection that become the key component of our theory. Then we look at our particular algorithm, which we call the dual random projection algorithm, and how it can be utilized as a means of solving the high-dimensional optimization problem. Then I will present our analysis, looking at different aspects of the data as well as of the solution, to see what kind of recovery guarantee can be obtained. We also have an extended, iterative version of the algorithm which eventually allows almost perfect recovery instead of just recovery within certain error bounds. And I'm going to look at experimental results based on UCI data sets as well as some highlights of our study utilizing this algorithm in the online advertisement system which is now deployed at Alibaba. Then I'll conclude the talk. So I think almost everybody knows the idea of random projection. This is an old and very effective idea. Basically the idea is that I have a very high-dimensional vector, say x, and I can map this high-dimensional vector into a relatively low dimension by multiplying it with a random matrix. In this case we are really focusing on the Gaussian random matrix; to a large degree you can generalize to other classes of random matrices. So this is a very simple idea but it's really a very powerful idea, because the supporting theoretical evidence for utilizing this random projection is the JL lemma. Basically the JL lemma says that if you have a finite set of data points, x1 to xn, then you can utilize a random projection whose size is on the order of log n over epsilon squared. With that number of random projections you actually can well preserve the distance between any two data points up to an error of epsilon. I think most people are familiar with it so I don't need to further elaborate on the JL lemma. So the idea of the JL lemma has actually been widely used in various tasks like classification, clustering, regression and manifold learning. This actually becomes the foundation for utilizing or justifying random projection in many learning tasks.
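For concreteness, here is a minimal numpy sketch of a Gaussian random projection and the JL-style distance preservation just described; all sizes, the 1/sqrt(m) scaling, and the synthetic data are illustrative assumptions, not values from the talk.

import numpy as np

# Hedged sketch: Gaussian random projection and JL-style distance preservation.
# Sizes and scaling below are illustrative assumptions, not values from the talk.
rng = np.random.default_rng(0)
d, m, n = 10000, 500, 100                     # original dim, projected dim, number of points
X = rng.standard_normal((n, d))               # synthetic data, one point per row

A = rng.standard_normal((d, m)) / np.sqrt(m)  # Gaussian random matrix, entries N(0, 1/m)
X_hat = X @ A                                 # each x_i mapped to m dimensions

def pairwise_sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = (Z ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T

iu = np.triu_indices(n, k=1)
ratio = pairwise_sq_dists(X_hat)[iu] / pairwise_sq_dists(X)[iu]
# With m on the order of log(n) / eps^2, these ratios concentrate around 1 +/- eps.
print("squared-distance ratios: min %.3f, max %.3f" % (ratio.min(), ratio.max()))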
Okay, so now let's focus on classification, or high-dimensional classification problems. This is actually the theme of this talk. So consider you have a lot of training examples – xi, yi – and each xi is a data point in a d-dimensional space, and let's assume d is a very large number. Your goal is to learn the classifier, f, which essentially maps any vector in the d-dimensional space to a discrete value plus 1 or minus 1. The straightforward way would be: let's just learn the classifier in the original space; therefore, you specify your loss function – here it's l – and then you basically minimize the regularized empirical loss, and that usually gives a solution with a reasonable generalization error bound. So w star is the solution that minimizes the regularized empirical loss. And then you can build the classifier just based on the sign of the dot product between x and w star. Now the only shortcoming of this simple approach is that as the dimension of w gets to be very high, solving this optimization problem is going to be very challenging. Okay, now one way to address the challenges that arise from high-dimensional data would be to try to reduce the dimensionality of the data. And the simplest way to do that dimension reduction would be through random projection. So here is the idea that actually has been utilized by many [inaudible]. What you can do is first generate this random matrix, a random Gaussian matrix in our context. So A is a random Gaussian matrix. This random Gaussian matrix allows you to map the d-dimensional vector xi into xi hat, which is only of m dimensions. The assumption is that m is much smaller than d; therefore, by this transform you eventually map very high-dimensional data points into very low-dimensional data points. Now with this mapping you have all the data points living in an m-dimensional space instead of the d-dimensional space. Now you can learn the classifier in this low-dimensional space; let's call it z. And this is an almost identical optimization problem, but the only difference is that now you only deal with data points of m dimensions instead of d dimensions. As a result, you can substantially reduce your computational cost in terms of solving the optimization problem. Now let's say z star is the optimal solution you get from this low-dimensional optimization problem. Then your final classifier would essentially just be based on z star. So for any data point x, you're going to map it to the low-dimensional space [inaudible] with the random matrix A. And then after this transform you can just directly take the dot product with z star, and the sign of this dot product becomes your classification decision. In other words you can also define w hat as the product of A and z star, and this w hat essentially becomes your classifier in the d-dimensional space. So this is a very commonly used methodology to deal with high-dimensional data. Now the question is: how good is this w hat? Can we have some way to characterize the performance of w hat? And that actually becomes the starting point of this work. The very first question you probably want to ask is, "When I learn this w star –" I'm sorry, "When I learn this w hat from this random projection solution, we would like to characterize its classification performance." Now this actually has been done by many people; I'll just cite one of the recent papers.
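As a concrete illustration, the baseline just described amounts to something like the following sketch; it is a hypothetical example using a squared (ridge-style) loss so the reduced problem has a closed-form solution, not the exact setup from the slides.

import numpy as np

# Hedged sketch of the naive primal approach: project the data, learn in the
# low-dimensional space, classify via w_hat = A z*. Squared loss is assumed
# here only so the solve is closed-form; sizes are illustrative.
rng = np.random.default_rng(0)
n, d, m, lam = 500, 10000, 200, 1e-2

X = rng.standard_normal((n, d))                      # synthetic training points
y = np.sign(X @ rng.standard_normal(d))              # synthetic +/-1 labels

A = rng.standard_normal((d, m)) / np.sqrt(m)         # random Gaussian matrix
X_hat = X @ A                                        # x_hat_i = A^T x_i

# z* = argmin_z  lam/2 ||z||^2 + 1/(2n) sum_i (y_i - z^T x_hat_i)^2
z_star = np.linalg.solve(X_hat.T @ X_hat / n + lam * np.eye(m), X_hat.T @ y / n)

# Classify a new point x by sign(z*^T A^T x), i.e. with w_hat = A z*.
w_hat = A @ z_star
x_new = rng.standard_normal(d)
print("prediction:", int(np.sign(w_hat @ x_new)))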
So in general the conclusion is that if the original data can be linearly separated with a margin, then you really have a good chance to get a good classifier by using the random projection. Actually the statement is this: if the data set can be linearly separated with a normalized margin gamma, then if you have a sufficiently large number of random projections, with high probability all the data points can still be linearly separated, by a slightly reduced margin. The claim would be that if you can find a very good classifier based on the original data points, then with high probability you can also find a good classifier with respect to the reduced dimensionality. Now apparently this statement is slightly [inaudible] because it only tells you the result with respect to the training data, but essentially there's another statement that allows you to generalize that to generalization error bounds. But again, everything is under this big assumption. The guarantee that w hat will have good performance holds only under the assumption that the data is linearly separable with a decent margin. So in the case that the data is not linearly separable, or the data can be separated only by a very small margin, everything they promised over here is going to fall apart. Therefore, it comes to the question: what happens if the data is not linearly separable? What happens if the data is separable but only with a very small or even zero margin? What kind of performance are we going to expect if we apply random projection to the data and learn the solution based on the randomly projected data? For some reason there are few answers in that territory, to my big surprise, because it appears to me that the answer already exists in the old literature of geometric functional analysis. So this is actually the [inaudible] question we ask. In the case that the data cannot be linearly separated, or can be separated only with a small margin, what you're really looking at is how well the solution w hat can somehow approximate the ideal solution that you learn from the original data points without any random projection. So the fundamental question would be: is there any chance this w hat could be a reasonable approximation of w star, which is learned from the original data points? Yes? >>: [Inaudible] approximation here do you mean similar classification accuracy or do you mean... >> Rong Jin: No, it's actually – It has to be a similar vector, because in the case that w star either makes a lot of mistakes or has a lot of data points that lie exactly on the boundary, then you really want to approximate w star itself rather than – You see what I'm saying, right? So it turns out the answer is absolutely no. In other words, no matter how you design the learning of the z star, there is just no chance that w hat eventually becomes a good approximation of w star. And this actually comes from primary results in geometric functional analysis. So the statement is as follows: consider a subspace E in the d-dimensional space. So E is a subspace in your d-dimensional space with dimensionality m. So the dimensionality of E is m. Let's consider any random subspace of dimension m within the space of dimension d.
So what you can say is that if you fix any point x in this d-dimensional space and then you look at how far away this point x is from the random subspace E, then the claim is that in general this fixed data point x will be very far away from any random subspace E as long as the dimensionality of E is small. A more accurate statement would be that the probability that the distance between the fixed data point x and a random subspace E is smaller than epsilon – this leading factor is roughly of order 1 if m is substantially smaller than d – is actually very, very small; it is epsilon to the power of d minus m. For instance, if we fix epsilon to be one-half, then you can roughly claim that with very large probability the difference between w hat and w star, measured by the [inaudible] norm, would actually be larger than some constant of order 1 multiplied by the norm of w star. >>: [Inaudible]. >> Rong Jin: Right. >>: So eventually you care about the [inaudible]. [Inaudible] to w star. >> Rong Jin: Right. So maybe I should elaborate on this point further. As I said, the original statement is all about the fact that you have a classifier such that all the data points are far away from its boundary. That's how, actually, you can get some guarantee. And the problem comes when the data points are actually very close to the boundary; that's where the performance gets very bad. So as a result, what happens is if you have this – Whoops. [Inaudible] Okay. So here's my w star. If you actually have the w hat, it turns out the angle between these two is substantially large. Then any points that are almost perpendicular to w star would actually get very unclear results. They could get a completely different prediction compared to w star. I don't know if I'm clear? My point is that if you consider any x, right, if this quantity is very small, then if you rotate w star by just a small angle you can very likely get a very different sign. As a result, all the good [inaudible] performance you can promise for w star would fall apart if you're using w hat. I don't know if I make my point clear. And, therefore, I really want to be sure that the difference between w star and w hat is going to be small enough in a way that I can handle even the data points that essentially have a small dot product with w star. Yes? >>: Just trying to understand your statement. So... >> Rong Jin: I'm sorry, which statement? This statement? >>: Yeah, yeah. This entire discussion. >> Rong Jin: Sure, sure, sure. >>: So say that the data is in a low-dimensional [inaudible] or low-dimensional subspace. So your distribution kind of – it's not using the entire space. In this case you can think of many w stars that would be very different from each other but would have the same [inaudible]. >> Rong Jin: Right. >>: So it's not kind of – The statement that you are making is correct in the sense that if I can find a w hat which is close to w star then I'm guaranteed to have [inaudible]. >> Rong Jin: Right, right. >>: But it's not necessarily the other way around. >> Rong Jin: I agree. I agree. So I think this is a good clarification. What I tried to raise is the point that previous studies essentially look at the scenario where all the x roughly have a large dot product with w star.
As a result, even if w hat is [inaudible] some angle away from w star, you still do not suffer a large [inaudible]. So all I try to argue is that if we're looking at more challenging cases, that is, many of the x essentially have a small dot product with w star, then those statements fall apart. Apparently there are other ways of addressing this question, but one more ambitious way to address it would be: if I can come up with a way to make a good approximation of w star by utilizing random projection, then I almost can avoid this kind of mistake. So I do actually take your point. I think this is a legitimate statement from you. I hope this issue is settled here? Okay? So, nevertheless, what I try to come back to over here is that essentially random projection, from the viewpoint of recovery, really does a very, very poor job. There is almost nothing you can do about it. This is because of the geometry of the high-dimensional space. In the high-dimensional space a low-dimensional subspace is just a small corner. We can almost guarantee that most of the points in this high-dimensional space will be far away from this random subspace, and there's nothing you can do about it. And, therefore, from that viewpoint a random projection-based approach is not going to work out. And actually this is what we see in practice. In the case that the data is really, really easy, random projection gives a very desirable outcome; however, as soon as you come to the territory of typical classification problems, random projection falls apart if you just use this simple idea. So I hope that message will be... >>: Can you just clarify the constants here? C, d and m are – d is the dimension, m is -? >> Rong Jin: All right, so d is the original dimension, a very high dimension. E is the subspace with dimensionality m, and this is a random subspace. So you pick a random subspace from the high-dimensional space of dimension d, and the dimensionality of this random subspace is m. So the point is that in general any fixed point will be far away from most random subspaces. Is that clear? Okay. All right, so that actually motivated us to ask the following question, that is: is there any way I can design an algorithm such that the resulting solution w would actually be a good approximation of the w star that you learn from the original data? Basically, our target is the w star that you can learn from the original high-dimensional space. What is convenient from the viewpoint of computation is to learn z star, which comes from a very low-dimensional space. Is there any procedure we can create such that I can construct a solution, based on z star, that essentially gives us a good approximation of w star? And that's the whole point of this talk. The idea actually is very simple. It's really based on a few simple observations. Just for the sake of completeness let me briefly talk about the primal-dual relationship, which I think most people are completely familiar with. Basically you have this original optimization problem, which lives in the primal space, and then you can essentially create a new version of the original problem by, for instance, writing the loss function in the form of its convex conjugate; that eventually turns this entire minimization problem in the primal space into a maximization problem in the dual space.
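For reference, one standard way to write this primal-dual pair under L2 regularization (the notation here is assumed for concreteness, not copied from the slides) is

\[
\min_{w \in \mathbb{R}^d}\; \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^n \ell\!\left(y_i\, x_i^\top w\right)
\quad\Longleftrightarrow\quad
\max_{\alpha \in \mathbb{R}^n}\; -\frac{1}{n}\sum_{i=1}^n \ell^*(\alpha_i) \;-\; \frac{1}{2\lambda n^2}\Big\|\sum_{i=1}^n \alpha_i y_i x_i\Big\|_2^2,
\]

with the two solutions related by \(w^\star = -\frac{1}{\lambda n}\sum_i \alpha_i^\star y_i x_i\) and, conversely, \(\alpha_i^\star = \ell'(y_i\, x_i^\top w^\star)\) when the loss is differentiable.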
So here the variable alpha is the dual variable with respect to w, which is the primal variable. So I think there's pretty much no need to elaborate on those points. Now again the relationship between the primal and the dual is very clear under the L2 norm regularization. So basically if I know the dual solution alpha star then I can easily write down the primal solution w star. On the other hand, if you know w star then you can easily write down the solution for the dual by taking the derivative of the loss function. So here, actually, I assume the loss function is differentiable [inaudible] just to make life easy. So let's look at the problem we have. What we have is that, in terms of the original problem, we can connect w with its dual. Well, we have the same connection for the problem after you have applied random projection. So for random projection you would again have the same idea. You have a primal solution now, z, which lives in the subspace Rm. And then, also based on z, you can derive its dual variable, which is alpha hat. So similarly as before: if I know alpha hat star then I will be able to derive z star, and vice versa. So those are common knowledge. Based on these two observations, the next most important observation we have is that it turns out the two dual solutions, alpha star and alpha hat star, are well connected. So D1 is the dual problem for the original dimensionality, and D2 is the dual problem after we perform the random projection. So you can see the only difference here is that between X transposed and X you're getting this intermediate matrix A A transposed. So the connection between these two is based on this very simple observation. That is, if you take the expectation of X transposed times A times A transposed times X, it turns out this expectation gives you exactly X transposed times X. In other words, this D2 version of the dual problem is more or less an approximate version of the original dual problem. So, therefore, there's a good chance that we can come up with some kind of analysis that allows us to say that under [inaudible] conditions, the optimal dual solution for the randomly projected data points would actually be a good approximation of the dual solution for the original data points. And that actually becomes the key of this algorithm. So here's the key observation. We can construct w star based on the dual solution alpha star. We can also compute z star, and that route goes through alpha hat star. And we also claim that alpha star and alpha hat star have a good approximation relationship. So if we piece this together, it actually gives a simple way to construct an approximation of w star based on the solution from the randomly projected data points. Yes? >>: Can you explain [inaudible] why is it easier to get the approximation of alpha than it is to get the approximation of the original [inaudible]? >> Rong Jin: So here I just basically stated the intuition so... >>: [Inaudible] >> Rong Jin: The analysis comes later. All I try to say is that, just by looking over here, because A A transposed essentially gives you a good – it's essentially an unbiased estimate of the identity operator – based on this observation we somehow can claim that D2 is an approximate version of D1. And as a result we should expect the solution of D2 to be a good approximation of the solution of D1, and that actually becomes our foundation.
I don't know if I make my points clear here. Okay, so the algorithm becomes – at least the highlighted structure becomes very simple. So what we do is: we first compute z star. This is the optimal solution for the randomly projected data points. Then we compute alpha hat star. This is actually the dual variable based on z star. And as we claimed, this alpha hat star will be a good approximation of alpha star – so I essentially can replace alpha star with alpha hat star, and then I actually can construct an approximation of w star. So that's the logic we have. Starting with z star, derive the dual variable, and use that dual variable as an approximate variable to construct w, which is the solution in the primal space. Okay? I guess this is just the algorithm with the highlights over here. Essentially it's just more details and an elaboration of the key steps I pointed out before. So let me just give some of the analysis results we have with respect to this simple algorithm. First of all, the first case is the one where we assume that X – the data matrix including all the data points you have – is actually a low-rank matrix; in that case you can have a very strong guarantee. So the claim is: r here is the rank of X. If m is actually large enough, which is roughly on the order of r log r, then with high probability the solution we construct based on the dual random projection algorithm – call it w hat – I'm sorry, w tilde – which is based on the random projection, and w star, the difference between these two would be small, in the sense that it would be smaller than roughly epsilon times the norm of w star. And epsilon is on the order of one over the square root of m. Okay? Roughly speaking, what we said is: if the number of random projections is larger than r log r, then you can claim that the solution computed by the dual random projection, w tilde, minus w star – that difference should be roughly less than or equal to the order of the square root of r over m times the norm of w star. Yes? >>: So what would the [inaudible]? >> Rong Jin: So there could be two possibilities: one is that lots of features are redundant; basically they are linear combinations of the others. The second would be that there are a lot of data points and eventually they'll be linearly dependent. Either way, I think. But this is a very, very strong restrictive condition; it is actually the easiest scenario to [inaudible]. In the case that the data matrix is of low rank, you actually can generate a small number of random projections and guarantee that the solution recovered by dual random projection would be very close to w star. And the closeness essentially is measured by one over the square root of m. Now, second, let's look at slightly more challenging cases. So what happens if X is not a low-rank matrix? What if X is a full-rank matrix? In this case we actually have to make an additional assumption, because without any assumption there's apparently no way you can claim the random projection algorithm [inaudible]. In this case we introduce a variable called rho. This rho essentially measures how well, or how concentrated, w star is in the top subspace of X. So the measure is as follows. [Inaudible] Here U r bar: this is actually a matrix that includes the smallest d minus r left singular vectors.
So basically, by multiplying U r bar transposed with w star, essentially I'm going to project this w star vector onto the subspace spanned by the lowest singular vectors of the data matrix X. So the claim is that this measure tells you how small that projected vector is compared to the original length of the vector w star. A small rho implies that w star is more concentrated on the top eigen-subspace of X. So that's the quantity we introduce. This is kind of [inaudible], so I think it's better to just look at a simplified version – sorry, a simplified version of the results. So the result states – let me go here – basically it says that if the (r plus 1)-th singular value of the data matrix X is small enough, and also the number of random projections is large enough, then the difference between the solution constructed by dual random projection, w tilde, and the optimal solution, w star, will eventually be smaller than this quantity. Okay? As you can see, this quantity has two components. One is essentially just the square root of r over m; this is identical to the previous case where we assumed the matrix is of low rank. The second actually [inaudible] of rho. So the more concentrated w star is in the top eigen-space of X, the smaller this quantity will be. So if rho is small enough, then we can still claim that the difference between the [inaudible] vector and the optimal one would be smaller than 1 over the square root of m [inaudible] over [inaudible]. Okay? So this is the second result. Even in the case that the data matrix is of full rank, as long as the solution is somehow reasonably concentrated on the subspace spanned by the top eigenvectors of the data matrix X, you still get a good guarantee by running this procedure. Okay? Our last result is related to sparsity. In the case that the optimal solution is sparse, you can still also get some promise out of this analysis. Again, here we have to make an additional assumption on top of the sparsity. So here we introduce – sorry, this is an error; there shouldn't be a bar here. Let me use S to specify the support set of w star. As I said, we assume w star is a sparse vector supported by a small number of components. So S is the support set of w star. What we introduce is the quantity eta; it is a measure of how concentrated the data matrix X is on those features that are captured by w star. This eta is based on the difference between X transposed times X and Xs transposed times Xs, where Xs is the data matrix that only includes those features that appear in your optimal solution. So this quantity essentially measures how well the data matrix is concentrated on those features that appear in the final optimal solution w star. And the expectation would be: the smaller this eta is, the higher the concentration of the data matrix on those useful features, and as a result you get a stronger guarantee in the results. So here are the results: in the case that this eta is small enough, and the number of random projections is large enough – by the way, the small s is the size of the support set capital S – then you can still claim that the difference between w tilde and w star is going to be smaller than the square root of small s over m.
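Before the questions, here is a minimal end-to-end sketch of the dual random projection recipe described above, on synthetic low-rank data and with a squared (LS-SVM-style) loss so every step is closed-form; this is a hedged illustration of the recipe under those assumptions, not the exact algorithm or constants from the paper.

import numpy as np

# Hedged end-to-end sketch of dual random projection with squared loss
# l(u) = (u - 1)^2 / 2 on the margin u = y_i x_i^T w (for +/-1 labels this is
# the same objective as ridge regression on y). Sizes are illustrative.
rng = np.random.default_rng(0)
n, d, r, m, lam = 300, 2000, 10, 400, 1e-2           # r = rank of the data matrix

X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r)
y = np.sign(X @ rng.standard_normal(d))

def solve_primal(Z, y, lam):
    """argmin_v lam/2 ||v||^2 + 1/(2n) sum_i (y_i - v^T z_i)^2 (closed form)."""
    n, k = Z.shape
    return np.linalg.solve(Z.T @ Z / n + lam * np.eye(k), Z.T @ y / n)

w_star = solve_primal(X, y, lam)                     # reference solution in R^d

# Step 1: random projection and the low-dimensional solve.
A = rng.standard_normal((d, m)) / np.sqrt(m)
X_hat = X @ A
z_star = solve_primal(X_hat, y, lam)

# Step 2: dual variables from z*: alpha_i = l'(y_i x_hat_i^T z*) = y_i x_hat_i^T z* - 1.
alpha_hat = y * (X_hat @ z_star) - 1.0

# Step 3: plug alpha_hat into the primal-dual mapping w = -(1/(lam n)) sum_i alpha_i y_i x_i,
# using the ORIGINAL data matrix -- this is where the original data re-enters.
w_tilde = -(X.T @ (alpha_hat * y)) / (lam * n)

# Naive recovery for comparison: w_hat = A z*.
w_naive = A @ z_star
rel = lambda w: np.linalg.norm(w - w_star) / np.linalg.norm(w_star)
print("relative recovery error -- naive A z*: %.3f, dual reconstruction: %.3f"
      % (rel(w_naive), rel(w_tilde)))

On low-rank data of this kind the dual reconstruction is typically far closer to w star than the naive A z* recovery, which is the behavior the low-rank result above describes.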
So this roughly summarizes most of the analysis we have regarding this dual random projection. I don't know if there are any questions about...? >>: So back to your systems, what is this constant value? >> Rong Jin: We have no idea, so we can only do trial and error unless we measure the entire eigen-space. Unless we have, for instance – in this case, unless we know the w star, I cannot tell you up front what the eta would be. >>: I have a question. Sorry. >> Rong Jin: No, please go ahead. >>: Again, we started out by looking at the kind of naïve random projection and you argued that this will return a result which is far away from w star? >> Rong Jin: Right. >>: But at the end of the day we care only about the angle between them, and I'm just trying to understand the difference between this technique and the naïve technique. >> Rong Jin: Naïve in what sense? >>: The one that doesn't work. The one which does a projection in the primal. >> Rong Jin: Right. >>: So given that the distance will be large, the large distance could be attributed to a big angle or it could be just a different length. >> Rong Jin: Right, right. >>: So what we care about is the angle. >> Rong Jin: Yes, in [inaudible] – I think this is slightly clearer. In terms of [inaudible] classification we do only care about the angle, but if we're looking at a general [inaudible] function then the magnitude of the w does – so if somebody gave you a [inaudible] function then the magnitude of w does actually [inaudible] the outcome. I don't know if... >>: [Inaudible] proved that indeed for the naïve primal result the w star and the w you get are very, very far apart but the angle between them is tiny. So it's only... >> Rong Jin: Oh, oh, I see. Okay... >>: [Inaudible] the argument [inaudible]. I don't think this to be true; [inaudible]. >> Rong Jin: Okay. Yes, yes, so let's first of all be clear that – maybe I should make the statement more precise. So actually I specifically say that x is a vector of unit length. So I completely wipe out the size of the vector x in terms of the distance. In this case the distance really is captured by the angle. Therefore, the statement here indeed relates to the angle you are concerned about. So as a result the original solution is [inaudible] – The first statement I really want to emphasize is that by utilizing the random projection you are guaranteed to get a solution which is at a very large angle from the desired solution you are looking for. >>: So w hat here is not the solution to the [inaudible] problem; it's a normalized version of that? >> Rong Jin: Yes, yes. In this case it's really the angle. So we are looking at a fixed point with unit length and then looking at what would be the best approximation in the subspace with respect to that. Yes? >>: Is it possible – So you showed from your analysis [inaudible] using this dual method you can minimize – you can get a much smaller distance between w hat and w star. Can we tie that back to the error? >> Rong Jin: So you mean generalization error? >>: Yeah. >> Rong Jin: Yes, actually thank you very much for pointing that out. Our [inaudible] paper does actually have a generalization error result, but somehow here I keep emphasizing the recovery, so I did not put that over here. I mean, my title starts with recovery, so somehow that was not my main motivation. Yes? >>: In the algorithm w tilde is a deterministic function of z star.
>> Rong Jin: W tilde is a deterministic function of z star. >>: So in terms of, like, classification error, the classifier from z star and the one from w tilde would be equivalent. >> Rong Jin: Right. Sorry, say it again. >>: So because w tilde is a deterministic function... >> Rong Jin: I think [inaudible] then I really don't want... >>: ...of z star... >> Rong Jin: ...to jump into – Yeah, go ahead. >>: Yeah, because w tilde is determined by z star. >> Rong Jin: Right. >>: And if you only care about the classification error then you compare those two. >> Rong Jin: Okay, okay. So I think the statement here is – I see the [inaudible]. I agree with you [inaudible]. We have a z star and then we are getting w tilde, and going from z star to w tilde is a deterministic function – I am slightly concerned with that statement because this is under the assumption that you fix the random projection, because I do actually have to utilize the random projection as part of the reconstruction of w tilde. Maybe I should really elaborate on that. >>: Yeah, we can fix the random projection and then the relation is deterministic. And the classifier defined by z star and the classifier defined by w tilde are equivalent. >> Rong Jin: What do you mean by equivalent? That's the statement I don't know. So what I want to say is: basically we build a transform that takes z star and transforms it into w tilde. And all I'm trying to say is that this transform depends on the random projection you have. That's the reason I'm really concerned about the statement you make, "deterministic," because when you say deterministic you somehow hint that the transform is independent of the random projection [inaudible] you make. So all I'm trying to say is that the transform I build depends on the random projection that is introduced. And as a result I'm not sure what exactly you're trying to say equivalent means? >>: By equivalence I mean if you take any input x and you put it into the classifier defined by the random projection and z star, or the classifier defined by w tilde, do they always output the same... >> Rong Jin: I think this cannot be true; otherwise, why do we have deep learning? Why do we have kernel learning? They're all based on the same set of input vectors. If your statement were right then any algorithm would be equivalent as long as they're seeing the same input. >>: But he's only pointing to the dual parameter spaces you [inaudible]. >> Rong Jin: But z star is in a very small parameter space, only of m dimensions. [Inaudible] is a very high-dimensional space. Think about deep learning, right? They take this very low-dimensional space – or kernel learning – and they eventually generate a much larger number of features, and that capacity enables them to improve the classification accuracy. So I'm not sure I can accept the equivalence. >>: So z star is obtained by projecting your [inaudible] data into a low-dimensional space. >> Rong Jin: Right. >>: So it only contains the information of the low-dimensional space. And because w tilde is a function of z star... >> Rong Jin: I totally understand. But the dual variables we utilize depend on the original data points. So after I get z star I actually generate the dual, and the dual depends on the original data points. Oh, sorry. That's not what I'm saying. The reconstruction depends on the original data points. >>: Okay. >> Rong Jin: So that's the difference. Everything over here, I'm really working in the subspace.
But when I reconstruct w tilde, I actually take – basically I assume that this alpha hat star is a good approximation of the dual variable. And I take that and basically just plug it into the relationship between the primal and the dual. But here now I'm looking at the entire data matrix. >>: I see. >> Rong Jin: So, sorry. I think I shouldn't have skipped that part, but I looked at the time and it's really limited. But this is an excellent question. I hope that clarification makes sense. So indeed we do actually look at that part, the original data. And that makes a difference. If I worked only within the random subspace then I should actually suffer from the same pessimistic results that have been stated in geometric functional analysis. But we did actually go one step further by utilizing the original data matrix as part of the reconstruction procedure. Yes, please? >>: How do you choose m? [Inaudible] >> Rong Jin: Right. Good question. I actually don't have any good answer. I think all I'll try to say – I agree with you. One thing we don't know is the property of the data, the property of w star. If we knew the property of w star, probably I would already know w star. So... >>: [Inaudible] small m [inaudible]. >> Rong Jin: Right. I agree. I think there's another step, which is how to make it practical or more effective rather than just scanning through the entire range of m. Maybe there is some way you can quickly get a sense of the right number of random projections, which I really don't know. >>: But you use it in application? >> Rong Jin: Right. We actually do cross-validation. >>: Okay. >> Rong Jin: But it's in a naïve way, so [inaudible]. It's not the most ideal way. I don't know if there are any other questions. Yes? >>: Did you try some faster Johnson-Lindenstrauss transformation? I mean, all the results on your slides are based on the Gaussian random projection. >> Rong Jin: Right, right. Excellent, excellent. Yes. >>: So some [inaudible]... >> Rong Jin: This is actually an excellent question, because in fact, unfortunately, I cannot elaborate in more detail on the work I've done with [inaudible]. In those cases doing the random projection actually becomes the most computationally heavy part; actually the rest is a small story. So we did actually go further – I think recently we further proved that – one way to do that is by using the faster... >>: Faster [inaudible]. >> Rong Jin: Right. But also there's another line of research which is actually coming out of the [inaudible]. They're actually using sparse matrices for [inaudible]. We actually explored that theory and we can show that in the case that the solution is sparse – now it's really looking at the sparse case – then even with the sparse random projection things work out. In practice that seems to be the case. But those are excellent questions. >>: And I have another one, sorry. But, otherwise, for your results for the exactly sparse w star, I'm wondering, for that model did you consider L1 regularization or just L2? >> Rong Jin: Again, excellent question. In fact with L2 you really cannot get an exactly sparse solution. We do actually have results for the approximately sparse case, but I thought the slides got too dense on the results, so I actually took them out. So we do have very similar results. It's pretty much like the idea of this one. We actually get an additional [inaudible] coming out of the approximate sparsity [inaudible].
In the case that the solution can be very well approximated by a sparse solution, I can claim that roughly the same guarantee holds. I don't know if there are any other questions? Okay, so I know that I'm pretty much running out of time. We did one more version, because all the previous results only give you an approximate solution, with an approximation error on the order of 1 over the square root of m. And we actually further designed an algorithm which allows you to reduce the error in a dramatic way. Basically the logic is very simple. If I run this procedure for one step, then roughly I can reduce the error by a factor of epsilon. So it's very appealing if I can actually run this procedure multiple times, each time getting an epsilon reduction; then I can have this exponential reduction in error. Indeed that's the case, right? So consider that in the second step we actually – instead of targeting w star, I'm going to target the difference between w star and w tilde, and that becomes our target for approximation. Now if I apply a similar algorithm to approximate delta w, then intuitively I should get the same guarantee, which is that delta w tilde, the approximate version of delta w, would differ from delta w by an order of epsilon times delta w. Now if you construct the new solution by adding w tilde to delta w tilde, then you can easily see that this new solution will actually have an epsilon-squared error. Therefore, it's very easy to run this procedure multiple times, getting a geometric reduction in the approximation error. Yes? >>: Something about [inaudible]. So in order to do this [inaudible] random projection you [inaudible] but you use data... >> Rong Jin: To recover. >>: To recover. >> Rong Jin: Right. >>: But you don't really know the [inaudible] between w star... >> Rong Jin: Right, we don't know. We have no idea what w star is. That's right. >>: So how can you do this procedure? >> Rong Jin: So all I'm saying is that I can come up with a procedure such that in the second step, instead of targeting or approximating w star, what I want to target [inaudible] is the delta w, which is the difference between w star and w tilde. I know w tilde, right? >>: Right. >> Rong Jin: So, therefore, I can actually construct an optimization problem in my second round by somehow subtracting out the w tilde. And then the optimal solution should eventually be delta w. Right? >>: So the second iteration? >> Rong Jin: Right. >>: You're talking about if your target would be the difference between, say, the original prediction. >> Rong Jin: The original solution. >>: ...data set and the approximation using w tilde. >> Rong Jin: That's right. So I construct the optimization in such a way that delta w is its optimum. And then if I apply the same procedure I should get a guarantee that the solution delta w tilde has only epsilon relative error compared to delta w. Now if I combine w tilde with delta w tilde, which gives this new solution, then it's easy to argue that the [inaudible] error you get is going to be of order epsilon squared. And that's the logic we have. Does that make sense? Okay, then you have this iteration. The only thing I really want to say here is that – I don't want to jump into the slide in detail because I'm really running out of time – one of the key things I want to say here is that even though I'm running this procedure for multiple iterations, I only do the random projection once.
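As a hedged sketch of this multi-step idea, here is the same squared-loss toy setup, where at each step the low-dimensional problem targets the current difference delta_w = w* - w and the single random matrix A is reused throughout; this is only one natural instantiation for the squared loss under those assumptions, not necessarily the exact formulation from the paper.

import numpy as np

# Hedged sketch of iterative refinement for the squared-loss case: at each step,
# solve a low-dimensional problem whose optimum approximates delta_w = w* - w,
# reconstruct the correction through the dual, and reuse ONE random projection A
# for every iteration. Sizes are illustrative.
rng = np.random.default_rng(0)
n, d, r, m, lam, T = 300, 2000, 10, 400, 1e-2, 5

X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) / np.sqrt(r)
y = np.sign(X @ rng.standard_normal(d))
S, b = X.T @ X / n, X.T @ y / n                     # sufficient statistics of the loss

w_star = np.linalg.solve(S + lam * np.eye(d), b)    # exact reference solution

A = rng.standard_normal((d, m)) / np.sqrt(m)        # generated once, reused
XA = X @ A

w = np.zeros(d)
for t in range(T):
    # Low-dimensional problem targeting delta_w (squared loss):
    #   min_z lam/2 ||z||^2 + lam w^T A z + 1/(2n) ||XA z - (y - X w)||^2,
    # whose optimality condition reduces to the linear system below.
    g = lam * w + S @ w - b                         # primal gradient at current w
    z = np.linalg.solve(lam * np.eye(m) + XA.T @ XA / n, -A.T @ g)
    # Dual-style reconstruction of the updated solution in the ORIGINAL space.
    alpha = XA @ z - (y - X @ w)
    w = -(X.T @ alpha) / (lam * n)
    rel = np.linalg.norm(w - w_star) / np.linalg.norm(w_star)
    print("iteration %d: relative error %.2e" % (t + 1, rel))

On low-rank data like this, the printed error typically shrinks by roughly a constant factor per iteration, which is the geometric reduction being described.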
So it's not the case that for every iteration I have to restart with a new random projection. As I claimed before, the random projection is one of the most expensive components in terms of computation. So what you can claim is that if you run multiple iterations, producing what we call w tilde t, then the difference between w tilde t and w star would be essentially epsilon to the power of t multiplied by the length of w star. So you do actually get geometric convergence by running this procedure. And you only do the random projection once. So let me very quickly go through – basically we first use a synthetic data set. In this case we basically constructed the data matrix X by multiplying A and B together, both being Gaussian random matrices, so that X is low-rank. Here is the comparison of the dual random projection algorithm to the naïve algorithm, which is the first one that I described. As you can see, the gap between them is very large: we see a very quick reduction in the recovery error for dual random projection, while you don't see much reduction in the naïve [inaudible] recovery error. Running time – okay, maybe that's not terribly interesting. Next we actually ran on a text classification problem using the rcv1 binary data set, so the dimension is roughly forty-some thousand features. And we used half for training, half for testing. The horizontal axis here is the number of random projections; the vertical axis is the accuracy. So we have three lines here. The top dotted line is the one using the original data, the blue line is the naïve approach based on just z star directly, and the red line is the dual random projection. As you can see, dual random projection immediately gets much better results compared with the simple random projection algorithm. This again confirms that not only are we getting a good recovery error, we actually also get better results in terms of the [inaudible] error – with respect to the testing data rather than the training data, just to be clear. Okay, so let me just very briefly describe the work I've done with Alibaba. So interestingly enough, this is actually not related to classification; it is related to an interesting problem about display advertising. So in the scenario of displaying ads, basically somebody is visiting the [inaudible] website; essentially the website is published by Alibaba. For each [inaudible] page view the system will have a way to build a profile of the [inaudible] user, and based on that profiling it will try to identify the subset of ads that best match the interests of the user. And those become the choice. As you can see, this is a very, very greedy algorithm: anytime you come and visit, I always find the subset of ads that best match the interests of the user. This is a very, very common practice. Now this sounds like a very intuitive idea. There's only one caveat which is somehow overlooked by many people; that is, the budget. Each advertiser actually has a limited budget in terms of presenting their ads. So as a result they can only present their ads to users a relatively small number of times. If you take this greedy algorithm, one of the potential problems that can result is that there will be a very big mismatch between supply and demand. When I say supply I mean the number of people visiting [inaudible] with a certain interest.
When I say demand I mean what type of interests are specified by the advertisers. So here's the [inaudible] I actually put up here. The horizontal axis here is each of the different interest groups you can think of. And then – you may not see it very clearly, but for people who can see this end clearly – for each interest group we have two bars. One bar is about the supply; that is, how many visiting people we have with that particular interest. Let's say this is the red bar. And then we also have a blue bar. The blue bar is the number of advertisers – or rather the number of potential displays that the advertisers can budget for that particular interest group. Because they only have a limited budget, they are only allowed a certain number of displays. So as you can see, we actually have a very large gap between supply and demand. For the interests or topics that a lot of people have, unfortunately there is not that much on the other side if we take the greedy algorithm, and vice versa. Okay? I don't know if I make my point clear. So all I'm trying to say is that essentially each advertiser specifies what kind of people they'd like to target. But because of the budget, they cannot show the ads forever; they only have a small number of times to show the ads. Now if you always choose the best match to advertise, then you really will run into this big problem: you have a huge mismatch between supply and demand. Yes? I see the hand over here. >>: [Inaudible] in multiple... >> Rong Jin: Yes, yes. So in most cases you like shoes, you like outdoor things – yeah. Each person will actually – at the same time, each advertiser could also target a different set of people. It's not just one single labeling. But if we take this greedy algorithm, these are the things we see. We see there's a very big mismatch between the supply, which is essentially the number of people with a certain interest, and the demand, which is essentially the number of advertisers that would like to target [inaudible] those people. And that's the problem they have. So I don't know if I – unfortunately I cannot give very accurate descriptions because somehow I was not allowed to give a very detailed discussion. But I just wanted to give you this high-level problem that I hope people can appreciate. Okay, so the way that we actually target this problem is: instead of taking this simple greedy algorithm, which is very convenient computationally, can we actually solve a much bigger optimization problem? What we do is we really want to minimize the gap between supply and demand by actually pre-computing, for each user, what would be the best subset of ads that I can display to him. Now in this case I'm not just looking at whether the ads best fit his interests; we also want to take into account that there's a budget constraint – basically, each advertiser only has a relatively small number of chances to display to users. So when we do this optimization, we really want to take into account not only the interests of the user but also the gap, or the potential difference, between supply and demand. Therefore, that essentially becomes an optimization problem. We would like to estimate a thing we call the matrix A, and this is actually a user-ad assignment matrix. The number of users we handle is around 10 to the power of 8. The number of ads we handle, m, is on the order of one million.
And we compute it every day – at midnight we compute this matrix A, and this A will decide, for each individual user, which ads he's going to see tomorrow. And that information – the A – will essentially become the guidance for displaying the next day's targeted advertisements. Again, as I said, I cannot give the details, so, unfortunately, I've stated everything at a very high level. Basically you have this optimization problem: you have some [inaudible] function f of A, and then you have this regularizer, and you'd like to solve this optimization problem. Now the only issue here is that the matrix is really, really large. And the disaster is that we are actually only allowed to compute the matrix within one hour. Yes? >>: So there are no details, but there has to be some kind of separable form? >> Rong Jin: Yes. So there's a lot of summation here. This is what I'm saying. I was talking to people and they clearly indicated I cannot give any details, so I can only put f here. >>: In order to apply [inaudible]. >> Rong Jin: Right, exactly. You're absolutely right. This is actually even more than millions of summations, but I cannot give anything more than just f. So I hope you can bear with that constraint. But, nevertheless, I think everybody can understand we really have a big variable to optimize. This A is a 10-to-the-power-of-8 by 10-to-the-power-of-6 matrix. Even though this is a sparse matrix for sure, it's still a gigantic monster we have to handle, and even worse, we only have one hour to actually complete the optimization. And that's why we applied the dual random projection to it. We roughly applied the same idea, except that now you really have two sides to handle: not just the user side but also the advertisement side. Therefore, the theory is slightly more complicated over here. Nevertheless, the same logic applies, and it seems like it works very decently. Now it's actually online as a part of the advertisement system. Okay, so I think I'm pretty much done with my talk. Sorry? >>: So maybe just... >> Rong Jin: Sure. Go ahead, please. >>: That problem [inaudible]. >> Rong Jin: Oh, sorry. N is the number of users and m is the number of advertisements. So there are on the order of a million advertisements. >>: I mean the projected dimension. So to do that with this A you probably vectorize it; this is a huge vector. >> Rong Jin: No, no, actually we're not vectorizing. We still maintain it as a matrix. >>: Okay. >> Rong Jin: So basically we're just taking this large matrix and projecting it to a small matrix. >>: And then what's the size of the small matrix? >> Rong Jin: I'm not sure I'm allowed to tell you that detail. So for this reason I keep everything at a very high level; I cannot give you the details. But nevertheless, this is real. This is actually working in the system now. >>: Are you only assigning one [inaudible]... >> Rong Jin: No, multiple. Multiple. >>: Multiple? >> Rong Jin: Yeah. >>: And when the user arrives there's the... >> Rong Jin: You look at the tables. >>: ...[inaudible] of deciding. >> Rong Jin: That's right. That's right. Again, let's keep all the details away. I'll just tell you that at least computation-wise one fundamental issue is how to compute A. Apparently in practice there is the more practical issue of how to convert this real-valued matrix into the [inaudible]. So that's another very practical issue. Yes, I agree. >>: Sorry, I have one comment. >> Rong Jin: Sure.
>>: It seems that the method you use is [inaudible] to select the random projection – instead of computing a random one you compute some kind of weight, a smart weight [inaudible]... >> Rong Jin: Okay, so let me be clear: the random projection is indeed random. The matrix is generated independently of the problem instance we handle. These two are completely independent. >>: The w you then construct actually depends on the data. >> Rong Jin: That's right. Indeed, the constructed w depends on it. There's no way you can stay away from the data and still get something good enough. >>: So when you do the random projection do you consider – I mean, I understand A is huge... >> Rong Jin: Right. >>: So do you consider using attributes of the user? >> Rong Jin: Oh, okay. Good question. I think there are a couple of reasons [inaudible]. I think there would be more benefit if we could use some meaningful features as the way to reduce the dimension [inaudible]. Right? Right, right. Very good point. Unfortunately, in our case we probably have 40 percent of people who have a very small amount of visiting information [inaudible]. There's like 50 percent of people, or around that ballpark, who have regular visits to the site. But there's a large number of people who do not have enough data for you to really figure out their information, like demographic information, etcetera. So you probably have to handle them – I think your strategy, which I haven't tried out, may be able to handle those regular visitors, but for those [inaudible] you'd probably have some trouble categorizing them well. In our case we probably just do a [inaudible] optimization and forget about anything else. >>: Have you done any work extending this to non-linear methods? >> Rong Jin: No. We did actually try to look at the kernel case using the connection with [inaudible] random Fourier features, and somehow – but I don't think it's anything dramatic. I mean, following [inaudible], I think you can get the [inaudible] error bound in some way, but it wasn't a particularly exciting way to do it. So we didn't think this was a big enough piece of work to really explore. >>: It seems interesting to potentially do a random Fourier projection to enlarge the space and then kind of use that as the space... >> Rong Jin: We do – I think this might still improve the bound, but I thought it was kind of a trivial extension and in no way enough, so we didn't seriously pursue that. >>: What about kind of empirical – whether that could be a strong empirical method versus current [inaudible]? >> Rong Jin: Actually I don't know. We just don't feel too excited about it [inaudible]. Maybe it's my bias, so, therefore, we don't... >>: From a theoretical standpoint? >> Rong Jin: Yeah, and therefore we don't even... >>: Or from an empirical standpoint? >> Rong Jin: We didn't even bother to check it out empirically. But maybe that's the right thing to do at this point. But we didn't do anything around that. Yes? >>: So with such high dimension, I imagine that your random projection matrix will be impractically big. So... >> Rong Jin: Right. That question was asked before, yeah, so go ahead. >>: It's kind of like [inaudible] whether you have tried some of those, like, [inaudible], random projection [inaudible]. >> Rong Jin: Right, so actually our approach is not – as I said, we are actually looking at sparse random projections. There's another line of work looking at sparse random projection; we could explain it, right, with graph theory. >>: I have one question. >> Rong Jin: Please?
>>: About the formulation [inaudible]. >> Rong Jin: Okay. The formulation? >>: I don't know [inaudible]... >> Rong Jin: It pretty much means nothing, I mean, as long as they forbid me to give it... >>: So why do you solve for A – my problem is here. >> Rong Jin: Okay. >>: If you have to solve for A, what is given? Your users and your ads – you have to compute... >> Rong Jin: Right. Right, good point. >>: [Inaudible]. >> Rong Jin: So – let me be sure I only give you the information I'm allowed to give you. So what I can say... >>: Because I don't know the exact properties; that's why I ask about it. >> Rong Jin: So I can give you this rough idea. Basically there are two types of users. One is the [inaudible] old user; the other one is the new user. Those new users do not have any activities in [inaudible], so this is the first time they appear in [inaudible]. So apparently the computation of A only accounts for "old users." For any new user, there's no way – I don't know if he's going to come to [inaudible]; therefore, how would I figure out what kind of ads to show? But fortunately, in terms of [inaudible], the new user count is a very, very small amount. >>: What about new ads? >> Rong Jin: So this is a good question. In general, most advertisers actually specify their budget ahead of time. So basically they will tell you how much money they're going to put out. They're allowed to change it, but the chance is not very large. So most of them will tell you how much they want to put on their advertisement one day before actually displaying the ad. So you do actually have enough data to tell you what kind of budget they have, or additionally we can have a very reliable estimate of it. So I agree. If things are completely open online, there's nothing you can do; we could only do an online version of things. Yeah, another question over here. >>: Have you done any work on multi-class classification? >> Rong Jin: No. I think we only do the binary case. So those are good open questions. I really hope to see some of the results [inaudible]. >>: So most of the things that you talk about are analyses going from high dimension to low dimension. >> Rong Jin: Exactly. >>: Now I have seen many applications where you have low-dimensional or dense features and you use random projection to go back into high dimension. I have seen some applications of that [inaudible]. Then there is this style of learning. So like it's a kernel method [inaudible] project... >> Rong Jin: I see. I see. >>: ...random projection. Do you know any kind of analysis on this [inaudible]? >> Rong Jin: Okay, so I think if you take a low-dimensional vector – and let's assume everything stays linear; if it's non-linear then it's not a meaningful analysis related to our work – so if you have low-dimensional data and you do this transform mapping it to a high dimension, if you're only doing everything on the linear side I'm not sure how much you can get. In general I don't feel you can get anything out of it. For instance, for kernels the mappings are not exactly random; they're actually mapping by a deterministic function, which is essentially the basis function of the kernel [inaudible] space. >>: That's not a linear map, though. >> Rong Jin: Right. Right. So in terms of linear maps I don't see any motivation for mapping to a high dimension. >>: And I'll also mention that I do mean non-linear: you can do the linear mapping first and then you apply... >> Rong Jin: Oh, I see. >>: ...the [inaudible] function, and from there you get features in high dimension.
And people actually use that to do some useful things. >> Rong Jin: Yeah, that sounds very interesting, except I really don't have any input [inaudible]. Right, but I really don't know anything about that. But it sounds very appealing, and I hope to see a good empirical [inaudible]. Okay, so any other questions? Oh, [inaudible]? >>: Hashing methods? >> Rong Jin: Hashing. So there are two things... >>: [Inaudible]. >> Rong Jin: Right, right. So, good question. I think the difference comes down to two things: one is that, typically speaking, you're actually looking at a binary coding of the vector. And second, hashing really looks to preserve the geometry of the data points you have, so it does not have anything to do with the learning procedure, in my opinion. I'm not sure there are any very good results, for instance, if you hash a low-dimensional vector into a binary discrete space – what can you say about the learning performed after that transformation? That part, I don't see people do. So our emphasis is really related to learning: basically you take the data and do a transform mapping it into a different space. Now how do you guarantee your solution is almost as good as the solution you get from the original data? So it's slightly different. Okay. All right, I hope that answered everybody's questions. [Inaudible]... [Audience clapping]