>> Dengyong Zhou: It's my pleasure to invite Jian Zhang as our speaker this afternoon. Jian is from the Department of Statistics at Purdue University, where he is currently an Assistant Professor. He received his Ph.D. in computer science from CMU in 2006, and his research is in statistical machine learning. So, please.

>> Jian Zhang: First I would like to thank Dengyong for inviting me here. It's really my pleasure to visit, and it's very nice to see some old friends. Okay. So my talk today is about large-scale learning using a technique we call data compression. As you will see, this is actually a very simple idea. This is joint work with Zong Yen.

Here's the outline of the talk. I'll first talk about the challenge -- the massive data challenge for large-scale machine learning -- and about some possible approaches that can be used to handle it. After that I will introduce a method and give you the motivation for why it should work. Then I will show you the analysis of its statistical properties. After that I'll present some grouping algorithms which are used to compress the data, and finally I'll show some experimental results and conclude.

So what is the challenge? I think one of the most important challenges in machine learning nowadays is how to learn with massive training data. In particular, there are three difficulties. The first is training time complexity: when you have a dataset with thousands or even millions of examples, how do you efficiently and effectively learn a predictor, a classifier or regressor? The second is testing time. It is often the case, especially in industry, that you need to apply the classifier you learned to make predictions for, say, millions of Web pages. How can you do this efficiently? This is particularly important in the large-scale setting because, with many training examples, you typically move toward more complex, nonparametric models. In that case it is no longer possible to store the classifier as a simple weight vector and do efficient prediction; you really need the training data to make predictions. The third is storage. This may not be as severe an issue as the previous two, and there is often a trade-off between memory and training time, but it can be a problem, for example, when you want to deploy the classifier on a portable device like a PDA with limited storage.

There are many examples that fall into this category: classifying Web pages retrieved by search engines, pattern recognition on a DNA database, or modeling network data or financial stock market data, where you have tons of data. The last example I put here is the emerging online blogs and videos and the social network data. Those are all examples of the large-scale, massive datasets we need to deal with.

So here I would like to briefly go through several possible approaches that can be used to handle such a situation. Some of them were not originally invented for this particular problem.
But I list them here because they can at least be used to handle the problem, and of course I will mention their limitations.

The first one is quite simple: random downsampling. The basic idea is that you take a random subset of the original training data and then run the classifier or regressor, whatever the algorithm is, on that. It's fairly easy to use, but the problem is obvious: you simply discard a lot of valuable information, so you're wasting your data. This is actually a popular method for unbalanced classification, where you have, for example, a very large number of negative examples but only very few positive examples and have to handle the unbalanced dataset.

The second one I list here is active learning. The goal of active learning is to sequentially seek out the training examples whose labels, when requested, can most improve the classifier or predictor. Typically this is done by various criteria: you may want to find the example with maximum uncertainty, or, if you're Bayesian, the example such that after adding it the variance of the posterior distribution is most reduced. The drawback of this approach is that it can be very expensive to find out which example is the most valuable for updating your classifier. In that sense it's more suitable for problems where labels are expensive to obtain. In our case we have plenty of labeled data and want to make full use of it, so this may not be that appropriate here, but it is a possible approach.

The next one is really not a single method; it's something I call distributed computing. Essentially the idea is to utilize a distributed computing environment for your learning task. One very good example is MapReduce on multi-core machines. MapReduce has been used to help solve large-scale problems and can be very efficient; it really can handle very large datasets. The problem with this approach is that it is often nontrivial to distribute or parallelize an algorithm, and it may not be applicable to many learning methods.

The last category I want to mention is what I call sequential learning methods; the well-known examples are online learning, or training your classifier using stochastic gradient descent. This type of approach updates the classification rule sequentially, going through the examples one by one, or small batch by small batch, instead of trying to solve the single optimization problem as a whole. These methods are very scalable, but they can also cost a lot of time, because sometimes you have to go through multiple passes over the data, and you need to store the data while you do the training. And for prediction, if you are using nonparametric models, you need rather large storage as well. I also want to mention that these methods can be computed in a distributed way.
For example, [indiscernible] had a paper recently about how to run this in a distributed computing environment. So those are the possible approaches. Our goal here is a method which is scalable in terms of training time, test time, and storage -- something that can be very scalable and efficient for very large training sets.

First, some notation. Assume you have i.i.d. observations X1 to XN drawn from a distribution P_XY, which is typically fixed but unknown. For binary classification the labels are binary, 0/1 or plus/minus 1, whichever way you want; for regression they are real-valued. We are given a training set D_N, which in our situation is typically very large, and P_XY is unknown, as we mentioned. A learning algorithm is a procedure which takes this training set and produces a predictor, so it's a function of the training set. Typically, when you try to find this predictor, you are also searching over a function class, often called the hypothesis space, so the predictor can also be written as a function of the function class H and the training set. We will mainly focus on supervised learning problems, which essentially means classification and regression.

In order to evaluate the classifier or regressor, we need to define a loss function. Here L is the loss function we will use to evaluate the goodness of the classifier, and the risk of a classifier is defined as the expected value of this loss, where the expectation is taken over P_XY. The risk itself is a random variable, because it's a function of D_N, since the classifier is estimated from D_N. One thing I want to mention: this loss function is the one we use for evaluation. There is another loss function, often called the surrogate loss, which is the one you use to train the classifier. In this talk we don't differentiate the two; we simply assume they are the same. But there are situations where they differ -- for example, if you evaluate with the 0-1 loss, it is hard to minimize directly [indiscernible], so you use a surrogate loss. It is well known that there are certain relationships between the excess risks under two different loss functions.

I also want to give the definition of risk consistency. A learning procedure is called risk consistent if its expected risk converges to R star, where R star is the minimum possible risk achievable over all measurable functions -- the Bayes risk in the classification setting. Essentially this says that as more and more training data is provided, you should converge to the optimal solution, optimal in the sense that it attains the minimum risk.

Okay. So how do we learn with compressed data? Let's look at the standard learning problem. We have the training set D_N, and we will focus on one of the most popular frameworks, regularized risk minimization. Essentially you want to solve this optimization problem P1, which says you minimize the empirical loss over the training data -- since you have the Y labels you can do that -- plus some regularization term.
Here G of H is a roughness penalty, which penalizes very rough or complex functions, and lambda is the parameter controlling the trade-off between goodness of fit and model complexity. L is the surrogate loss function, which, as we mentioned earlier, we assume is the same as the loss used for evaluation. So this is the learning problem. For large-scale learning with a massive dataset, even though people often use convex loss functions with relatively simple hypothesis spaces, this can be very challenging, because N is very large: you want to minimize this objective over thousands or millions of examples. How can you solve this problem effectively and efficiently? That's the challenge.

So what do we propose? We propose the following simple idea. First you partition the dataset: your dataset is D_N, and you construct a partition G_N containing M_N sets. The reason I have the small n subscript is that M should typically depend on N -- it's a function of N, so as N changes, M_N should change -- but I'm going to suppress the subscript in the following. For a partition, of course, the union of the sets has to be the whole dataset and the pairwise intersections are empty.

Now I'm going to define the compressed examples. For each set A_j in the partition, the compressed example (X tilde j, Y tilde j) is simply a weighted average: X tilde j is the weighted average of the X's in the set, and Y tilde j is the weighted average of the corresponding Y's. W is the weight function; it depends on which X you take and which set of the partition it belongs to, and we assume the weights are normalized, so if you sum over all X in a particular set you get 1. This (X tilde j, Y tilde j) can be thought of as the representer of that particular set: for each set you come up with one representer. Even though this is written in a more general way, a simple way to understand it is to think of W(X_i, A_j) as 1 over the cardinality of A_j if X_i belongs to the set, and 0 otherwise -- an example contributes nothing if it is not in the set. If you do this, it becomes a simple average. So it's really simple: you partition the data and you take the average, for X and for Y as well. That's how you get what we call the compressed examples.

With this compressed data, how do you do the learning? Again it is very simple -- you don't even need to change the software. You treat (X tilde j, Y tilde j) as your data and fit the same model as before. The n_j here is simply the number of examples in set A_j. If the sets are all the same size, this factor cancels with that one, and instead of 1 over N you have 1 over M, where M is the number of sets in the partition. But when the sets have different sizes you want to weight them, because each set carries a different amount of information, and weighting it properly makes the procedure more efficient in the sense of estimation. We call this problem P2. So this is essentially how we solve the learning problem using the compressed examples.
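To make the recipe concrete, here is a minimal sketch in Python of problem P2 for the regression case with the simple-average weights just described. It is only an illustration of the idea in the talk, not the speaker's code: the use of scikit-learn's KMeans for the grouping and Ridge for the weighted fit is an assumption for the sketch (the talk's own grouping algorithms come later), and all function names are hypothetical.

```python
# Minimal sketch of the compression idea: partition the data, average X and Y
# within each group, then fit a regularized model weighted by group sizes n_j.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def compress(X, y, n_groups, random_state=0):
    """Return one averaged 'compressed example' per group, plus group sizes n_j.
    (For classification, the talk compresses positives and negatives separately.)"""
    labels = KMeans(n_clusters=n_groups, n_init=1,
                    random_state=random_state).fit_predict(X)
    X_tilde, y_tilde, sizes = [], [], []
    for g in range(n_groups):
        mask = labels == g
        if not mask.any():
            continue
        X_tilde.append(X[mask].mean(axis=0))   # simple average = equal weights
        y_tilde.append(y[mask].mean())
        sizes.append(mask.sum())               # n_j, used to reweight the loss
    return np.array(X_tilde), np.array(y_tilde), np.array(sizes)

# toy data (hypothetical sizes, just to exercise the sketch)
rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=100_000)

Xc, yc, n_j = compress(X, y, n_groups=1_000)
# Problem P2: each compressed example is weighted by its group size n_j;
# the regularization strength would still need to be tuned as usual.
model = Ridge(alpha=1.0).fit(Xc, yc, sample_weight=n_j)
```

The fit step is the same regularized risk minimization as before, only on M compressed examples instead of N originals, which is where the computational saving comes from.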
Now, the obvious question is: if we do the compression this way, this is much more efficient to solve than the original problem, because typically M will be much smaller than N. But does it work -- if you simply average, does this work? And how does it compare to the simple approach of random downsampling: what if I just randomly sample M examples and run the method on them? The other related questions are how you choose the partition and how you choose the weight function. Those are the questions we want to address. Any questions?

>>: When you said that this is obviously more efficient because there's less data, the fair comparison would be the time it takes to minimize plus the time to find the partition -- so you need a way to find the partition which is significantly faster than the minimization.

>> Jian Zhang: Exactly, that's a very good point, and that's indeed the case. You need to consider both the time to find the partition and the compressed data, and the time to learn on it. I agree with you.

So here is a simple motivation. I always like to use simple examples: you want to start with something simple to motivate the idea, but still keep the interesting structure you want to study. Consider simple linear regression, where Y_i equals X_i transpose beta plus an error term. To make it even simpler, let's assume the X_i are multivariate standard normal and the epsilons are independent of X and normal with variance sigma squared. In this case we can write down the closed-form solution if you use the least squares estimator. This estimator is unbiased, and you can calculate its variance, which equals this quantity. If this looks different from what you've seen, it's because here X is random: typically the variance is sigma squared times (X transpose X) inverse, and if you take a further expectation over X, because X is random, this is what you get. It's a measure of how confident you are in this unbiased estimator. So this is what you get for the simple linear regression problem. Now let's look at what you get if you use the compressed examples.

>>: What is P?

>> Jian Zhang: Sorry -- P is the dimension of X, how many features.

So now consider the following. Suppose we construct the compressed samples randomly. By randomly I mean: the X's are drawn from the standard normal, and for every K samples, where K is N divided by M, you average them. This is how you get X tilde and Y tilde. Because we started with a very simple example, it is easy to see that X tilde and Y tilde still satisfy the same linear model: Y tilde equals X tilde transpose beta plus epsilon tilde, where epsilon tilde is the average of the noise terms. But the distribution of X has changed: it is normal with mean 0 and covariance reduced to 1 over K times the identity, instead of the identity. And the error is also i.i.d. with its variance reduced by a factor of 1 over K. For this case we can again work out the least squares estimator. It is unbiased again, and you get this quantity for the variance. Okay, now let's look at this quantity.
If you compare this quantity with the previous one, it's actually a little bit disappointing. Why? Because it works out to sigma squared divided by M minus P minus 1. What does that tell you? You get the same performance as if you had randomly sampled M examples, which is disappointing: you do a lot of extra work and get the same result, so what's the point? But if you pause a little bit and look at how you arrive at this result, you actually find something interesting -- hopefully you agree with me -- which is the following. If you factor this variance, you get the product of two terms. The variance of the error is reduced to sigma squared over K instead of sigma squared, which is good; this is fairly intuitive -- if you average noise, you get smaller-variance noise. But on the other hand, even though you improved Y through the compression, the compression of X also changed its distribution: the variance of X is reduced as well, and it turns out that has a bad effect on the estimator. This is also not hard to understand: if you do regression with X spread out, the estimator is very stable; if you push the points together, it gets very unstable and you get a bad result. It just so happens that those two effects cancel each other, and you're back to the random-sampling result.

>>: This is basic, but how can you have a classification problem with labels [indiscernible]?

>> Jian Zhang: Yes.

>>: They're equally balanced.

>> Jian Zhang: Yes.

>>: And if you take some nice blocks, the labels are going to average out to 0.

>> Jian Zhang: Yes.

>>: So what are you learning? How does that help?

>> Jian Zhang: That's a very good question -- I'll talk about it. It turns out that for classification we don't do the compression this way. But for regression, yes, we do it this way.

Now, if you look at this and see that the two effects cancel and you gain nothing, it looks pointless. But once you realize it's because of those two effects, the natural question is: can we keep the good effect and get rid of the bad one? We want to keep this one, because it makes our data better -- you have more signal in Y because you take the average -- but we made X worse, because it's not as spread out as before. Can you do better? If you think about it, the answer is yes. How do you achieve this? Here's the idea. When you average the data, instead of taking random samples, you only look at local neighborhoods. Within a local neighborhood you take the average of Y, so you still get the good part of reducing the variance of Y. But because you only average X over a local neighborhood, what will the data look like afterwards? The averaged points spread out almost the same as before, and if you have the same kind of spread, the variance of X is kept almost the same. So essentially, if you take local neighborhoods, you almost keep the variance of X, you reduce this term, and the whole thing improves. I hope that's an intuitive example of why this should work.

>>: Let me try this. So this is least squares again, on the Y part [indiscernible].

>> Jian Zhang: Least squares, yes.
>>: So what you've done is you've taken the actual labels and averaged them, and you're predicting average labels.

>> Jian Zhang: Yes.

>>: How does this -- at test time you're going to want to predict --

>> Jian Zhang: Yes, exactly. Next I'm going to show that this works in the sense that you do the risk minimization and you can also get results about the estimation error.

So now comes the analysis for the more general framework. The previous one was a very intuitive example, but as Chris mentioned, there are all kinds of questions: how do you do classification, why would you predict the average, what if I use a different loss function, what if I don't use a linear model but a reproducing kernel Hilbert space, and so on. Here is the setting. For simplicity we assume that each group has the same size when you take the average; it really doesn't matter if you make it more general, it just simplifies the notation. R is the risk -- the thing you ultimately want to minimize -- R_N is the empirical risk for the original data, and R tilde M is the empirical risk if you use the compressed data. These are the quantities we're going to use.

Here are the assumptions. Let me quickly go through them; some are standard, and for some I'll give the motivation for why we need them. The first one is that H is a reproducing kernel Hilbert space, which is of course a very popular choice in machine learning, with a kernel whose diagonal K(x, x) is bounded for any x; this is often needed if you look at how people derive these things, and it gives a bound on the Hilbert norm. This is not essential -- you could change it to something else -- so it's not really important. The second assumption is that the loss function, composed with the functions in H, is Lipschitz continuous. This is an assumption you often see, and many loss functions satisfy it, or they satisfy a restricted Lipschitz condition, meaning Lipschitz within a bounded range, like the squared loss; that does not make much difference, you can handle it with a bit more technical detail.

The third one is about the weights: how do you decide what weights to use? Here I simply take the average, so if X_i belongs to the set it contributes equally, and otherwise it contributes 0 -- it contributes nothing if it's not in your set. This is not essential, I want to say; it just simplifies the notation. You can make these arguments asymptotic as well; it does not change the result. The crucial part is the following: M goes to infinity, and M divided by N goes to 0 as N goes to infinity. This is the typical kind of assumption you see in nonparametrics; that's the standard.

The fourth assumption is probably the most important one, I would say. It says that this quantity is O of delta N. What is this quantity? If you take W to be the average weight, this is essentially the average distance between pairs of X's inside each set, where D is some distance measure you pick -- the usual one, or you can use others. So it's the average distance over all X's belonging to the same set, and we assume it is big O of delta N.
Here delta N is a sequence that we want to converge to 0. This means that the average distance -- you can call it the diameter -- of each group converges to 0 as you get more and more data. So it's just a sequence of numbers converging to 0 as N goes to infinity.

The fifth assumption is actually quite intuitive. We define something called sigma H squared, which is the difference between the expected loss evaluated at Y and the expected loss evaluated at the conditional expectation of Y given X, and we assume it is finite. Why do I call it sigma H squared? If you take the loss to be the squared loss and H to be a linear function, this is exactly sigma squared, the noise level -- the unavoidable part of the noise. You cannot do anything better than this; it's the part you have to live with. And finally we assume that Y is bounded. Now, given those assumptions -- okay, yes?

>>: In the least squares example, the reason this works is that you assume things are i.i.d. Assumption five is hiding that here in some sense, right?

>> Jian Zhang: Yes.

>>: If things were not independent, as in the previous case, you wouldn't get the benefit of the averaging --

>> Jian Zhang: If you average and the terms are correlated rather than independent, you still get some reduction, but not as big as if they were independent.

>>: And in this more general setting, assumption five is essentially capturing that information? Some notion of independence?

>> Jian Zhang: Yes, in some sense -- how the conditional expectation compares with this quantity. Okay. So the first result is about the two problems P1 and P2. Remember, P1 is the original problem, the thing you would like to solve if you were able to do the computation, and P2 is the one we actually solve after we compress the examples. How well does P2 do compared to P1? This result essentially gives you a closeness measure between problems P1 and P2. It says that, under the previous assumptions, and assuming the weights are chosen independently of Y given X -- this was implicitly assumed in the previous case: the weights are a function of X, not of Y -- you get a uniform bound on this quantity, uniform over H. This term is the objective you would get if you plugged H into the original empirical loss -- and remember, it does not involve the penalty term, only the empirical loss part. This term is the loss you would get if you applied H to the compressed examples. And sigma H is what we defined earlier; it is a constant once you fix H, I should say. Then this term is of this order, in O_P, and the order is related to M and delta N, where delta N is again how fast the diameter of each group converges to 0.

Here are some remarks about this result. First, we didn't make any parametric assumption about P_XY, so it's fairly general. Second, it can be applied to both classification and regression; nothing requires Y to be binary or real-valued specifically. The third remark is related to the question Chris asked: there is a gap, sigma H squared, which depends on H. Essentially this is the bad news.
It says the following: if you use the compressed examples to do the learning, you won't in general get the same thing, because this is not the same as that -- there is a gap. Now, if the gap is a constant, you are fine: you're optimizing an objective shifted by a constant, so you get the same result, the same order, and you expect the same solution. But if the gap depends on H, that is bad. There are some special cases: if you do a little study you will find, for example, that in the previous setting, with squared error and a linear function class, the gap becomes a constant, sigma squared, the noise -- it does not depend on H. In that case you are fine and can use this result. What if that's not the case -- how do we handle general convex losses? It turns out that, given this result, a simple modification solves it. Here is the modification. It is really nothing more than what you would do for classification anyway, because for classification it does not make sense, as Chris mentioned, to average the positives and negatives together -- then what are you trying to learn? So for classification you do the compression conditioned on the class label: you look at each class [indiscernible] and compress it separately. You don't mix examples from different classes.

>>: Can I take you back to regression for a second -- can you go back to the theorem you proved? It seems to me like you're losing a lot if you compare this to the stochastic approaches you mentioned. To do the compression you still have to touch each example at least once, and a stochastic method that does one pass over the data also touches each example just once. So you're not really gaining anything computationally, and we know the stochastic updates converge much faster than 1 over square root of N -- you get 1 over N or 1 over N squared -- so you won't have to do multiple passes when the data is very big.

>> Jian Zhang: First, I agree with you that stochastic methods are very scalable and efficient for learning. But there are certain advantages of this approach, which I was planning to mention at the end. First of all, with the stochastic method you essentially need to keep the data for prediction if you are really doing nonparametrics. With this one, on the other hand, even if you do nonparametrics, all you need to store are the M compressed samples, compared with storing the original examples. The other thing is that when you say stochastic gradient is efficient with one pass, that is typically fine for certain problems if the goal is learning, but if you look at the optimization objective you often need multiple passes over the data to get a good result.

>>: But the stochastic algorithm has these 1 over N or 1 over N squared rates, and N is huge. Why wouldn't you use that?

>> Jian Zhang: I see -- are you talking about the Nesterov-type results? When you say 1 over N or 1 over N squared, do you mean the loss of the --

>>: 1 over N, or log N over N. Think of just stochastic gradient [indiscernible]; the rates are 1 over N, which is already very fast. If N is a million, with 1 over N you're converging very, very fast.
>> Jian Zhang: When you say 1 over N, you mean the objective converges to the optimum objective at a rate on the order of 1 over N.

>>: Yes, which is what you have here.

>> Jian Zhang: Right -- here you plug the solution into the risk and you get this result. I agree. But as I mentioned, in that case, when you do the prediction, you still need to store a whole bunch of data if you're doing nonparametric regression. In this case you don't need to; that's one of the advantages of this approach.

>>: Why do you need to store all the data?

>> Jian Zhang: What's that, I'm sorry?

>>: Why are you storing all the data? I can see storing something for the final predictor -- the centers or something.

>> Jian Zhang: If you are really doing something nonparametric -- say kernel learning -- then you need to store the data corresponding to the nonzero coefficients after you train.

>>: [indiscernible] doesn't, does it?

>> Jian Zhang: I'm sorry, what's that?

>>: [indiscernible] doesn't store the training data.

>> Jian Zhang: For that one you already assume the function is linear, right -- I mean the weight vector W -- so essentially you put a very strong parametric assumption there.

>>: I'll grant that it's old-fashioned and not very good, but nevertheless it's something that doesn't store the data.

>> Jian Zhang: Yeah, could be. In that case, if you don't need to store the data, then I think the stochastic approach is a very promising candidate. But this method is, in some sense, orthogonal to what people do with online or stochastic gradient methods, so you really can combine the two; nothing forces you to use one instead of the other.

Okay. So we said there is a gap between the two objectives. If you do classification by compressing the positives and negatives separately, you get the following result: the difference between the two objectives has no gap -- the gap becomes 0 -- and the two objectives get close to each other at the rate of delta N. That is the result for classification. Of course, it automatically follows that the risks converge to the same thing, since we have a supremum bound on the difference.

So what about the classifier you obtain? When you do this minimization you ultimately care about the risk being minimized, not just the objective. If you plug in the previous result and combine these things, it is not very hard to arrive at the following: if you define H star to be the best in the class -- the one achieving the minimum risk in your class -- then the excess risk of the classifier obtained from the compressed examples, minus the optimum in the class, is of this order. So that is the result. Now, it is nice to compare this with the standard result: what do you get using the N original samples? If you just use the N samples under the same assumptions, this is what you get, and it is something stronger.
It gives you the tail probabilities for this one. You could get a similar thing here too, but then you would need a very stringent, probably unrealistic, assumption about how the grouping is done, so I just give the rate. If you look at the rate: the convergence rate for this term is N to the minus one half, lambda is something you can choose -- you can set it to a negative power -- and delta N is the rate at which the diameters shrink.

>>: So it's H star [indiscernible]?

>> Jian Zhang: Yes.

>>: How about the 0 -- or is that just a typo?

>> Jian Zhang: Which one?

>>: On the left-hand side of the bottom expression.

>> Jian Zhang: This one?

>>: Yeah.

>> Jian Zhang: No, it's a supremum over all H -- sorry, this is a typo. That should be R star.

So here is some discussion. If we take lambda M to be this quantity to the power minus one half, then this converges at a rate which is the maximum of these two quantities, and consistency of course automatically follows. One thing I want to emphasize is that in practice there is often a limit on M, depending on what kind of machine you use, whether a personal computer, a server, or some small device. In that sense, if you have a limit on M, then delta N can be well controlled. If you can take at most, say, 10,000 groups, you can in fact keep delta N small by throwing away points that don't fit into any group. In some sense you are not forced to use all the data if some of it doesn't help.

There are also some possible improvements. The rate obtained in Theorem 2 can in fact be improved for specific loss functions, at least for the hinge loss. For the squared loss I can show that this is the rate you get and you cannot improve it further, but for the hinge loss you can improve it, because of the particular piecewise structure of the hinge loss function.

This was all about the risk, which in most cases is what you are interested in. But sometimes you also want to ask a further question: what about the convergence of the classifier itself to H star, the best in the class, with respect to a certain metric -- the L1 metric, or an Lp metric? To get such a result you typically need some identifiability condition, because convergence of the risk does not, in general, guarantee convergence of the minimizer. So what makes the minimizer converge? There is a very popular condition proposed by Tsybakov (2004), called the low noise condition. Basically it says that the probability mass within distance t of the Bayes decision boundary is small -- bounded by t to some power. This is a very nice condition, because I think it points out one important difference between regression and classification. For regression, as we know, you can use the same loss -- like least squares -- for both regression and classification. But when you do regression, once you draw the regression function, the density of points around it will be high, because of the model assumptions. For classification this is not necessarily the case: realistically, the density might look like this instead, with low density in the region close to the decision boundary.
If that is the case, you can make this assumption, and under it, it is not hard to show that you actually get convergence of H tilde M to H star with respect to the L1 metric.

>>: Is that the true density or the estimated one?

>> Jian Zhang: It is the true conditional density of Y given X -- the true one. And in this case the rate of convergence is the risk rate raised to a power that depends on the noise exponent.

The next question we want to answer is: how do we obtain those groups, and can we find them efficiently [indiscernible]? Because if this takes a long time, it may not be worth it, given that the argument is about computation. So can you obtain the local groups efficiently? Here is the simple algorithm we actually use; it turns out to be essentially a one-step k-means. I'll talk about the difference between this grouping algorithm and clustering on the next slide. The basic idea is that you randomly select seeds from the examples -- say you have ten million, you randomly select 10,000; those will be the seeds [indiscernible]. Then you compute the pairwise distances between the seeds and the examples, assign each example to its closest seed, and take the average with the weight function you specified. That's it -- essentially one step of [indiscernible]. It's very simple. And with some additional conditions you can show that if you do this, the average diameter of each group indeed converges to 0 in a probabilistic sense: if the support is bounded in R^P, then this average distance is big O_P of M to the power of minus 1 over P. So it does get worse in higher dimensions, but it converges. As I mentioned earlier, in practice this always improves over random downsampling. One reason is that you do not necessarily need to put every example into a group: if you have a limit on M, you can stop whenever you feel the distance is getting large.

The previous algorithm is a very simple one. Its time complexity, if you look at it, is simply M times N, which means that if M is large and N is not small, the distance calculations take a lot of time. Here is an algorithm that can perform the computation much more efficiently; it is simply a hierarchical version of what we used before. The idea is that you randomly select some seeds at the top, the root node, and based on the distances you route the examples into, let's say, ten child nodes; then for each node you do the same thing again. The branching factor at each inner node is M_i, and you only need to choose the M_i so that their product is M -- that's all that's required. Eventually, at the leaf nodes, you get small groups of observations, which you compress and output as the compressed examples. If you compare the two algorithms: the first one, as I mentioned, has complexity M times N, while the second one is essentially N times M to the power of 1 over D, where D is the depth, so it can be much faster when M is not a small number. In practice we found that even for very large datasets you can simply take D to be 2 or 3; that's good enough to give you an efficient grouping algorithm.
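Here is a rough Python sketch of the flat one-step grouping just described: choose M random seeds, assign every example to its nearest seed in one pass over the data, and average within each group (per class, for classification). It is an illustrative reading of the algorithm, not the speaker's code, and all names, the chunk size, and the blockwise distance computation are assumptions for the sketch; the hierarchical version would simply apply the same step recursively with branching factors M_i whose product is M.

```python
# One-step grouping sketch: M random seeds, nearest-seed assignment in one
# pass, then within-group averages as the compressed examples.
import numpy as np

def one_step_grouping(X, y, m, rng=None, chunk=10_000):
    """Return compressed examples (X_tilde, y_tilde) and group sizes n_j."""
    rng = np.random.default_rng(rng)
    seeds = X[rng.choice(len(X), size=m, replace=False)]
    seed_sq = (seeds ** 2).sum(axis=1)
    assign = np.empty(len(X), dtype=np.int64)
    for start in range(0, len(X), chunk):           # single pass over the data
        block = X[start:start + chunk]
        # squared distances to all seeds, computed blockwise to bound memory
        d2 = (block ** 2).sum(axis=1)[:, None] - 2 * block @ seeds.T + seed_sq
        assign[start:start + chunk] = d2.argmin(axis=1)
    X_tilde, y_tilde, sizes = [], [], []
    for g in range(m):
        mask = assign == g
        if mask.any():                               # skip empty groups
            X_tilde.append(X[mask].mean(axis=0))
            y_tilde.append(y[mask].mean())
            sizes.append(int(mask.sum()))
    return np.array(X_tilde), np.array(y_tilde), np.array(sizes)
```

Because each example only needs its nearest seed, the chunks can be processed independently, which is also why the grouping distributes easily, as discussed next.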
Here is some comparison to clustering algorithms. As we mentioned, the first algorithm is essentially a one-step k-means, but the goal here is different: with a clustering algorithm you typically want to find a partition that groups all nearby data points into the same cluster. Here, if you look at the assumptions, all we require is that the diameter of each group be small; we said nothing about overlapping. In other words, you could have groups -- sets -- that sit on top of each other. As long as each group itself is small, you are fine; it does not matter that they overlap. That's one of the reasons we can do this in one pass. It also means the grouping can easily be computed in a distributed way: you simply split the data into several chunks, process them separately, and that's fine, because for each chunk you still guarantee that the diameter of each group is small. There is no requirement to find all the data in a neighborhood and put it into one set.

Okay. So here are some experiments. Yes?

>>: I want to ask a question about your analysis again. Anywhere in the analysis, did you assume that the true function generating the labels belongs to the class you're searching in -- belongs to the kernel space?

>> Jian Zhang: No.

>>: Is that somehow implied by one of your assumptions, or are you totally agnostic?

>> Jian Zhang: I don't think that's assumed, because here we focus on the estimation error part; we don't touch the approximation error. If the true function is not in the class, you have a large approximation error, which in some sense you can't control -- you have to pick a function class large enough to make the approximation error small. The analysis is mainly about the estimation error.

>>: You can reason about the estimation error either with an agnostic assumption or without one. The agnostic assumption says the truth could be anything; alternatively you could use the fact that the hypothesis --

>> Jian Zhang: If you know that, I think you can do something better. For example, you can use local Rademacher complexities and make these bounds sharper. That's where it can greatly help with the estimation: if you know the true conditional probability is well behaved, in the sense that it has low noise with a certain exponent, then you can get a much faster rate of convergence for the estimation, even though the rate for the risk stays the same.

So the methods we're going to compare are the following. The first is "full": you train using all the training data you have. The second is downsampling: you randomly sample a subset and train the classifier on it. Then there are the compression methods: the first constructs compressed examples but uses only random grouping -- you randomly select subsets and average them -- and the second does the local grouping using our algorithm; for classification this is done for positives and negatives separately. We also tried grouping all examples together, without separating the classes, to see what happens -- the theory does not suggest doing that, but we did it anyway. The datasets we used are one simulated dataset, four UCI datasets, and two large-scale datasets.
The first one is the simulated data, so we know what the data looks like. We generate data from this logistic model: we simulate X and Y given the true parameters. Here you can compute the Bayes error, which is 0.246, and we compare the performance of the four algorithms under different (N, M) settings, where N is the full sample size and M is how far you compress, or how much you sample in the case of downsampling. We try to keep M small here. You can see the Bayes error at 0.246. The full method is of course the best performing, if you were able to use the whole dataset. The compressed method with local grouping does quite well compared with the full method and with the Bayes error -- even with M as small as 20, that is, with a lot of compression, it does very well in this case. The downsampling does not perform as well because you lose a lot of information. The version that ignores the Y labels when grouping also converges in this case, but it does not do as well as when you separate the Y labels. So this is a simple toy example showing that it indeed performs quite well.

We also tried this on the UCI datasets. Here are some statistics; these again are not large datasets -- pretty small, at most medium-sized. We use SVMs with an RBF kernel and do tenfold cross-validation. Because the datasets are not large anyway, we set D equal to 1; that can easily handle them. And we vary the ratio M over N, which you can think of as the compression ratio -- how much you want to compress -- from 0.1 to 0.5. Here are the results for the first two datasets, Ecoli and Glass; you can see the performance. The bottom curve is what you obtain using the full set, so you can view it as the best performance achievable using the whole dataset. In both cases the compression with local grouping does the best among the other three methods. And even though random grouping showed no advantage in the linear regression analysis, here the pink curve, the random grouping, performs better than downsampling on these two datasets -- though still not as well as compressing with local grouping. This is for the other two datasets. This one is interesting -- I think it is the Spambase dataset -- and here the downsampling performs better: it's actually pretty close to the local grouping, except in the tail. It's also interesting that this one does better at the end, at 0.5, which means you compress every two examples into one, and it actually does better than using everything.

>>: Is it a pretty unbalanced dataset?

>> Jian Zhang: Which one?

>>: Is it a pretty unbalanced dataset?

>> Jian Zhang: The Spambase? I think it has two classes; I can't remember exactly whether it's very unbalanced or not.

>>: Not unbalanced, but would you want to adjust the grouping methodology to have a different group size?

>> Jian Zhang: Possibly, yeah. That's one way to do it. I think it's possible, in practice at least; we tried it the same for both positive and negative.

>>: I feel like if you have a rare class, compressing the rare class is much more costly than compressing --

>> Jian Zhang: Yeah, I agree.

And this is the result on two, I would say, large datasets. The first one is an image dataset with about 75,000 images captured from broadcasts, and the task is binary: whether a person appears in the image. The second one is the alpha dataset, with half a million training examples; this is one we took from the large-scale learning challenge, one of the datasets they used there. Here we follow what the large-scale challenge suggests and report the area under the precision-recall curve, so the Y axis is the area under the precision-recall curve. This is the result for the four methods, and because the datasets are quite large we can compress quite a lot: this one, for example, goes from 0.001 up to something like 0.1 or 0.2, and this one from 0.01 to 0.2. As you vary this ratio you can see how the methods perform relative to each other. This second one I think is a little bit weird -- I suspect the data is synthetic, probably randomly generated, because they don't mention where it comes from.

>>: Are the groupings, apart from the class label, identical for the positive and negative regions? In other words, is the partition essentially "I choose this block of space, with one group for the positive data averaged in that part of the space and one group for the negative data"? Or can they be disjoint, so the positive representative came from this point and the negative from that point, and they may or may not overlap?

>> Jian Zhang: I think this depends on the data distribution, on what the data looks like. They can overlap.

>>: In your experiments, what did you do? It has implications for what you're able to recover.

>> Jian Zhang: For the algorithm, they can overlap.

>>: You're actually partitioning.

>> Jian Zhang: I did not look hard at the partitioning, because it's not a low-dimensional space; for example, if you have 150 features, it's really a high-dimensional space.

>>: How did you define the partitions, then?

>> Jian Zhang: The partitions are defined essentially by the one-step k-means: if you have a million examples and want to come up with 1,000 groups, you randomly sample 1,000 seeds and then group the examples around them based on distance. That way you can do it very efficiently.

Okay. To conclude: the goal here is to reduce the training time, the test time, and the storage for learning with massive datasets. The method is very easy to compute, and you can use existing packages -- you don't need to change anything, as long as you come up with the compressed examples. You can obtain consistency and rate-of-convergence results, and we showed both theoretically and empirically that it can be much better than random downsampling. And the grouping can be done very efficiently, and in fact in a distributed way.
Even though we didn't do it in a distributed way ourselves, there is nothing in our approach preventing you from doing the grouping in a distributed way for a very large dataset. Finally, as I mentioned earlier, this could be combined with other methods to handle truly gigantic datasets, because it is orthogonal to the online approaches like online learning or stochastic gradient descent: you can compress the data and still apply those methods, and they don't conflict with each other. So that's all. Thank you.

[applause]