>> Ofer Dekel: So it's our pleasure to host Nati Srebro today from TTI in Chicago. Nati earned his Ph.D. at MIT. After that he spent some time as a visiting researcher, first at the University of Toronto and then at IBM Research in Israel, at the Haifa Labs. He is now a professor at the Toyota Technological Institute in Chicago, and he'll talk about his work which won the Best Paper Award at ICML this year. It's guaranteed to be enjoyable for everyone. >> Nathan Srebro: Thank you. Okay. Hello. So what I'll be talking about today is actually two different things. Mostly I'll be talking about training support vector machines — I'll explain in a bit exactly what I mean by that, and what support vector machines are. Then in probably the last quarter of the talk, maybe a bit less, I'll also talk about clustering. Both of these, I think, illuminate fairly different aspects of an underlying theme which I find very interesting and important, and which I'll return to: looking at the interaction between computation and data and, in particular, seeing how computation becomes easier as we have more and more data. I want to point out that this is based on several pieces of work — mostly with Shai Shalev-Shwartz and also (inaudible), and previous work with Greg Shakhnarovich and Sam (inaudible). So I'll start with support vector machines. A quick reminder; I'm assuming that most of you here have seen SVMs in some form or another. From my perspective, an SVM is really just linear separation. We have a bunch of training data points, and based on them we want to predict whether a future point is green or red. We do that by finding a linear separator — not just any linear separator, but one with a large margin, the one that maximizes the separation between the separator and the data points. If I characterize the linear separator by the vector that is normal to it, this means that the inner product of this vector with each point has to be more than the margin or less than minus the margin. Equivalently, we can renormalize: instead of talking about a unit vector, require the margin to be one, and then what corresponds to the margin is one over the norm of the vector. Also, we maybe can't hope to get, or don't want to get, complete separation. There might be some noise, some points that we can't quite separate, and that's okay — we're just going to pay for them, and how much we pay is how far they are from where they're supposed to be. For example, for this red point that's on the wrong side of the margin, we just pay this much. Okay. So what this means is that we can think of SVM training — finding this best separating large-margin hyperplane — as a problem of, on the one hand, maximizing the margin, which means minimizing the norm of W, and on the other hand minimizing the error. And again, the error here is the so-called hinge loss, which just looks at how far points are from where they're supposed to be. The end result of this quick story is that training a support vector machine essentially boils down to solving a convex optimization problem: minimize a combination, controlled by a regularization parameter, in this case lambda, of the norm of W and the error. Okay.
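For reference, the convex problem being described can be written as follows in one standard formulation; the exact placement of lambda and the constants varies between presentations:

```latex
\min_{w \in \mathbb{R}^d} \;\; \frac{\lambda}{2}\,\|w\|^2
\;+\; \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i \langle w, x_i \rangle\bigr)
```

Here the second term is the average hinge loss over the n training pairs (x_i, y_i) with y_i in {-1, +1}, and the first term is the regularizer controlling the margin.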
So, again, I presume that many of you have already seen such presentations of SVMs. So we're left with this optimization problem — how do we solve it? It's just a convex optimization problem; it can actually be written as a quadratic program, and we can throw it at QP solvers. Straightforward optimization methods for this problem, based on popular methods for solving quadratic programs such as interior point methods, take time that scales roughly with the fourth power, or maybe the three-and-a-half power, of the size of the dataset. This is fine for solving SVMs of size up to maybe a hundred variables or so. And there has been much improvement on this. The most popular methods now for the support vector machine optimization problem are based on decomposing the dual problem — SMO and related approaches — and they get the runtime down to scale quadratically with the size of the problem, at the cost of only linear, as opposed to quadratic, convergence; that is, a dependence on the log of one over the accuracy rather than the log-log. And more recently there have been methods suggested, specifically for linear problems, which scale only linearly with the size of the problem. So all these methods really look at the problem as an optimization problem, and all of them tell us that when we have more data, we should expect to work more. They just try to push down how much more: instead of a fourth-order increase, maybe we can get down to only a linear increase with the size of the data. But in any case, we study the problem here, as we do for most problems in machine learning and computer science, by looking at the runtime as an increasing function of the size of the dataset. And the main point I'd like to make is that this way of studying it seems somehow wrong. The runtime should not increase with the amount of data. So let's look at what's going on here. What do we use an SVM for in the first place? We use it in order to get a predictor that will have good error when used in some task — character recognition or whatnot. We don't want to do SVM training for its own sake; we train an SVM and use it for character recognition. Suppose that, given 10,000 examples, I work for an hour and produce a predictor that makes 2.3% error on my character recognition task. So that's great, and maybe I'm happy with this 2.3% error. Now you give me a million examples — a hundred times more data. What the traditional SVM optimization approach would say is: now I'm going to crank this through my optimization solver, which scales at best linearly and more realistically quadratically with the size of the training set, which means that instead of working for an hour I'm going to work for maybe a week. And after that week I will get a somewhat better predictor: instead of 2.3% error, I'll get 2.29% error. But if I was happy with my 2.3% error, there's absolutely no reason for me to work for that entire week. What I can do is just subsample: take only 10,000 out of those one million examples, throw away the other 990,000, do exactly what I did before, work for an hour just like before, and get that same 2.3% error. And this argument applies completely generally, whenever we study a machine learning problem in terms of how much work we have to do in order to get some target performance.
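A minimal sketch of the subsampling argument — here train() stands for whatever SVM solver was already in use, and n_needed is the sample size that was already enough for the error we are happy with; both names are placeholders:

```python
import random

def train_with_surplus_data(examples, n_needed, train):
    # If n_needed examples already gave the error we were happy with,
    # extra data should never force more work: at worst, ignore it.
    if len(examples) > n_needed:
        examples = random.sample(examples, n_needed)  # discard the surplus
    return train(examples)  # same hour of work, same ~2.3% error as before
```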
And performance here is target accuracy — how much work do you have to do to reach it. The amount of work should never scale up with the amount of data that I have. In fact, what we'll see is that what we can hope for, and can get, is that it should scale down. Instead of just subsampling, I can do something more sophisticated: use all of those million examples rather than throwing them away, and instead of working for an hour, work for only ten minutes and get the same 2.3% error. So what we're going to see is how to look at runtime as a decreasing function of the amount of available data — how we can use more data in order to make training take less time. But in any case, what I hope I've convinced you of is that you should definitely not study runtime as an increasing function of the amount of available data. At the very least it should not increase. Now, you might not really buy this, because maybe you really do care about that 0.01% improvement in performance — maybe it's really important for you to get better predictive accuracy. That's fine. But then you should study how runtime increases as a function of the target accuracy you want. If you want higher accuracy, you'll have to work harder, and that's fine. If you tell me, I'm going to take more time because I want higher target accuracy, that's fine. But if you say, I have a larger dataset, therefore I'm going to have to work harder — that's wrong. Some problems are really hard. Some problems really do require crunching for a week; you might have a problem so hard that you won't get anywhere even crunching for a week. That's also fine. But in that case, what you really should be studying is how runtime scales with the inherent, intrinsic hardness of your problem. For SVM problems in particular — large-margin problems — maybe this hardness is captured by the margin. If you have a very, very small margin, yes, it's harder to learn. In that case you can say: if I have a small margin, I have to run for more time, and I want to look at how runtime increases as the margin decreases, or as the norm of W increases. But again, just saying I have more data, therefore I'm going to run for more time — looking at runtime as an increasing function of N — is wrong. So we'll see this in particular for SVMs. Let's look back at SVM training and at what's going on here. We wrote SVM training as optimizing an objective. And the issue is that all the traditional runtime analyses, and also the traditional optimization methods for SVMs, really treat this purely as an optimization problem and try to get a very accurate solution to the optimization objective. The runtimes I showed you before say how much time it takes to get within epsilon of the optimum of this optimization objective. But that's not really what we care about. It's true that we optimize the optimization objective, but that's not what we actually want to minimize. What we really want to minimize is our true objective, which is prediction error — we want low error on future examples. So why don't we just minimize that directly? Well, we can't, because we don't observe our future error. But that is what we really care about.
So instead of minimizing the future error, what we actually do is minimize some surrogate, written in terms of the training error and regularization — we minimize our optimization objective, denoted here by F. But that doesn't mean that when we study runtime we should study it in terms of how low we get F. I don't really care about getting my optimization objective really, really low; what I care about is getting my predictive error very, very low. So what we'll do is study optimizing F in terms of how well we do on the error. In particular, we're going to look at how the computational cost of optimization increases as a function of the desired generalization performance — not in terms of epsilon on F, how accurately I solve F, but in terms of how low I want to drive my error. It's going to increase in terms of how hard the problem is, in terms of how small the margin is. But it's going to decrease as a function of the dataset size. Let's see how we might hope to get that. In order to do so, it will be useful to look at the error decomposition. I'm sure many of you have already seen the standard error decomposition in machine learning. Usually in learning tasks we decompose the error into, first, the approximation error. This is the error that says: even if we had tons of data, this is the best error we could get with a hypothesis in our hypothesis class. In our case you can think of it as the best error achievable by a large-margin separator. It's actually a bit trickier here, since we're talking about a regularizer and not strictly limiting the hypothesis class, but we can define things appropriately: it's the error achieved by the population minimizer — if we had infinite data, what would our error be. This is in some sense the unavoidable error. On top of that we have the estimation error. The estimation error results from the fact that we don't have infinite data; we're doing everything based on an estimate of the error, namely the training error. This is the standard decomposition that I'm assuming many of you are familiar with. But really, this decomposition refers to W star, the mathematical minimizer of the SVM objective — and we never have the mathematical minimizer of the SVM objective. After all, we run some SVM optimizer; it runs for some finite time and gives us some predictor which is only finitely close to the true, mathematically defined optimum. So really what we have to add here is another error term, the optimization error: the difference between the error of the mathematical optimum and the error of what our algorithm actually returns. And I should point out that a very similar decomposition was suggested, and presented to us about a year ago, by (inaudible). Okay. So this is the decomposition. Let's think about what happens when I have more data, when the training set size increases. When I have more data, the approximation error is the same, but the estimation error decreases, right? The estimation error depends on the dataset size; the more data I have, the better I can estimate, and the estimation error goes down. But recall that our target here is some fixed target performance, our desired 2.3% error. If the estimation error decreases — if in some sense the mathematically defined optimum is better — that means I can afford to do a much worse job optimizing. I don't have to optimize as well.
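Spelled out, the three-term decomposition being described is the following identity, where err denotes expected prediction error, w_pop is the population (infinite-data) minimizer, w* is the exact minimizer of the empirical SVM objective, and w-tilde is what the optimizer actually returns:

```latex
\mathrm{err}(\tilde{w}) \;=\;
\underbrace{\mathrm{err}(w_{\mathrm{pop}})}_{\text{approximation}}
\;+\;
\underbrace{\bigl(\mathrm{err}(w^{*}) - \mathrm{err}(w_{\mathrm{pop}})\bigr)}_{\text{estimation}}
\;+\;
\underbrace{\bigl(\mathrm{err}(\tilde{w}) - \mathrm{err}(w^{*})\bigr)}_{\text{optimization}}
```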
I have so much data that, if I really optimized all the way, maybe I could get lower error — but I can afford to just do a bad job optimizing. What this means in particular is that I can take less time optimizing. If I have an iterative optimization method, as most of ours are, I can run for fewer iterations. That's good: it means lower runtime, which is what I want. However, with most optimization approaches the situation is a bit trickier, and increasing the dataset size is really a double-edged sword. On the one hand I can run for fewer iterations, I can do a worse job optimizing; on the other hand, each optimization iteration takes much more time. In traditional optimization approaches, each iteration looks at the entire dataset, maybe does something complicated with the entire dataset, so each iteration by itself scales linearly, maybe quadratically, with the amount of data. Overall I'm not going to win here, at least not win much: fewer iterations, but each iteration is much more expensive. This is why we don't get this type of behavior with standard approaches, but we do get it if we move to stochastic gradient descent. Last year we presented a specific stochastic gradient approach, which I'll present on the next slide, called Pegasos. The important point about Pegasos — or any stochastic gradient descent optimizer, for that matter — is that in each iteration you only look at a single example, so the runtime of each iteration does not increase with the sample size. So once we can afford to do a worse job optimizing, we need fewer iterations, the cost per iteration is fixed, and we really do get a gain in runtime. So let's look at Pegasos. Pegasos is a fairly straightforward stochastic optimizer. At each iteration we pick one random training example from our training set and look at our objective restricted to that example — the error on that example plus the regularization term. We take the gradient (actually a subgradient, but that's not important; think of it as the gradient) of this function, and take a step in the direction opposite the gradient, with a predetermined series of step sizes that behave roughly as one over T, one over the iteration number. So each iteration takes smaller and smaller steps. That's all. Each iteration is very simple: take a random example, compute the gradient — which requires looking only at that example — and take a step in that direction. So the cost of each iteration scales linearly with the representation of a single example. The major result here is that in order to get within epsilon of the optimum of the objective function — and again, this is now epsilon on the optimization objective — we only need on the order of one over lambda-times-epsilon iterations. So the number of iterations does not depend on the size of the training set. Which means that, as I said before, since the cost of each iteration scales only linearly with the size of each example — and if it's sparse, even better, with the size of the representation of each example — the total runtime for Pegasos to get within epsilon on the optimization objective does not increase at all with the size of the training set. It's just the dimensionality of the data, or the sparsity of the data, divided by lambda — the regularization parameter — times epsilon.
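A minimal sketch of the iteration just described, in the linear case; this follows the update as stated in the talk (one random example per step, step size behaving as 1/(lambda*t)) and omits refinements such as the optional projection step in the published algorithm:

```python
import numpy as np

def pegasos(X, y, lam, n_iters, seed=0):
    """Sketch of the Pegasos stochastic subgradient step for a linear SVM.
    X: (n, d) array of feature vectors, y: array of labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)            # one random training example
        eta = 1.0 / (lam * t)          # predetermined step size ~ 1/t
        margin = y[i] * X[i].dot(w)
        # subgradient of  lam/2 * ||w||^2 + hinge loss on example i
        w *= (1.0 - eta * lam)
        if margin < 1.0:               # example violates the margin
            w += eta * y[i] * X[i]
    return w
```

Each iteration touches a single example, so its cost scales with the size of that example's representation, not with the number of training examples.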
So what we see here is that we have no runtime dependence on the training set size. I promised you runtime decreasing with the training set size; we'll see that shortly, because remember this is the runtime dependence for epsilon being the optimization epsilon. Before going on, it is important to realize that Pegasos is also a good method in practice — I'll quickly flash a slide with some results. The main point is that on some standard large, sparse SVM training benchmarks, Pegasos gets significantly better performance than previous methods. But this is really not the major point I want to talk about. So let's move back to actually comparing runtimes. If we look at the runtime of Pegasos compared to other methods — and I'm again talking about the traditional optimization runtime comparison — it's actually a bit difficult to compare. On the one hand, Pegasos has no scaling at all in the training set size, while other methods have linear or even quadratic or worse scaling. On the other hand, Pegasos has a very bad dependence on the optimization accuracy, on the epsilon that measures how close I get to the optimization objective, whereas other methods have only a logarithmic or maybe double-logarithmic dependence. And also lambda, the regularization trade-off parameter, comes into the runtime, and I don't know exactly how to think about that. Is this better or is this worse? I don't know. And from my perspective this is, in a sense, because — as we said before — this is really a runtime analysis in terms of the wrong parameters. These are not parameters of the learning problem. The optimization accuracy is a parameter of the optimization problem, not of the learning problem; it's not something I care about. I care about what error I'm going to get. The trade-off parameter lambda is some mechanism I use in order to balance the margin and the error; it doesn't really tell me how hard or easy my problem is. So I'm still going to show you how we get a decreasing dependence of runtime on data. But before we do that, let's just try to understand these runtimes — that will be helpful for seeing the decreasing behavior too — and let's try to understand them, as promised, in terms of the parameters of the learning problem: in terms of the desired generalization performance, the desired error — not the optimization epsilon, but the actual error I'm going to get on future predictions — and in terms of the inherent hardness of the problem: how big is the margin, how noisy is it? In order to do that, we're going to look at the following situation. Suppose there is some large-margin predictor. I don't know it; I want to learn it. We're assuming that it is possible to separate the data with large margin and low error: in particular, there is some predictor W-naught that has low norm — corresponding to a large margin — and small error. And what we're going to ask is: what is the runtime needed to find a predictor W whose error is almost as good as that of this oracle predictor, which we know exists but, of course, don't know? We're going to study the runtime required to find such a W as a function of this epsilon and of the margin. Now, again, this epsilon is very different from the optimization epsilon. This epsilon talks about the quantity that really interests me.
The predictive error. Okay. In order to do that — I'm going to go fairly quickly through this analysis — we can decompose the error of W. This is roughly the decomposition we had before, except that we're comparing to the error of W-naught. So we decompose it into the error of W-naught, plus — and then we're adding and subtracting the expected objectives. What this boils down to is: we have the error of W-naught; another term which is roughly the approximation error and depends on the norm of W-naught; plus these two terms, which are the difference between the actual SVM objectives of the two predictors, and a term that measures how quickly the empirical SVM objective approaches its expectation. Roughly speaking, this is an approximation error, this is an optimization error, and this is an estimation error. The details here are not so important. What is important is that I can bound this term here. What is this term? It asks by how much the SVM objective at W is bigger than the SVM objective at W-naught. Now, if I actually optimized my objective exactly, I would find a W that minimizes the SVM objective, so this term would never be positive. However, I never completely optimize it; I only optimize it to within epsilon. So I can bound this term in terms of my optimization accuracy — this epsilon is the epsilon that measures how close I am to the optimum of F. Now recall what we wanted: a predictor W with small overall error, in particular error equal to that of W-naught plus epsilon. To get that, we certainly have to bound each of these three terms by something of order epsilon, because we want their sum to be of order epsilon. And from each of these three inequalities we can extract bounds on what lambda has to be, what the optimization accuracy has to be, and how big a dataset we need. What we see is that in order to get the predictor we want, we need lambda to scale roughly as epsilon over the norm of W-naught squared — that is, epsilon times the margin squared; the optimization accuracy to be roughly of the same order as the epsilon we want; and the dataset size to scale in the familiar way, growing as the norm of W-naught squared over the target accuracy epsilon squared. What we can do with this now is go back to our traditional runtime analysis, which was in terms of the dataset size, the optimization accuracy, and lambda, and plug in these requirements. If we do that, we get a runtime analysis which is now in terms of the norm of W-naught squared — which you can think of as one over the margin squared; the larger the norm of W-naught, the smaller the margin and the harder the problem — and in terms of epsilon, our epsilon on the error: how close we are to the best predictive error we can get. So we got what we wanted: a runtime analysis in terms of the actual parameters of the learning problem — how hard the learning problem is and how accurate a solution we want to it. And what we can see from this analysis is, in fact, that Pegasos has a much better runtime guarantee than other methods.
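Putting those requirements together with the per-iteration cost from before gives, up to constants and log factors (this is my reading of the bounds as described, with the norm of w_0 playing the role of one over the margin):

```latex
\lambda \;\sim\; \frac{\epsilon}{\|w_0\|^{2}},
\qquad
\epsilon_{\mathrm{acc}} \;\sim\; \epsilon,
\qquad
n \;\gtrsim\; \frac{\|w_0\|^{2}}{\epsilon^{2}}
\;\;\Longrightarrow\;\;
\text{runtime} \;\sim\; \frac{d}{\lambda\,\epsilon_{\mathrm{acc}}}
\;\sim\; \frac{d\,\|w_0\|^{2}}{\epsilon^{2}} .
```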
It scales only quadratically, in both one over the margin and one over the desired accuracy, whereas other methods have fourth-order or even seventh-order scaling. So what I hope I've convinced you of so far — well, maybe only hope to convince you of — is that this is a better way to look at the runtime of SVM optimization, and it provides a better way to compare different methods in terms of how well they actually deliver what we want, which is a good predictor. Looking at it this way, Pegasos really does have significant benefits over other methods, which also explains our empirical results. But notice that in doing all this analysis, I got to choose the dataset size — I chose whatever dataset size I wanted. What does that mean? You can think of it as corresponding to a situation in which I have unlimited data: I can always ask for more data, or I have an infinite data source, and I choose the dataset size that exactly suits me and gives me the error that I want. Okay? You can think of this as a data-laden analysis: this is the absolute minimum runtime you need if you had access to unlimited data, so that you could choose to work with however much data you want. But in reality we usually don't work in this data-laden regime; we have some limited dataset, and that's what we have to work with. So now let's go back and look at how runtime scales with the amount of data — at runtime as a function of the training set size. What the data-laden analysis tells us is that if I have enough data, this is the minimum I can hope for: the minimum runtime given however much data I want. Now, of course, it's not the case that with any amount of data I can get this runtime. In particular, remember that this analysis is of the runtime to get a fixed predictive accuracy — to get 1% predictive error, say. So there is some limit on how much training data I definitely need, and this limit comes from the standard statistical analysis relating sample size and predictive accuracy: in order to get my 1% error I definitely need some minimal number of examples, 100,000 examples or whatnot. If I have fewer examples than that, it doesn't matter how hard I work, I'm never going to find a good predictor — statistically it's impossible. So this is my statistical limit; it measures just how many samples I need. If I'm to the right of this blue line, to the right of this limit, then if I work hard enough — if I really optimize that SVM objective all the way down — I'll get the error that I want. But it's still very difficult, because if I'm just to the right of this limit, I really do have to optimize my objective all the way down and do a really good job optimizing, which will take a lot of time. What happens is that as I move past this limit, with more and more data, I can do a worse job optimizing and gradually decrease the runtime. And we can analyze this exactly. Looking at the error decomposition, this point is exactly the point at which the approximation error and estimation error by themselves already account for all the error I'm willing to tolerate, so there is absolutely no room for optimization error — I have to drive the optimization error to zero. As my dataset size increases beyond this point, there is more and more room for doing a worse job optimizing and taking less time.
So in the case of Pegasos, at least, we can do this analysis exactly: essentially take the calculation from before, plug it into the actual expression for the optimization error of Pegasos after a fixed number of iterations, and get an expression for the runtime when we have only a limited amount of data of size N (the dataset size enters through a square-root term). And what we get is an expression that behaves the way we wanted it to behave, the way it makes sense for runtime to behave. It increases when epsilon is small: when I want small predictive error I need to run for longer, and we're fine with that. It increases when the margin is smaller: the smaller the margin, the more time it takes, and we're fine with that too — we know that's going to happen. But it decreases with the size of the dataset: the larger the dataset, the less time I actually have to run. And this function, which is known theoretically, is shown here as this decreasing curve which bridges between the statistical limit and the minimal, data-laden runtime. Now, this is all just analysis of an upper bound — it gives me an upper bound on the runtime — and the question is whether this actually holds in practice. And we do observe this behavior empirically. This plot shows the runtime of training on the Reuters dataset, a fairly large standard benchmark for SVM training. We set the target error level to 5.25%, which is slightly above the best error you can actually get on this dataset, and we look at the runtime needed to reach this accuracy when training on only a subset of the Reuters dataset. What we in fact see is that, as predicted by the theory, when we have more data the runtime actually decreases. And at least for this specific example, we need only about double the amount of data past the point where you have just enough data to get the 5.25% error — past the statistical limit. You need about twice as many examples, that's all — not ten times as many — in order to be able to reduce the runtime, or rather to get down to the (inaudible) runtime. In particular, with about twice as many examples we reduce the runtime by about seven-fold: twice as much data means you run for seven times less time in this particular example. So we see here this behavior of decreasing runtime as the data size increases — the behavior that we argued one would expect. As far as I know, this is the first presentation of it: both a theoretical analysis of how and why we would get it, and an empirical study showing that we actually get it in practice. And it's important to note, again, that we're getting this decrease with the best method for this dataset. Okay. Now, I want to go back and correct something I said, because the situation is a bit more subtle. What actually happens as we increase the training set size is that not only do we increase the optimization error, we also increase the approximation error. When we have more and more data, the estimation error decreases, so we could have just kept the approximation error fixed and increased the optimization error — but it's actually better to also increase the approximation error.
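A hedged sketch of the kind of experiment behind this plot; the solver, the stopping rule, and the data loading are placeholders rather than the actual experimental code:

```python
import random
import time

def runtime_vs_training_size(examples, sizes, train_until, target_error):
    # Hypothetical harness: for each training-set size, time how long the
    # solver needs before its held-out error reaches the target (e.g. 5.25%).
    results = []
    for n in sizes:
        subset = random.sample(examples, n)
        start = time.time()
        train_until(subset, target_error)  # placeholder: run the solver
                                           # until the target error is hit
        results.append((n, time.time() - start))
    return results
```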
What do I mean by increasing the approximation error? For a second, ignore what's written here. Intuitively, the approximation error measures the error of the best hypothesis in a fixed hypothesis class; now, instead of that fixed class, I'm going to consider a smaller hypothesis class. That means I'm making my life harder — I have less flexibility in choosing a predictor — which means I'm going to have a higher approximation error; I'm going to do a worse job approximating my target. But the hypothesis class is smaller, which may mean it's easier for me to run on it — lower runtime, which is what I want here. So that's what's going on. Here the situation is a bit more subtle: what controls the approximation error is lambda. When lambda is very large, I'm very restricted in what predictors I can choose and I have a large approximation error. When lambda is very small, I'm very unrestricted and I have a low approximation error. So in particular, what happens here is that as we increase the training set size, we can increase lambda. For those of you who are familiar with results on consistency, or with learning theory generally, you're used to thinking of decreasing lambda as the training set size increases; the reason we decrease lambda is to decrease the approximation error, because that's what we do if we want to drive the error to zero. But here what we care about is driving the runtime down, within our budget of how much error we allow, and the best way to do that is actually to increase lambda — which means doing more regularization but taking less runtime. Okay. So back to the main story. I have 20 more minutes or so. Before going on: we said that for stochastic gradient descent, for Pegasos, we get this decreasing behavior. We can also ask what happens for other methods. We can repeat the same analysis for dual decomposition methods such as SMO, and for SVM-Perf. What we see in those cases is that the runtime increases as the dataset size increases. If I increase the amount of data on which I run the algorithm, at first I get a very sharp decrease — when I go from allowing no optimization error at all to allowing a bit of optimization error — but then very quickly the runtime increases. In particular, there is some optimal dataset size, and going beyond it only does me harm: running the optimizer on larger and larger datasets takes more time. Which is maybe not surprising — it matches the standard analysis of these methods, and it's what we're used to: if you use more data, it takes more time. But again, this type of behavior is in some sense wrong. The same thing happens with SVM-Perf, which is a cutting-plane method; it's not quite as bad. We also see these behaviors empirically. Here we're comparing these two methods: for the dual decomposition method, SMO, we see a sharp increase — it's very difficult to detect the initial decrease. And for SVM-Perf specifically, a cutting-plane method which has a much nicer dependence on the dataset size, at least according to the traditional analysis, we do see an initial decrease, but then the runtime starts increasing again.
This is as opposed to stochastic gradient descent, where the runtime just continues to go down. Okay. I'll skip this. So let's summarize what's going on here. First, I have to make a confession. I talked all this time about SVM optimization, but everything I said about Pegasos holds only for the so-called linear kernel — for the problem as I presented it on the first slide, where we're explicitly given feature vectors in R^d. In that case we get the performance guarantees I described and everything is fine. You can also run Pegasos, run stochastic gradient descent, when you use actual kernels — nonlinear kernels, the more familiar application of SVMs — but then things become a bit trickier. In particular, although you can run Pegasos and get the same guarantees on the number of iterations, the cost of each iteration does increase with the amount of data, because you have to represent everything in terms of the support vectors, and the number of support vectors can increase, and you get into trouble. So one thing that would definitely be interesting is to get this same type of behavior for truly kernelized SVMs. We're not yet able to do that, although I do believe it is possible. The other thing is that the decrease we saw in the runtime is really a result of the fact that we can increase the optimization error, and maybe also the approximation error. It's playing on this error decomposition, and there's a limit on how much you can gain from that. In particular, the theoretical analysis is almost tight: without assuming anything else, you cannot hope for a much better decrease in runtime beyond what we're getting, at least theoretically. It would be very interesting, and I think very possible, to try to leverage the extra data more directly. If you really have tons and tons of data, the answer is just sort of in your face — it should really help you much more significantly to reduce the runtime. It might be possible to do so empirically, and maybe also to have theoretical analyses based on more refined assumptions. The theoretical analysis here is based only on assuming that there is a large-margin predictor. Maybe if you make a more detailed oracle assumption — assume something about what that large-margin predictor looks like, or that you have many (inaudible) predictors — it might be possible to get an even better decrease in runtime and more directly leverage the excess data. But going beyond SVMs, the basic story I told you is valid for essentially any learning method, not only SVMs. If you have more data, runtime should not increase — and, in fact, it should be possible to decrease runtime when you have more data. I'm looking forward to seeing optimization approaches for other methods studied in this way, with runtime as a decreasing function of the data, or at least as a function of the true parameters of the learning problem rather than of the optimization problem. And hopefully doing that will also bring out optimization approaches that actually behave this way and are more appropriate for learning. Okay. So this sums up the first half of the talk, which is about SVM optimization.
What I'm going to do now is switch to something that will seem like a completely different topic, but hopefully by the end you'll see that it's not actually a completely different topic: looking at the computational hardness of clustering. What we saw in the first half of the talk is that the SVM problem is in a sense always easy — it's a convex problem. The question is only whether we take a day or an hour; it's just a matter of how we scale in the parameters. Now let's look at clustering, and to keep things very simple I'm talking about the simplest form of clustering: I have data generated from a few Gaussian clouds, and what I want to do is reconstruct the separation into those clouds, into those Gaussians. We can approach this by maximizing the likelihood of the Gaussian mixture model. And this problem — it's not too difficult to show — is hard in the worst case, which means there are some inputs on which I can't efficiently maximize the likelihood. If you prefer to think of it in terms of the k-means objective, it's similar for k-means clustering: minimizing the k-means objective is hard; you can't have an algorithm that minimizes the k-means objective for every input. But this statement is almost meaningless, because the instances on which you can actually prove hardness — the instances you get in the reduction — are instances in which there really isn't any clustering. They're a bunch of points arranged in a very specific way such that there is no good clustering. Sure, there's one clustering that has a slightly less bad objective value than all the others, and finding that slightly-less-bad clustering is hard — but who cares? It's not going to give us an actual clustering. This situation doesn't correspond to a situation in which there are clusters and we actually want to reconstruct them. On the other hand, there are results that say that if you do have a very distinct clustering — if the clusters are very, very well separated — then it actually is easy: we do have algorithms with guarantees that reconstruct the clustering. These guarantees come from a long series of papers, starting with the work by Sanjoy Dasgupta in 1999 and leading to progressive improvements, handling more complicated situations and reducing the required amount of data. But all of these results require the clusters to be extremely well separated and require lots and lots of data. Maybe more importantly, the practical experience has been that if the clusters are even just reasonably distinct — they don't have to be very well separated — and you have enough data, you just run EM or your favorite local search method and it finds the clustering. So, again: this is if there really is a distinct clustering and if there is a lot of data. If there isn't a distinct clustering, or there's too little data to identify it, it's not really interesting to cluster. If I have too little data, the problem is not a computational problem, it's a statistical problem — I cannot reconstruct the clustering no matter what I do. So with too little data I can't do anything anyway, and with a lot of data it's computationally easy.
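For reference, the two closely related views mentioned here are maximum likelihood for a mixture of spherical Gaussians and the k-means objective — the latter corresponding to the hard-assignment, small-variance limit of the former:

```latex
\max_{\pi,\mu}\;\sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_{k}\,
\mathcal{N}\!\bigl(x_{i}\mid \mu_{k},\,\sigma^{2} I\bigr)
\qquad\text{vs.}\qquad
\min_{\mu_{1},\dots,\mu_{K}}\;\sum_{i=1}^{n}\;\min_{k}\,\|x_{i}-\mu_{k}\|^{2}.
```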
So this leads to a general saying, which I think many people subscribe to, that clustering isn't really a hard problem: either it's easy — you have lots of data, you use EM and find it — or, if that doesn't work, it's probably not interesting; there's probably no clustering there to find in the first place. What we're going to check is whether this saying is correct or not — and in particular, we're going to claim that it is actually not the case. Let's look at the situation in terms of cartoons of the likelihood, revisiting the situations described on the previous slide. If I have a lot of data — and you can think of this as a cartoon if you like — the answer is just there: there's a huge peak of likelihood at the correct model, it's very easy to see from far away, and it's computationally easy to find. If I don't have enough data, then it's true that around the correct model there is a peak, but there are also lots and lots of other peaks of the likelihood, and various other models that have nothing to do with the correct model also have high likelihood; statistically it's just impossible to discern the correct model from models that are incorrect. And what we're going to claim is that there is actually an intermediate regime, where there is just enough data so that statistically the correct model is identifiable — the peak at the correct model is higher than all the other peaks — but computationally it is hard to find: it's not distinct enough to make the problem computationally easy. In particular, we're going to look at the informational limit of clustering — how much data you need for the clustering problem to be informationally possible, regardless of computation; this is similar to the statistical limit we had before — and at the computational limit, which is how much data you need for the problem to be computationally tractable. I wish I could give you an analytic analysis of these two limits and prove to you where they are. I can't do that. What we did instead was a very detailed empirical study. I'm not going to have time to go into its details — it appears in our ICML paper from two years ago, ICML 2006, or you can ask me about it later. Essentially, we did an empirical study trying to quantitatively evaluate these two limits. What we have are several million results that look like this: as a function of the dataset size, we see the error of the clustering — how close we are to the correct clustering (it's all simulated data, so I know the correct clustering) — for the maximum likelihood solution, or what seems to be the maximum likelihood solution, and for what we get with local search. And what we can see here is that we can identify two fairly distinct phase transitions. We have the informational limit, below which maximum likelihood is meaningless: if you have less data than the informational limit, then even if I could find the maximum likelihood model, it just has random error — it has nothing to do with the correct model. Beyond this point the maximum likelihood model is good, or at least starts to become better. And what we can see is that if we have enough data — in this case more than about 4,000 examples — just local search, EM (actually EM initialized with PCA, but essentially a very simple method), does find the correct model.
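A minimal sketch of this kind of simulation, using scikit-learn's EM implementation as the local-search step; the mixture parameters and the error measure below are illustrative stand-ins, not the setup from the actual study:

```python
import itertools
import numpy as np
from sklearn.mixture import GaussianMixture

def clustering_error(true_labels, pred_labels, K):
    # Clustering error up to relabeling: best agreement over permutations.
    best = 0.0
    for perm in itertools.permutations(range(K)):
        mapped = np.array([perm[p] for p in pred_labels])
        best = max(best, np.mean(mapped == true_labels))
    return 1.0 - best

def error_vs_sample_size(means, sigma, n_values, seed=0):
    # Simulated spherical-Gaussian mixture, so the correct clustering is
    # known; run EM (local search) and see how its error behaves as n grows.
    rng = np.random.default_rng(seed)
    K, d = means.shape
    out = []
    for n in n_values:
        z = rng.integers(K, size=n)
        X = means[z] + sigma * rng.standard_normal((n, d))
        gm = GaussianMixture(n_components=K, n_init=5, random_state=seed)
        out.append((n, clustering_error(z, gm.fit_predict(X), K)))
    return out
```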
So there the problem is computationally easy. But there's a very wide regime between the two limits — in this case, about a three-fold increase in the amount of data: I need about three times as much data for the problem to become easy. And in all of this regime, if I have between roughly 1,200 and 4,000 examples, the problem is actually interesting: if I had infinite computational resources I could get a meaningful clustering, so the problem is statistically interesting, but computationally it appears to be hard — EM fails, and we don't have any other methods that reconstruct the correct clustering. How does this relate to what we talked about before? If we look at the runtime required to find the correct clustering as a function of the dataset size, we can make a plot similar to the one before, though it looks a bit different. We have the informational limit we had before: with less than this much data we're lost, no matter what we do — the problem is statistically impossible. Once we cross it, the problem becomes statistically possible, but at the cost of very difficult computation; in particular, it's possible if we enumerate over all possible clusterings, an exponential amount of computation. And you can think of this runtime as decreasing, where the decrease is not the gradual decrease we had before but a drop: at some point we can start using methods that are much more efficient — polynomial-time methods, just local search — and the problem becomes tractable. So we again see a decrease in runtime as the dataset size increases, but in this case it is not a gradual decrease but a drop, from being intractable to being tractable. And we can actually study the width of this region — this is what I call the cost of tractability: how much more data do we need for the problem to become tractable? What's going on here? We can study it. I'm not going to go too much into details; generally, you can see here what the informational limit and the computational limit are as a function of the dimensionality and as a function of the number of clusters. In both cases you see a linear increase — both limits increase linearly with the number of clusters — which means that the real issue is the dataset size relative to the dimensionality and the number of clusters. So we can look at this plot, which shows the sample size per dimension, per number of clusters, required for reconstruction, as a function of the separation between clusters. Of course, the more separation we have, the less data we need — the problem is easier. But more interestingly, what we see is that this gap between the informational limit and the computational limit — this cost of tractability, how much more data we need to solve the problem tractably — actually increases when we have a larger separation. This is maybe counterintuitive at first, and the opposite of what the theoretical results say — that if we have a large separation then things are computationally easy. What we're seeing here is the opposite: when we have a large separation, sure, in absolute numbers we need fewer samples, but the gap is much bigger. With more separation, in a sense the computational aspect of the problem becomes harder relative to the statistical aspect of the problem. Okay.
I'm not going to spend too much time on that — this is what I said before. To sum up: I hope that in both parts of the talk — and now maybe you see the connection between them — we saw this relationship between data and computation. The standard view of the relationship between data and computation, in computer science and in the traditional study of machine learning, is that if you have more data, you have to do more computation. This is reflected in the way we study algorithms and in traditional notions of complexity: we look at how badly the runtime scales up when we have more data — does it go up only quadratically, or does it go up exponentially — but it's always how much it goes up. Unfortunately, in machine learning this is often the way we think about problems too. We say: oh my God, I have 10 million examples, how will I ever be able to crunch that? We expect computation to increase when we have more data. But in machine learning this relationship should really be reversed. The more data we have, the better off we are. We should be happy that we have more data, not sad. Having more data should mean that we can do less work; we're using data, in a sense, as a replacement for work. Which really means that we should study runtime not as an increasing function of the data but as a decreasing function of the available data — and instead as an increasing function of the true parameters of the machine learning problem: how hard the problem is (for example, the separation between clusters, or the margin, or any other parameter that really measures the hardness of the problem), and how good a solution we want, in terms of the desired error. Maybe before I go on — this is an important point. I talked about SVMs and about clustering, and I want to go back a bit to the clustering. I talked here about clustering, but really this type of behavior — this phase-transition type of behavior — seems to appear in many other problems; clustering is just the problem I have studied the most, with a very detailed empirical study. I want to point out several other problems where, again, we see this transition from the problem being intractable when we have only a limited amount of data to being tractable when we have tons of data. This happens for learning the structure of dependency networks: we know the problem is hard in the worst case, but it seems to be easy when you have tons and tons of data — so much data that we can measure mutual informations very precisely. The situation there is not completely resolved yet, but I'm fairly sure it can be shown — hopefully will be shown by somebody — that it actually is easy in every case when we have enough data. Something similar happens for (inaudible) partitioning, which is similar to clustering, and also for solving discrete linear equations and related problems: the problems are hard in the worst case, but it seems that when we have tons and tons of data the problems become easy. And an interesting thing to study is how a problem goes from being hard to being easy, characterizing the width of the hard region — the cost of tractability — and establishing that there even is a hard region at all. But going back here — oops.
What we see, in many problems — continuous problems such as SVMs, and combinatorial problems such as clustering and the others we saw — is that the correct way to look at the problem is to study how computation goes down with more data. And this is something that, unfortunately, we don't have a very good understanding of. We have a very good understanding of how error goes down with more data: there are decades of research in statistics, and also in statistical learning theory, that tell us how much data we need to get a certain error, how the error decreases when we have more data. But especially in today's world, where we often have access to massive amounts of data, it can often be more useful to use the excess data not to get a 0.001 decrease in error but rather to get a significant decrease in runtime. And that's something that is very interesting to study, both from an empirical and a theoretical perspective. I just want to remind you again that specifically for SVMs with linear kernels we sort of solved this, by showing that stochastic gradient descent has this behavior, and I'm looking forward to seeing methods with such behavior for other continuous problems like SVMs — again, the behavior on the left — and to understanding and precisely characterizing the behavior that seems to happen in many combinatorial problems. Okay. So I think that's it. >> Ofer Dekel: Questions? >>: So one thing that is sort of missing is the characterization of the meaningful minimum error. In a lot of problems, the reason we care about more data is that that actual 0.001 matters, because we're talking about a class that's present 1% of the time, or 1% of 1% of the time. And subsampling data means throwing it out; in a multi-class problem, if you have a power-law class distribution, subsampling means you're throwing out the majority of your classes. You're getting the big class right, but who cares — that part is easy. >> Nathan Srebro: So the analysis we present — if we go back to the theoretical analysis, for example — definitely includes that error term in it; that error term appears here. And if you really care about that 0.001, it means that you really need to drive this lower, and you're going to pay for that in terms of runtime. So this is consistent with the analysis. It may mean that you need a smaller target error, but for whatever error you are willing to tolerate, for that error, if you have more data, the runtime should decrease. Now, if you're talking specifically about a problem where you say: my expected error over all examples doesn't really measure what I want, because it's dominated by the big class, and what I really want to do is find the rare things — in that case, what you should do is look at a different error measure, the error measure that you do care about. For example, something in terms of precision and recall, or whatever other measure really captures those rare things, and do a similar study for that error measure. The study presented here is for using SVMs for a standard prediction task where what I care about is the expected error. But the basic idea and the basic concept are still valid: whatever error measure you care about — an expected-error measure, a precision-recall measure — you should still expect to get this kind of behavior.
The specifics of the math that I showed would not apply to your setting — they're specific to this SVM setting. But again, I believe that if you dug into it enough you would be able to derive the math specific to your setting. Does that answer the question? >>: I guess what I'm saying is that I'd rather have a characterization of how much error I should care about, because I'm skeptical of deriving it for things like other measures — I continuously see (inaudible) work on error instead of other measures, presumably because that's what's easy to prove, whereas what actually matters in practice are these other measures. >> Nathan Srebro: Okay. >>: I was asking if you had some insight into how to get a characterization of how much error I should care about, given things like class frequencies or other variables, because that's why you usually retain all of the data. >> Nathan Srebro: So my answer to that would be that the question is not how much error you should aim for in that case; in that case, just looking at the error is the wrong thing — you should be looking at other measures. And I think it's true that theoretical work probably gives an overemphasis to very straightforward error measures because they're much easier to work with. Nevertheless, there is theoretical work on things like precision-recall and other measures that do pay attention to rarer events. I'm not familiar with specific things to point you to, but again, my basic answer is that you should study performance in terms of the error measure that you actually care about, the one suited to your particular need or application, rather than asking what very, very small error is going to guarantee what you want. >>: Here's another — I want to restate what you're saying in a different way, and you can correct me if I'm wrong. Would it be true to say that, essentially, if you have a large dataset and you're sampling from it, which is what Pegasos does, then you're going to get an essentially unbiased sample of the real world — because if you have a large enough dataset, you actually have the real world out there that you're sampling from — whereas if you don't do sampling, if you just take everything into account, you're just going to increase runtime (inaudible)? So essentially, when you have a small dataset you have a potentially biased sample of the world, whereas if you have a large dataset which you sample from, you get something closer to an unbiased sample. >> Nathan Srebro: So definitely at least the second thing you said exactly characterizes why Pegasos gets this gain. Pegasos is a stochastic gradient method, and with stochastic methods what we do is: we have a big bucket of examples, and at each iteration we pick one example from the bucket, do a tiny computation with it, put it back, pick another example, and repeat. The best thing we could hope for is that the bucket gives us fresh new samples from the world all the time. That's not quite the situation — with a training set we're given a fixed bucket — but the bigger that bucket is, the better it represents the world. And we're not going to pay for that in terms of runtime; the runtime stays the same, but other things get better: we get better samples at each iteration and can take a more effective step at each iteration. So that's definitely true.
Comparing that to batch methods is a bit trickier. Stochastic methods generally seem to have dominant performance in machine-learning-type applications because, in a sense, we don't care about getting very small optimization error — and it's very difficult to get very small optimization accuracy with stochastic methods. The convergence rate — if I were giving this talk at an optimization conference, I would be kicked out of the room. I don't know if there are any OR people here feeling like, what is he doing? Because really, if we go back to the runtime guarantees, getting a performance guarantee which scales as one over the optimization accuracy is horrible. So from a pure optimization perspective, stochastic methods are problematic. The reason I think stochastic methods are better than batch methods — and I do think this is a general principle, that stochastic methods are preferable to batch methods in a learning setting — is that we really don't care about getting very small optimization error. There's no point in driving the optimization error much below the error you have anyway because of estimation error and approximation error. And I'm not sure about this — it's a more subtle issue, because in some cases batch methods are better, and it's not so clear cut. But in learning settings, this type of analysis shows that stochastic methods dominate batch methods. There is also a nice analysis by (inaudible) for unregularized linear prediction in low dimensions, which also looked at the data-laden regime in that case and got results showing that stochastic methods seem to be particularly appropriate for machine learning applications because of that property. >>: How do you think about (inaudible)? >> Nathan Srebro: I'm sorry? The data? >>: If you have incorrect labels. >> Nathan Srebro: That comes out in the analysis. Note that we're saying we have some predictor that has a large margin but also has some error, so we are allowing there to be noise in the system. Of course, the more noise you have in your labels, the more noise you're going to have in your answers. >>: (Inaudible) I don't see how you can get enough in, because — >> Nathan Srebro: You can. That's my problem. We can kernelize Pegasos, but we don't get the same runtime guarantees, because we do get a runtime dependence on the dataset size. >>: So are they working (inaudible) transition, if there is any, or knowing what your (inaudible)? >> Nathan Srebro: My guess is that you should limit the number of support vectors you have. You should kernelize it, but if you kernelize in a straightforward way, you will have this dependence. So you should kernelize but be very reluctant to add new support vectors. I don't know how to do this — I'm going to say I don't have an answer here for the kernelized case. Straightforward kernelization has a runtime dependence on the dataset size. I strongly believe it's possible to come up with a natural method that would have similar scaling even in the kernelized case, but I don't know how to do it. If you have a good idea for doing it — >>: Some of the problems that are based on inference — the assumption there is that if you have more data, the posterior appears to be normal because of the (inaudible). And that kind of fits in with the (inaudible): the problem becomes easier because, with a normal distribution, finding the model is much easier.
>> Nathan Srebro: I'm not sure — maybe we can talk about this offline. I'm not sure I agree with that because, for example, in all these problems you have multiple modes, and the posterior is only normal around each mode. The real problem is finding the mode. >>: (Inaudible). >> Nathan Srebro: Well, usually that's the hard part. The combinatorial aspect of the problem — finding the correct mode — is the hard part. >>: That's the big part. But take, say, the Bayesian approaches for the Gaussian, for instance. >>: The convex problems. >>: Not convex problems, but these iterative ones. The more successful approaches are the ones that approximate the posterior by a Gaussian, so you can actually prove the posterior (inaudible) Gaussian model. Again, I'm saying that that kind of fits in from a Bayesian-analysis perspective, rather than from an optimization perspective: as you see more and more data, your posterior becomes easier to estimate, easier to approximate. >> Nathan Srebro: Maybe we can talk about this. Other questions? (Applause)