>> Ofer Dekel: So it's our pleasure to host Nati Srebro today from TTI in Chicago. Nati earned his Ph.D. at
MIT. After that, he spent some time as a visiting researcher, first at the University of Toronto and after that
at IBM Research in Israel, at the Haifa Labs. And now he's a professor at the Toyota Technological Institute in
Chicago, and he'll talk about his work which won the Best Paper Award at ICML this year. It's
guaranteed to be enjoyable for everyone.
>> Nathan Srebro: Thank you. Okay. Hello. So what I'll be talking about today is actually two different
things. Mostly I'll be talking about training support vector machines; I'm going to explain in a bit what
exactly I mean by that, and what support vector machines are in particular. So mostly I'll be talking about
training support vector machines, and then for probably the last quarter of the talk, maybe a bit less, I'll also
talk about clustering.
And both of these, I think, illuminate fairly different aspects of an underlying theme which I find very
interesting and important, and which I'll return to at the end: looking at the interaction between
computation and data and, in particular, seeing how computation becomes easier as we have more and
more data. I want to point out that this is based on several pieces of work: mostly with Shai
Shalev-Shwartz and also (inaudible), and previous work with Greg Shakhnarovich and Sam Roweis.
So I'll start with support vector machines. Quick reminder: I'm assuming that most of you here have seen
SVMs in some form or another. From my perspective, really, SVM is just linear separation. We have a bunch
of training data points, and based on these training data points we want to predict whether a
future point is green or red. We do that by finding a linear separator -- not just any linear separator, but a
linear separator with a large margin, one that maximizes the separation between the separator and the data points.
And if I look at this (inaudible), what this means is: if I characterize the linear separator by the vector that's
normal to it, the inner product of this vector with all the points has to be more than the margin or less than
minus the margin. Equivalently, we can renormalize things: instead of talking about a unit vector, we require
the margin to be one, and then what corresponds to the margin is just one over the norm
of the vector. Which means that we can -- sorry.
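In symbols (a reconstruction of the slide, which is not in the transcript): requiring, for every training point,

\[
y_i \langle w, x_i\rangle \;\ge\; \gamma \quad \text{with } \|w\| = 1
\qquad\Longleftrightarrow\qquad
y_i \langle w, x_i\rangle \;\ge\; 1 \quad \text{with margin } \gamma = \tfrac{1}{\|w\|}.
\]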
Also we maybe can't hope to get or want to get complete separation. There might be some noise. Might
be some points that we can't quite separate and that's okay. We're just going to pay for them. And how
much we're going to pay for them is just how far they are from where they're supposed to be. For example,
this red point that's on the wrong side of the margin, we just pay this much.
Okay. So what this means is that basically we can think of SVM training -- the problem of finding the
best large-margin separating hyperplane -- as a problem of, on the one hand, maximizing the margin, which
means minimizing the norm of W, and, on the other hand, minimizing the error. And, again, the error is the
so-called hinge loss, which just looks at how far points are from where they're supposed to be.
And the end result of this quick story is that training a support vector machine essentially
boils down to solving a convex optimization problem: minimizing a combination, controlled by a
regularization parameter, in this case lambda, of the norm of W and the error.
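Written out, the training problem just described is the usual regularized hinge-loss minimization (reconstructed here, up to the exact placement of constants, since the slide itself is not in the transcript):

\[
\min_{w}\;\; \frac{\lambda}{2}\,\|w\|^2 \;+\; \frac{1}{n}\sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\langle w, x_i\rangle\bigr).
\]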
Okay. So, again, I presume that many of you have already seen such presentations of SVMs. And so we're
left with this optimization problem. How do we solve it? It's just a convex
optimization problem; it can actually be written as a quadratic program, and we can throw it at QP
solvers. So straightforward optimization methods for this problem, based on popular methods for solving
quadratic programs such as interior point methods, would take time that scales roughly with the fourth power,
or maybe the three-and-a-half power, of the size of the dataset.
So this is fine for solving SVMs of size up to maybe 100 variables or so. And there's been much
improvement on this. The most popular methods now for dealing with the support vector
machine optimization problem are based on decomposing the dual problem -- SMO and other approaches --
and they can get the runtime down to scale quadratically with the size of the problem,
at the cost of only linear, as opposed to quadratic, convergence -- so a dependence on the log of the accuracy
as opposed to the log-log of the accuracy. And more recently there have been methods suggested, specifically for
linear problems, which scale only linearly with the size of the problem. So all these methods really look at the
problem as an optimization problem, and all of them tell us that when we have more data, we should
expect to work more.
And they sort of just try to push that down: we're going to work more, but maybe not as badly.
So instead of a fourth-order increase, maybe we can go down to only a linear increase with the size of
the data.
But in any case, we study the problem here as we do for most problems in machine learning and
computer science: we look at the runtime as an increasing function of the size of the dataset.
And my main point I'd like to make is this is -- this study seems somehow wrong. So the runtime should not
increase with the amount of data. So let's look at what's going on here. So what do we use SVM for in the
first place?
We use it in order to get some predictor that will have good error when used in some task -- character
recognition or whatnot. Say we want to do character recognition: we train an SVM and use it for
character recognition. Suppose that, given 10,000 examples, I work for an hour and
produce a predictor that makes 2.3% error on my character recognition task. So that's great, and
maybe I'm happy with this 2.3% error. Now you give me a million examples -- you give me 100 times
more data.
So what the traditional SVM optimization approach would say is: now I'm going to crank this through my
optimization solver, which scales at best linearly, and more realistically quadratically, with the size of the training
set, which means instead of working for an hour I'm going to work for maybe a week.
And after that week, okay, I will get a somewhat better predictor. Instead of getting 2.3% error, I'll get
2.29% error. But if I was happy with my 2.3% error, there's absolutely no reason for me to work for this
entire week. What I can definitely do is just subsample: take only 10,000 out of those one million
examples, throw away the other 990,000, do exactly what I did before -- work for an hour just like I did
before -- and get that 2.3% error.
And this general argument applies generally. When studying a machine learning problem with respect to how
much work we have to do in order to get some performance -- and performance here is target accuracy,
how much work you have to do to reach it -- the amount of work should never scale up with the amount of data
that I have.
In fact, what we'll see is that what we can hope for, and can get, is that it should scale down. Instead of just
subsampling, I can do something more sophisticated: use that whole million examples, not throw them
away, and instead of working for an hour, work for only ten minutes and get the same 2.3% error. So what
we're going to see is how to look at runtime as a decreasing function of the amount of available data -- how
we can use more data in order to make training take less time.
But in any case, what definitely I hope I've convinced you is that you should definitely not study runtime as
an increasing function of the amount of available data. At the very least it should not increase.
Now, you might be thinking, okay, you don't really buy this, because maybe you really do care about that
0.01% increase in performance. Maybe it's really important for you to get better performance, to get better
predictive accuracy. That's fine. But then you should study how runtime increases as a function of the
target accuracy you want. If you want more accuracy, you'll have to work harder, and
that's fine. So if you tell me, I'm going to take
more time because I want higher target accuracy, that's fine. But if you are saying, I have a large
dataset, so I am going to have to work harder -- that's wrong.
Some problems are really hard. Some problems really do require you to crunch for a week. You
might have a problem that's so hard that you won't get anywhere even crunching for a week. That's also
fine.
But in that case, what you really should be studying is how runtime scales with the hardness -- the inherent,
intrinsic hardness -- of your problem. So for SVM problems in particular, large margin problems, maybe this
hardness is captured by the margin. If you have a very, very small margin, yes, it's harder to learn.
In that case you can say, okay, if I have a small margin, I have to run for more time. I want to look at how
runtime increases as the margin decreases, or as the norm of W increases.
But, again, just saying I have more data, so because I have more data I'm going to run for more time -- looking
at the runtime as an increasing function of N -- is wrong.
So we'll see this in particular for SVMs. Let's look back at SVM training and look at what's going on here. So
we wrote SVM training as optimizing an objective. And the issue is that all the traditional runtime
analyses, and also the traditional optimization methods for SVMs, really treat this as an
optimization problem.
They aim to get a very accurate solution to the optimization objective: the runtimes I showed you before say
how much runtime it takes to get within epsilon of the optimization objective. But that's not really what we
care about. It's true that we optimize the optimization objective, but that's not what we really want to minimize.
What we really want to minimize is our true objective, which is prediction error: we want to get low error on
future examples.
So why don't we just minimize this directly? Well, we can't minimize it directly, because we don't observe
our future error. But that's what we really care about. So instead of minimizing the future error, what we
actually do is minimize some surrogate written in terms of the training error and regularization, and so we
minimize our optimization objective, denoted here by F. But that doesn't mean that when we study runtime we
should study it in terms of how low we get F, because I don't really care about getting my optimization objective
really, really low. What I care about is getting my predictive error very, very low. So what we'll do is study
optimizing F in terms of how well we do on the error.
So, in particular, we're going to look at how the computational cost of optimization, as we said before,
increases as a function of the desired generalization performance. It increases not in terms of the
epsilon on F -- how accurately I solve for F -- but in terms of how low I want to drive
my error. It's going to increase in terms of how hard the problem is, in terms of how small the margin is.
But it's going to decrease as a function of the dataset size.
And let's see how we might hope to get that. In order to do so, it will be useful to look at the error
decomposition. I'm sure many of you have already seen the standard error decomposition in machine
learning. Usually in learning tasks we decompose the error into, first, the approximation error.
This is the error that says, okay, even if we had tons of data, this is the best error that we can get
with a hypothesis in our hypothesis class. In our case, you can think of it as the best error achievable by a
large margin separator.
It's actually a bit trickier here, since we're talking about a regularizer and not strictly limiting a hypothesis class,
but we can define things appropriately: it's the error achieved by the population minimizer -- if we had
absolutely infinite data, what would our error be.
This is, in a sense, the unavoidable error. And on top of that we have estimation error. The
estimation error results from the fact that we don't have infinite data; we're only doing everything based on
an estimate of the error, which is given by the training error.
This is a standard decomposition, I'm assuming many of you are familiar with. But, really, this
decomposition refers to this W star which is the mathematical minimizer of the SVM objective. But we
never have the mathematical minimizer of the SVM objective. After all, we run some SVM optimization.
And it runs for some finite time and gives us some predictor which is finitely close to the true
mathematically defined optimum. So really what we have to add here is another error which is the
optimization error. And this is the difference between the error of the mathematical optimum and the error
of what our algorithm actually returns.
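In symbols, writing L(.) for the expected (future) error, $w^\dagger$ for the population (infinite-data) minimizer, $w^*$ for the exact minimizer of the SVM training objective, and $\hat w$ for what the algorithm actually returns, the three-term decomposition being described is (my notation, not the slide's):

\[
L(\hat w) \;=\; \underbrace{L(w^\dagger)}_{\text{approximation}}
\;+\; \underbrace{L(w^*) - L(w^\dagger)}_{\text{estimation}}
\;+\; \underbrace{L(\hat w) - L(w^*)}_{\text{optimization}}.
\]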
And I should point out that a very similar decomposition was suggested, and presented to us about a year ago,
by (inaudible). Okay. So this is the decomposition. Let's think about what happens when I have
more data, when the training set size increases.
So when I have more data, okay, the approximation error stays the same, but the estimation error decreases,
right, because the estimation error depends on the sample size: the more data I have, the better I can do
estimation. The estimation error goes down. But recall that our target here is to get some target performance,
our desired 2.3% error.
If the estimation error decreases -- if, in some sense, the mathematically defined optimum is better -- that
means I can do a much worse job optimizing. I don't have to optimize as well. I have so much data that, if I
really optimized it all the way, maybe I could get lower error, but I can afford to just do a
bad job optimizing.
What this means in particular is that I can take maybe less time optimizing. If I have an iterative optimization
method, as most of ours are, I can run for fewer iterations. That's good. That means I can get lower runtime,
which is what I want.
However, with most optimization approaches, the situation is a bit trickier. And really increasing dataset
size is a double-edged sword. On the one hand I can work for less iterations. I can do a worse job
optimizing. On the other hand, each optimization iteration takes much more time.
So with traditional optimization approaches, in each optimization iteration I look at the entire dataset, maybe do
something complicated with the entire dataset. Each optimization iteration by itself scales linearly, maybe
quadratically, with the amount of data. So overall I'm not going to win here, at least not win much: I do fewer
iterations, but each iteration is much more expensive.
So this is why we don't get this type of behavior with standard approaches, but we do get it if we move to
stochastic gradient descent. Last year we presented a specific stochastic gradient approach, which
I'll present on the next slide, called Pegasos. And the important point about Pegasos, or any stochastic
gradient descent optimizer for that matter, is that in each iteration you only look at a single
example, so the runtime of each iteration does not increase with the sample size.
So once we can do a worse job optimizing, we need fewer iterations; the cost of each iteration is
fixed, and we really do get a gain in runtime.
So let's look at Pegasos. Pegasos is a fairly straightforward stochastic optimizer.
In each iteration we look at one random training example from our training set, and we look at
our objective limited only to that example -- that is, the error on that example plus the regularization
term. We look at the gradient -- actually the sub-gradient, but it's really not important, think of it as the gradient --
of this function, and take a step in the direction opposite the gradient, with a predetermined series of step sizes
that behave roughly as 1 over T, so 1 over the iteration number. So at each iteration we take smaller and
smaller steps. That's all. Each iteration is very simple: take a random point, compute the gradient, which
requires looking only at that example, and take a step in that direction.
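As a concrete illustration of the update just described, here is a minimal sketch of a Pegasos-style iteration in Python. The function name, toy data, and parameter values are my own, not from the talk, and the published algorithm also has an optional projection step that is omitted here:

```python
import numpy as np

def pegasos_sketch(X, y, lam=0.1, n_iters=10_000, seed=0):
    """Minimal sketch of Pegasos-style stochastic sub-gradient descent for the
    linear SVM objective  (lam/2)*||w||^2 + (1/n)*sum_i max(0, 1 - y_i <w, x_i>).
    Illustration only, not the authors' reference implementation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        i = rng.integers(n)             # one random training example per iteration
        eta = 1.0 / (lam * t)           # step size shrinking roughly as 1/t
        if y[i] * X[i].dot(w) < 1:      # hinge loss active: sub-gradient is lam*w - y_i*x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                           # hinge loss inactive: only the regularizer contributes
            w = (1 - eta * lam) * w
    return w

# Toy usage on synthetic linearly separable data with labels in {-1, +1}.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 20))
    y = np.sign(X @ rng.normal(size=20))
    w = pegasos_sketch(X, y, lam=0.01)
    print("training accuracy:", np.mean(np.sign(X @ w) == y))
```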
So the cost of each iteration just scales linearly with the size of the representation of a single example. And the
major important result here is that in order to get within epsilon of the optimum of the
objective function -- so, again, this is now epsilon on the optimization objective -- we only need on the order of
one over lambda-times-epsilon iterations. So the number of iterations does not depend on the size of the
training set.
Which means that, okay, as I said before, the cost of each iteration only scales linearly with the size of
each example -- if it's sparse, even better, with the size of the representation of each example -- which
means the total runtime for Pegasos to get within epsilon of the optimization objective does not increase at all
with the size of the training set.
It's just the dimensionality of the data, or the sparsity of the data, over lambda -- the regularization
parameter -- times epsilon.
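In other words, the guarantee just described amounts to roughly

\[
\underbrace{\tilde O\!\Bigl(\tfrac{1}{\lambda\,\epsilon_{\mathrm{opt}}}\Bigr)}_{\text{iterations}}
\;\times\;
\underbrace{O(d)}_{\text{cost per iteration}}
\;=\;
\tilde O\!\Bigl(\frac{d}{\lambda\,\epsilon_{\mathrm{opt}}}\Bigr)\ \text{total runtime,}
\]

with no dependence on the training set size n (here d is the dimensionality or, for sparse data, the number of non-zeros per example, and $\epsilon_{\mathrm{opt}}$ is the optimization accuracy).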
So what we see here is that we have no runtime dependence on the training set size. I promised you runtime
decreasing with the training set size; we'll see that shortly, because remember this is the runtime dependence
for epsilon which is the optimization epsilon.
Before going on, it is important to realize that Pegasos is also a good method in practice. I'm going to
quickly flash a slide with some results. The major point here is that, on some standard
large sparse SVM training benchmarks, Pegasos gets significantly better performance than previous
methods.
But this is really not the major point I want to talk about. So let's move back to actually comparing the
actual runtimes. If we look at the runtime of Pegasos compared to other methods, it's actually a bit difficult
to compare them. So now I'm, again, talking about the traditional optimization runtime comparison.
It's a bit tricky to compare these runtimes because, on one hand, Pegasos has no scaling at all in
the training set size, while other methods have linear, or even quadratic or worse, scaling.
On the other hand, Pegasos has a very bad dependence on the optimization accuracy, on the epsilon --
how close I get to the optimization objective -- whereas other methods have only logarithmic or maybe
double-logarithmic behavior. And also lambda, the optimization trade-off parameter, comes
into the runtime, and I don't know exactly how to look at that. Is this better or is this worse? I don't know.
And, in a sense, from my perspective this comparison is beside the point, because, as we said before, this is
really a runtime analysis in terms of the wrong parameters. These are not parameters of the learning problem.
The optimization accuracy is a parameter of the optimization problem, not of the learning problem. It's not
something I care about. I care about what error I'm going to get.
This optimization trade-off lambda is some mechanism I use in order to balance the margin and the error. It
doesn't really tell me anything about how hard or easy my problem is.
So what we're going to do -- I'm still going to show you how we get a decreasing dependence of
runtime on data, but before we do that, let's just try to understand these runtimes, and we'll see that's going to be
helpful for us also for seeing the decreasing behavior. We'll try to understand them, as we promised before,
in terms of the parameters of the learning problem: in terms of the desired generalization performance, the
desired error.
Not the optimization epsilon, but the actual error I'm going to get in future predictions and in terms of the
hardness of the problem, inherent hardness of the problem, how big a margin is it? How noisy is it? And in
order to do that we're going to look at the following situation.
So suppose there is some large margin predictor. I don't know it; I want to learn it. We're assuming that it
is possible to separate the data with large margin and low error. In particular, there's some predictor W-naught
that has low norm -- corresponding to a large margin -- and small error.
And what we're going to ask is: what's the runtime required in order to get a predictor W
with error which is almost as good as this oracle predictor that we know exists but, of course, don't know?
We're going to study the runtime required to find such a W as a function of
this epsilon and of the margin. Now, again, this epsilon is very different from the optimization epsilon. This
epsilon talks about the quantity that really interests me: the predictive error. Okay. I'm
going to go fairly quickly through this analysis. We can decompose the error of W -- this is roughly the
decomposition that we had before -- comparing it to the error of W-naught. So we decompose
it into the error of W-naught, plus, by sort of adding and subtracting the expected
objectives, what this boils down to: we have the error of W-naught; another term here, which is roughly the
approximation error, which depends on the norm of W-naught;
plus these two terms, which replace the difference in the expected SVM objectives by the difference in
the actual SVM objectives of the two predictors, and a term that measures how quickly the actual SVM
objective approaches its expectation.
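Spelled out (my reconstruction, with constants and low-order terms suppressed): writing F for the empirical SVM objective, $\bar F$ for its expectation, and L(.) for the error,

\[
L(w) \;\lesssim\; \underbrace{L(w_0) + \tfrac{\lambda}{2}\|w_0\|^2}_{\text{approximation}}
\;+\; \underbrace{\bigl[F(w) - F(w_0)\bigr]}_{\text{optimization}}
\;+\; \underbrace{2\,\sup_{w'}\bigl|\bar F(w') - F(w')\bigr|}_{\text{estimation}}.
\]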
Roughly speaking, this is an approximation error. And what's going on here? This is optimization error and
this is estimation error. But the details here are not so important. Okay. Sorry. What's important is that
then I can bound this term here. What is this term? It asks by how
much bigger the SVM objective of W is than the SVM objective of W-naught. Now, I know that if I actually
optimized my objective, I would find a W that actually minimizes the SVM objective, so this term would never be
positive.
However, I don't ever completely optimize it. I only optimize it within epsilon. So I can bound this term in
terms of my optimization accuracy. So this epsilon is the epsilon that measures how close I am to the
optimum of F.
And now recall that what we wanted was to get a predictor W that has overall small error -- in
particular, error equal to the error of W-naught plus epsilon. In order to get that, we definitely have to bound each
of these three terms by something which is order epsilon, because we want the sum to be order epsilon.
And what this tells us is that we can now go back and, from each of these three inequalities, extract bounds on
what lambda has to be, what the optimization accuracy has
to be, and how big a dataset we need.
What we see is that in order to get the predictor we want, we need lambda to scale roughly as
epsilon times the margin squared, the optimization accuracy to be roughly of the same order as the
epsilon we want here, and the dataset size to scale in the familiar way: growing as one over the margin squared
and one over the target error squared.
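Collecting these three requirements (constants suppressed; epsilon is the target excess error and $w_0$ the reference large-margin predictor):

\[
\lambda \;\sim\; \frac{\epsilon}{\|w_0\|^2}, \qquad
\epsilon_{\mathrm{opt}} \;\sim\; \epsilon, \qquad
n \;\gtrsim\; \frac{\|w_0\|^2}{\epsilon^2}.
\]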
So what we can do with this now is go back to our traditional runtime analysis. Our traditional runtime
analysis was in terms of the dataset size that we use, the optimization accuracy, and this lambda; we plug
these requirements on lambda, on the optimization accuracy, and on the dataset size into our traditional
runtime analysis.
When we do that, we get a runtime analysis which is now in terms of the norm of W-naught squared -- you can
think of the norm as 1 over the margin, so the larger the norm of W-naught is, the harder the problem is, the
smaller the margin --
and in terms of epsilon, which is our epsilon on the error: how close we are to the best predictive error we
can get.
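Plugging these requirements into the Pegasos runtime $\tilde O\!\bigl(d/(\lambda\,\epsilon_{\mathrm{opt}})\bigr)$ gives, in the parameters of the learning problem,

\[
\tilde O\!\Bigl(\frac{d}{\lambda\,\epsilon_{\mathrm{opt}}}\Bigr)
\;=\; \tilde O\!\Bigl(\frac{d\,\|w_0\|^2}{\epsilon^{2}}\Bigr).
\]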
So we got what we wanted, which is a runtime analysis in terms of the actual parameters of the learning
problem. How hard is the learning problem and how accurate a solution do we want to the learning
problem.
And what we can see from this analysis, okay, is in fact that Pegasos has a much better runtime guarantee
than other methods. It only scales quadratically, both in the margin and in the desired error, whereas other
methods have a fourth order or even seventh order scaling.
So what I hope I've convinced you of so far -- or at least maybe convinced you of -- is that this is a better way
to look at the runtime of SVM optimization, and it provides a better way to compare different methods
in terms of how well they actually deliver what we want, which is a good predictor. Looking at it
this way, Pegasos actually does have significant benefits over other methods, which also explains our empirical
results.
But notice that in doing all this analysis, I got to set the dataset size: I chose whichever dataset size I wanted.
What does this mean? You can think of it as corresponding to the situation in
which I have unlimited data: I can always ask you for more data, or I have an infinite data source, and I can
choose the optimal dataset size to work with. So I choose the dataset size that exactly suits me and gives
me the error that I want.
Okay? So you can think of this as a data-laden analysis. This is the absolute minimum
runtime you need if you had access to unlimited data, so that you could choose to work with however much data
you want. But in reality we usually don't work in this data-laden regime. We have some limited dataset, and
that's what we have to work with.
So now let's go back and look at how runtime scales with the amount of data -- let's go back here and
look at how the runtime scales as a function of the training set size. What the data-laden analysis tells us is
that, if I have enough data, this is sort of the minimum I can hope for -- the minimum runtime if I
have however much data I want.
Now, of course, it's not the case that with any amount of data I can get this runtime. In particular, remember that
this analysis is of the runtime to get a fixed predictive accuracy -- to get 1% predictive error, say. So
there's some limit on how much training data I definitely need, and this limit comes from the standard
statistical analysis relating sample size and predictive accuracy. In order to get my 1% error I definitely
need some minimal number of examples -- 100,000 examples or whatnot.
If I have fewer examples than this, it doesn't matter how hard I work, I'm never going to be able to find a good
predictor; it's just statistically impossible. So this is sort of my statistical limit. It measures just how many
samples I need. If I'm to the right of this blue line, to the right of this limit, then if I worked hard enough -- if I
really optimized that SVM objective all the way down -- I'd get the error that I want. But it's still very difficult to
do that, because if I'm just to the right of this limit, I really have to optimize my optimization objective all
the way down and do a really good job optimizing, which will take me a lot of time.
What happens is that as I cross this limit and get more and more data, I can do a worse job optimizing and
gradually decrease the runtime. And we can analyze this exactly -- okay, sorry. Looking at the error
decomposition, this point is exactly the point at which the approximation error and estimation error by themselves
already account for all the error I'm willing to tolerate.
So there's absolutely no room for optimization error: I have to drive the optimization error to zero.
As my dataset size increases beyond this point, there's more and more room for messing around -- room for
doing a bad job optimizing and taking less time.
So at least in the case of Pegasos we can do this analysis exactly: essentially take the
calculation from before, put it into the actual expression for the optimization error of Pegasos after a
fixed number of iterations, and get an expression for the runtime when we have only a limited amount of data,
with a square-root-of-n term in it. And what we get is an expression which behaves in the way we wanted it to
behave, or the way that makes sense for the runtime to behave. So it increases when epsilon is small:
when I want small predictive error, I need to run for longer. We're fine with that.
And it increases when I have a smaller margin. Okay. So the smaller the margin that I know I have,
the more time it's going to take me; we're fine with that, we know that's going to happen. But it
decreases with the size of the dataset: the larger the dataset I have, the less time I actually have to run.
And this function is known theoretically, and it's shown over here as this decreasing curve which
bridges between the statistical limit and the minimal runtime that we have in the data-laden
limit.
So this is all just analysis of an upper bound -- it gives me an upper bound on the runtime --
and the question is: does this actually hold in practice? And we actually observed this behavior empirically.
So this plot shows the runtime of training on the Reuters dataset, a fairly large standard
benchmark dataset for SVM training. And we set the target error level to 5.25%,
which is a bit less than a tenth of a percent (inaudible) above the best you can actually get on
this dataset.
And we look at the runtime to get this accuracy if we train on only a subset of the Reuters dataset.
And what we in fact see is that, as predicted by the theory, when we have more data the runtime actually
decreases. And, at least for this specific example, you need only about double the amount of data past the
point where you have enough data to get the 5.25% error -- past the statistical limit. You need about twice
as many examples, that's all -- in this example only twice as many, not 10 times as many examples -- in
order to be able to reduce the runtime, or rather, to be able to get to the (inaudible) runtime. In particular,
when you have about twice as many examples, we reduce the time here by about seven-fold. So having
twice as much data means you have to run for seven times less time, in this particular example.
So we see here that we get this behavior of decreasing runtime as the data size increases -- the
behavior that I argued one would expect. And as far as I know this is the first
demonstration of it: both a theoretical analysis of how and why we would get it, and an
empirical study actually showing that we get it in practice. And it's important to note, again, that
we're getting this decrease on the best method for this dataset.
Okay. Now, I want to go back here and correct something I said, because the situation
is a bit more subtle than what I said. What actually happens as we increase the training set size is that not only
do we increase the optimization error, but we also increase the approximation error. When we
have more and more data, the estimation error decreases, so we could have just kept the approximation error
fixed and increased the optimization error; but it's also better to increase the approximation error. What do I
mean by increasing the approximation error? For a second, ignore what's written here. At least intuitively, the
approximation error measures the error of a fixed hypothesis class; now, instead of that hypothesis class, I'm
going to consider a smaller hypothesis class. That means I'm making my life harder -- I have less
flexibility in choosing a predictor -- which means I'm going to have a higher approximation error: I'm going to
do a worse job approximating my target.
But the hypothesis class is smaller, which maybe means that it's going to be easier for me to run on it -- it's
going to take lower runtime, which is what I want to get here. So this is what's going on here, though the
situation is more subtle. What controls the approximation error is lambda. When lambda is
very large, it means that I'm very restrictive in what predictors I can choose, and I have a large approximation
error. When lambda is very small, then it means I'm very unrestricted, and I have a
low approximation error.
So in particular, what happens here is that as we increase the training set size, we can increase lambda. For
those of you who might be familiar with results on consistency, or just generally with training, you're
used to thinking of decreasing lambda as the training set size increases, and the reason we decrease lambda is
in order to decrease the approximation error, because that's what we do if we want to drive the error to zero.
But here what we care about is driving the runtime to be small, within our budget of what error we
allow. And the best way to do that is actually to increase lambda, which means that we're doing more
regularization but taking less runtime.
Okay. So back to the main story here. Okay. So I have 20 more minutes or so. Before going on: we said
that for stochastic gradient descent, for Pegasos, we get this decreasing behavior.
We can also ask what happens for other methods.
So we can repeat the same analysis for dual decomposition methods such as SMO, and for SVM-Perf. What we
see in those cases is an increase in runtime as the dataset size increases.
If I increase the amount of data on which I run the algorithm, well, at first I get a very sharp
decrease -- when I go from allowing no optimization error at all to allowing a bit of optimization
error -- but then very quickly the runtime increases.
In particular, there's some optimal dataset size, and going beyond that optimal dataset size is only going
to do me harm. So running the optimizer on larger and larger datasets takes more time, which is maybe not
surprising; it matches the standard analysis of
these methods, and it's kind of what we're used to: if we use more data, then it takes more time. Again,
this type of behavior is in some sense wrong. The same thing happens for SVM-Perf, which is a cutting plane
method -- that's important.
Though not quite as bad. We also see these behaviors empirically. So here we're comparing these
two methods. For the dual decomposition method -- SMO, the familiar SVM algorithm -- we see a sharp
increase; it's very difficult to detect the initial decrease. And for methods like SVM-Perf --
cutting plane methods, which have a much nicer dependence on dataset
size, at least according to the traditional analysis -- we do see an initial decrease, but then the runtime starts
increasing again.
This is as opposed to stochastic gradient descent, where you just continue to go down. Okay. So I'll skip this.
Okay. So let's summarize what's going on here. And first I have to make a confession: I really talked all
the time about SVM optimization, but everything I said about Pegasos holds only for the so-called linear kernel.
So for the problem as I presented it on the first slide, where we're explicitly given data --
explicitly given features in R^d, explicitly given feature vectors -- we get the
performance guarantees that we get and everything is fine.
You can run Pegasos -- you can run stochastic gradient descent -- also when you use actual kernels,
nonlinear kernels, in the more familiar application of SVMs. But then things become a bit trickier. In
particular, although you can run Pegasos and get the same guarantee on the number of iterations, the cost of
each iteration does increase with the amount of data, because you have to represent everything in terms of
the support vectors, and the number of support vectors can increase, and you get into
trouble.
So one thing that would definitely be interesting is to get this same type of behavior for truly kernelized
SVMs. We're not yet able to do that, although I do believe it is possible. The other thing is that the
decrease that we saw here in the runtime is really a result of the fact that we can increase the optimization
error, and maybe also increase the approximation error. It's playing on this error decomposition, and there's
a limit on how much you can gain from that.
In particular, the theoretical analysis is almost tight: without assuming anything else, you
cannot hope to get a much better decrease in runtime beyond what we're getting, at least theoretically.
And it would be fairly interesting, and I think very possible, to try to leverage this information more directly. If
you really have tons and tons of data, the answer is just sort of in your face -- it should really help you
much more significantly to reduce the runtime. It might be possible to
do so empirically, and maybe also to have theoretical analyses that are based on somewhat more
complicated assumptions. The theoretical analysis here is only based on assuming that there is a
large margin predictor. Maybe if you assume a more detailed oracle assumption -- assume something
about what that large margin predictor looks like, or that you have many (inaudible) predictors -- it might
be possible to get even better decreases in the runtime, and to more directly leverage the excess data.
But going beyond SVMs the basic story I told you is valid for essentially any learning method. Not only
SVMs. If you have more data, runtime should not increase. And, in fact, it should be possible to decrease
runtime when you have more data.
And I'm looking forward to seeing optimization approaches for other methods studied in this way -- with runtime
as a decreasing function of the data, or at least as a function of the true parameters of the problem and not of
the optimization problem. And hopefully doing that will also bring up optimization approaches that actually
behave this way and are more appropriate for learning.
Okay. So this sums up the first half of the talk, which is about SVM optimization. What I'm going to do now
is switch to something that will seem like a completely different topic, but hopefully by the end you'll see
that it's not actually a completely different topic. That's looking at the computational hardness of clustering.
What we saw in the first half of the talk is that if you look at SVMs, the SVM problem is always, in a
sense, easy -- it's a convex problem. The question is just: do we take a day or do we take an hour? It's just a
matter of how we scale in the parameters.
Looking at clustering -- and here, just to make things very simple, I'm talking about the simplest form of
clustering: I have data that's generated from a few Gaussian clouds, and what I want to do is
reconstruct the separation into those clouds, into those Gaussians. If you look at this problem,
we can do this by maximizing the likelihood of the Gaussian model. And this problem is -- it's not too
difficult to show -- hard in the worst case, which means that there are some inputs on which I can
never really maximize the likelihood.
If you prefer to think of it in terms of the k-means objective, it's similar for k-means clustering.
Minimizing the k-means objective is hard: you can't have an algorithm that minimizes the
k-means objective for any input data. But this statement is almost meaningless, because the instances on
which you can actually prove that this is hard -- the instances you get in that reduction -- are instances in which
you really don't have any clustering. They're a bunch of points arranged in a very specific way in which
there's no good clustering.
I mean, sure, there's one clustering that has a slightly less bad objective than all the other clusterings, and
finding the clustering with the slightly-not-quite-as-bad objective is hard, but who cares about that? It's not
actually going to give us a real clustering. This situation doesn't correspond to a situation in which there are
clusterings and we actually want to reconstruct them.
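For reference, the k-means objective referred to above is, in standard notation,

\[
\min_{\mu_1,\dots,\mu_k}\;\sum_{i=1}^{n}\;\min_{1\le j\le k}\;\bigl\|x_i - \mu_j\bigr\|^2 .
\]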
So, on the other hand, there are results that say that if you do have a very distinct clustering, and the
clusters are very, very well separated, then it's actually very easy -- then we actually do
have provably good algorithms to reconstruct the clustering.
Those guarantees come from a long series of papers, starting with the work by Sanjoy Dasgupta in 1999 and
leading on to progressive improvements, handling more complicated situations and reducing the required
amount of data.
But really, all these results require the clusters to be extremely well separated, and require us to have lots and
lots and lots of data. Maybe more importantly, the practical experience has been that if the clusters are
even just reasonably distinct -- they don't have to be very well separated -- and you have enough data,
you can just run EM, or your favorite local search method, and it will find the clustering.
And this really -- okay. So, again, this is if there really is a distinct clustering and if there is a lot of
data. If there isn't a distinct clustering, or if there's too little data to identify the distinct clustering, it's not really
interesting to cluster. If I have too little data, the problem is not a computational problem, it's
a statistical problem: I cannot reconstruct the clustering no matter what I do. So if you have too little data, I
can't do anything anyway; and if you have a lot of data, it's computationally easy.
So this leads to a general saying, which I think many people subscribe to, that clustering isn't really a hard
problem: either it's easy -- you have lots of data, use EM and find it -- or, if that doesn't work, it's probably not
interesting; there's probably no clustering there to find in the first place.
And what we're going to see -- what we're going to check -- is whether this saying is correct or not. In
particular, we're going to see that it is actually not the case. So let's look at the situation in terms of
cartoons of the likelihood -- revisiting the situations we described on the
previous slide. Here, if I have a lot of data -- and you can think of this as a cartoon, if you like -- there's lots
of data, and the answer is just there: a huge peak right at the correct model, very easy to see from far away,
very easy to find computationally. It's easy.
If I don't have enough data, then it's true that maybe around the correct model -- which one is it here? --
there's a peak, but there are also lots and lots of other peaks of the likelihood, and various other models that
have nothing to do with the correct model also have high likelihood, and statistically it's just impossible to
discern the correct model from the models that are incorrect.
And what we're going to claim is that there's actually an intermediate regime, where there's just enough data
so that statistically the solution -- the correct model -- is identifiable: the peak at the correct model is higher
than all the other peaks. But computationally it's hard to find.
It's not distinct enough to make the problem computationally easy. In particular, we're going to look at the
informational limit of clustering -- how much data you need for the clustering problem to be informationally
possible, disregarding computation; again, this is similar to the statistical limit that we had before -- and
the computational limit, which is how much data you need in order for the problem to be computationally
tractable. And I wish I could give you an analytic analysis of these two limits and prove to you where they
are. I can't do that. What we did instead was a very detailed empirical study. I'm not going to have
time to go into the details of the empirical study; it appears in our ICML paper from two years ago, from
ICML 2006, or you can ask me about it later.
Essentially what we did is an empirical study trying to quantitatively evaluate these two limits. So what we
have is several million results that look like this: what we see here, as a function of dataset size, is the
error of the clustering -- how close we are to the correct clustering. It's all
simulated data, so I know what the correct clustering is -- the error of the maximum likelihood solution, or what
seems to be the maximum likelihood solution, and of what we get with local search.
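To make this kind of experiment concrete, here is a rough sketch of such a study. This is my own stand-in, not the authors' code: scikit-learn's EM with its default k-means-style initialization replaces the EM-with-PCA procedure mentioned in the talk, and the data-generation details are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

def simulate_mixture(n, d, k, sep, seed=0):
    """n points from k unit-variance spherical Gaussians in R^d whose means are
    roughly 'sep' apart (a simple stand-in for the simulated data in the talk)."""
    rng = np.random.default_rng(seed)
    means = rng.normal(size=(k, d))
    means *= sep / np.sqrt(2 * d)          # typical pairwise mean distance ~ sep
    labels = rng.integers(k, size=n)
    return means[labels] + rng.normal(size=(n, d)), labels

def clustering_error_curve(d=20, k=5, sep=4.0, sizes=(500, 1000, 2000, 4000, 8000)):
    """Fit a Gaussian mixture by EM at increasing sample sizes and report how close
    the recovered clustering is to the planted one (1.0 = perfect recovery)."""
    for n in sizes:
        X, labels = simulate_mixture(n, d, k, sep, seed=n)
        gm = GaussianMixture(n_components=k, covariance_type="spherical",
                             n_init=5, random_state=0).fit(X)
        ari = adjusted_rand_score(labels, gm.predict(X))
        print(f"n={n:6d}  adjusted Rand index={ari:.2f}")

if __name__ == "__main__":
    clustering_error_curve()
```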
And what we can see here is that we can identify two fairly distinct (inaudible) transitions. We have the
informational limit: this is the point below which the maximum likelihood is meaningless. If you
have less data than the informational limit, then even if I can find the maximum likelihood model, it just has
random error -- it has nothing to do with the correct model. Beyond this point the maximum likelihood model is
good, and it starts to become a bit better. And what we can see is that if we have enough data -- in this case
more than about 4,000 examples -- just local search, EM (this is actually EM initialized with some PCA, but it's
an essentially very simple method), does find the correct model. So the problem is computationally easy.
But there's a very wide regime between them -- in this case, about a three-fold increase in the amount of data.
So I need about three times as much data for the problem to become easy. And in all of this regime -- if I have
between 1,200 and 4,000 examples -- the problem is actually interesting. If I had infinite
computational resources I could actually get a meaningful clustering, so the problem is
statistically interesting, but computationally it appears to be hard.
EM fails, and we don't have any other method to reconstruct the correct clustering. How does this relate to
what we talked about before? If we look at the runtime required to find the correct clustering as a
function of dataset size, we can make a plot similar to the one before, though it looks a bit different. We have
the informational limit that we had before: with less than this much data we're lost, it doesn't matter what we
do, the problem is statistically hard.
Once we cross that, the problem becomes statistically possible, but at the cost of very difficult computation --
in particular, in this case, it's possible if we enumerate over all possible clusterings, so you have an
exponential amount of computation.
And this runtime -- you can think of it as a decreasing runtime, where the decrease is not a gradual
decrease as we had before, but a drop: at some point we can start using methods that
are much more efficient -- polynomial-time methods, just local search methods -- and the problem starts
becoming tractable. So we again see here a decrease in runtime as the dataset size increases. But in this
case it is not a gradual decrease as we had before but a drop: a drop in runtime from being
intractable, exponential, to being tractable.
And very quickly -- we can actually study the width of this gap; this is what I call the cost of
tractability: how much more data do we need for the problem to become tractable? What's going on here?
We can study it. I'm not going to go too much into the details. Generally, you can see here what the
informational limit and the computational limit are as a function of the dimensionality and as a function of the
number of clusters. In both cases you can see a linear increase -- both increase linearly with the number of
clusters -- which means that the real issue here is the dataset size compared to the dimensionality and the
number of clusters.
So we can look at this plot, which plots the sample size per dimension, per cluster, required for
reconstruction, as a function of the separation -- the separation between clusters.
So, of course, the more separation we have, the less data we need; the problem is easier. But more
interestingly, what we see is: this is the informational limit, this is the computational limit, and this gap -- the
cost of tractability, how much more data we need to solve the problem tractably -- actually increases when
we have a larger separation.
So this is maybe at first thought counterintuitive, and opposite to all the theoretical results that say that if we
have a large separation then things are computationally easy.
Actually, what we're seeing here is the opposite. When we have a large separation -- sure,
in absolute numbers we need fewer samples, but the gap is much bigger. If we have more separation,
then in a sense the computational aspect of the problem becomes harder relative to the statistical
aspect of the problem.
Okay. I'm not going to spend too much time on that; this is what I said before. To sum up -- and I hope that
now, having seen both parts of the talk, you see the connection between them -- we saw this
relationship between data and computation. The standard view of the relationship between data and
computation, in computer science and in the traditional study of machine learning, is that if you have more data,
you have to do more computation.
And this is reflected both in the way we study algorithms and in traditional notions of complexity: we look at how
badly the runtime scales up when I have more data. Does it go up only quadratically, or does it go up
exponentially? But it's definitely a question of how much it goes up when I have more data.
Unfortunately, also in machine learning this is often the way we think about problems. We think -- we say
oh my God, I have 10 million examples, how will I ever be able to crunch that. And we're expecting
computation to increase when I have more data.
But really, definitely in machine learning, this relationship is reversed. The more data we have, the better off
we are. We should be happy that we have more data, not sad that we have more data. Having more
data should mean that we can do less work; we're using data as a replacement, in a
sense, for doing work. Which really means that we should study runtime not as an increasing
function of the data but as a decreasing function of the available data -- and instead as an increasing
function of the true parameters of the machine learning problem: how hard the problem is (for
example, in terms of the separation between clusters, or the margin, or any other parameter that really measures
the hardness of the problem), and how good a solution we want, in terms of the desired error.
And sort of -- what we saw here is -- okay, sorry, maybe before that: this is an important
point. I talked about SVMs and clustering, and I want to go back a bit to the clustering. So I talked here
about clustering, but really this type of behavior, this phase-transition type of behavior, seems to
appear in many other problems. Clustering is the problem I studied the most, as far as a very detailed
empirical study goes, but I do want to point out several other problems where, again, we see this transition from
the problem being intractable when we have only a limited amount of data to being tractable when we have tons
of data.
So this happens for learning the structure of dependency networks. We know the problem is hard in the worst
case, but it seems to be easy when you have tons and tons of data -- so much data that we can actually
measure mutual informations very precisely.
The situation is not completely resolved yet, but I'm pretty sure that it could be shown -- it can be
shown, hopefully it will be shown by somebody -- that this actually is easy in every case when we have enough
data.
(Inaudible) partitioning, similar to clustering. And also solving discrete linear equations or problem-set
problems: the problems are hard in the worst case, but it seems that when we have tons and tons of data the
problems become easy. And the interesting thing here to study is how the problem goes from being hard to
easy: characterizing the width of this hard region, what the cost of tractability is, and establishing
that there even is a hard component to it at all.
But going back here -- oops. What we see in many problems -- continuous problems such as SVMs, and the
combinatorial problems such as clustering that we saw, and other problems -- is that the correct way to look at
the problem is to study how computation goes down with more data.
And this is something that unfortunately we don't have very good understanding of. We have a very good
understanding of how error goes down when you have more data. So there's decades of research in
statistics but also in statistical learning theory that tells us how much data we need to get a certain error.
How does the error decrease when we have more data.
But especially in today's world, where we often have access to massive amounts of data, it can often be
useful to use the excess data not in order to get a 0.001 decrease in error but rather to get a significant
decrease in runtime.
And that's something that's very interesting to study, both from an empirical perspective and a theoretical
perspective. I just want to remind you again that specifically for SVMs with linear kernels we sort of
solved this, by showing that stochastic gradient descent has this behavior, and I'm looking forward
to seeing methods with such behavior for other continuous problems like SVMs -- in which we get, again, this
behavior on the left -- and then to understanding combinatorial problems: characterizing exactly
the behavior that seems to happen in many combinatorial problems.
Okay. So I think that's it.
>> Ofer Dekel: Questions?
>>: So one thing that is sort of missing is a characterization of the meaningful minimum error. In a lot
of problems, the reason we care more about data is that that actual 0.001 matters, because we're talking about
a class that's present 1% of the time, or 1% of 1%.
And subsampling data means throwing it out. In a multi-class problem, if you have a power-law class
representation, subsampling means you're throwing out the majority of all your classes. And yet
you're getting the big class, but who cares -- that's basic.
>> Nathan Srebro: So, I mean, the study that we present -- in particular, if we go back to the theoretical analysis,
for example -- does definitely include that error term in it. That error term appears here, and if you really
care about that 0.001, it means that you really need to drive this lower, and you're going to pay for that in
terms of runtime.
So this is definitely consistent with the study. It may, nevertheless, mean that you need a smaller
error. But for whatever error you are willing to tolerate -- for that error, if you have more data, the
runtime should decrease.
Now, if you're talking specifically about a problem where you say, okay, my expected error over all
examples doesn't really measure what I want, because it's dominated by the big class, and what I
really want to do is find the rare things --
in that case, what you should do is look at a different error measure, the error measure that you do care
about: for example, something in terms of precision and recall, or whatever other measure you have
that really captures those rare things.
And do a similar study for that error measure. So the study presented here is for using SVMs for a
standard prediction task, where what I care about is the expected error. But the basic idea and the basic
concept are still valid: whatever error measure you care about -- an error measure, a precision-recall
measure -- you would still expect to get such behavior. The specifics of the math that I showed would not apply
to your setting; they're specific to this SVM setting. But, again, I believe that if you worked on it enough you
would be able to derive the math that's specific to your setting. Does that answer the question?
>> I guess what I'm saying is that I'd rather have a characterization of how much error I should care about,
because I'm skeptical of deriving it for things like other measures -- I continuously see (inaudible)
work on error instead of other measures, because presumably that's what's easy to prove, whereas what
actually matters in practice are these other measures.
>> Nathan Srebro: Okay.
>> I was asking whether you had some insight into how we get a characterization of how much error I should
care about, given things like class presence or other variables, because that's why you usually
retain all of them.
>> Nathan Srebro: So my answer to that would be that the question is not how much error you should aim for in
that case; in that case, just looking at the error is the wrong thing. You should be looking at other
measures. And I think it's true that if you look at theoretical work, it probably
gives an overemphasis to very straightforward error measures because they're much easier to work with.
Nevertheless, there is theoretical work on things like precision-recall and other
measures that do pay attention to rarer events.
I'm not familiar with specific things to point you to, but, again, my basic answer is that you should study
performance in terms of the error measure that you do actually care about and that is suited to your particular
need or application, rather than think about what very, very small error is going to guarantee
what you want.
>> Here's another -- I want to restate what you're saying in a different way; you can correct me if
I'm wrong. So would it be true to say that, essentially, if you have a large dataset and you're sampling from it,
which is what Pegasos does, right, then you're going to get a really, say, unbiased sample of the real world --
because if you actually have a large enough dataset, you essentially have the real world out there
that you're sampling from -- whereas if you don't do sampling, if you just take everything into account, you're
just going to pay for it in runtime (inaudible)? So essentially, think of it like: when you have a small
dataset, you potentially have a biased sample of the world, whereas if you have a large dataset which you
sample from, you probably get closer to an unbiased sample.
>> Nathan Srebro: So definitely at least the second thing you said exactly characterizes why Pegasos
gets this gain. Pegasos being a stochastic gradient method -- with stochastic methods, what we do is we have a
big bucket of examples, and at each iteration we pick one example from our bucket, look at it, do a tiny
computation, put it back, then pick another example and repeat. And so the best thing we could
hope for, instead of a bucket of examples, is to get fresh new samples from the world all the time. That's not
the situation: with a training set we're given a fixed bucket. But the bigger that bucket is, the
better it represents the world.
So we're better off, and we're not going to pay for it in terms of runtime -- the runtime is going to be the same,
but other things are going to be better: we're going to get better samples at each iteration and be able to
do a more effective step at each iteration. So that's definitely true. Comparing that to batch
methods is a bit trickier. Definitely, stochastic methods generally seem to have dominant
performance in machine learning types of applications, because in a sense we don't care about -- well, it's very
difficult to get very small optimization accuracy with stochastic methods.
The convergence rate -- if I were giving this talk at an optimization conference, I would be kicked out of the
room. I don't know if there are any OR people here; if there are, they're probably thinking, oh, what is he doing?
Because really, if we go back to the runtime guarantees, getting a performance guarantee which
scales as one over the optimization error is horrible. So from a pure optimization perspective, stochastic
methods are problematic. And the reason I think stochastic methods are better than batch methods --
and I do think this is a general principle, that stochastic methods are preferable to batch methods in a learning
setting -- is that we really don't care about getting very small optimization error. There's no point in
getting an optimization error which is much smaller than the error you have anyway because of
estimation error and approximation error.
And I'm not sure about this -- it's, I think, a more subtle issue, because in some cases batch methods are
better, and it's not so clear-cut. In learning settings, this type of analysis shows us that stochastic
methods dominate batch methods. There is also a nice analysis by (inaudible) that was for
unregularized linear prediction in low dimensions; it also looked at the data-laden regime in that case,
and got results showing that stochastic methods seem to be particularly appropriate for machine
learning applications because of that property.
>> How do you think about (inaudible)?
>> Nathan Srebro: I'm sorry? The data?
>> If you have incorrect labels.
>> Nathan Srebro: So that would come out. Note that we are allowing our -- we're saying that we have
some predictor that has large margin but also has some error. So we are allowing the -- we are allowing
there to be noise in the system. Of course, the more noise you have in your labels, then the more noise
you're going to have also in your answers.
>> (Inaudible) I don't see how you can get enough in, because --
>> Nathan Srebro: You can. That's my problem. We can kernelize Pegasos, but we don't get the same
runtime guarantees, because we do get a runtime dependence.
>> So are they working (inaudible) transition, if there is any, or knowing what your (inaudible).
>> Nathan Srebro: My guess is that you should limit the number of support vectors you have. So you should
kernelize it -- if you kernelize it in a straightforward way, you will have the dependence, so you should
kernelize but be very reluctant to add new support vectors. I don't know how to do this, so I'm going to say I
don't have an answer here for the kernelized case. A straightforward kernelization would have a runtime
dependence. I strongly believe it's possible to come up with a natural method that would have similar
scaling even for the kernelized case, but I don't know how to do that.
If you have a good idea how to do that --
>> Some of the problems that are based on inference -- the assumption there is that if you have more data it
appears to be normal because of the (inaudible). And that kind of fits in with the (inaudible), that the
problem fits in there because, with a normal distribution, finding the model is much easier.
>> Nathan Srebro: I'm not sure -- maybe we can talk about this offline. I'm not sure I agree with that,
because, for example, in all these problems you have multiple modes, and it's only normal around each
mode. And the real problem is finding the mode. But the real problem is finding --
>> (Inaudible).
>> Nathan Srebro: Well, usually that's the hard part. I mean, the combinatorial aspect of the problem --
finding the correct mode -- is the hard problem.
>> That's the big part. But like, say, the bayesian parts for the Gaussian, for instance.
>> The convex problems.
>> Not convex problems, but the iterations. The more successful approaches are the ones that
approximate with a Gaussian, so you can actually prove the posterior (inaudible) Gaussian model. Again, I'm
saying that that kind of fits in from a Bayesian-perspective analysis rather than from an optimization
perspective, where, as you see more and more data, your posterior is easier to estimate, easier to
approximate.
>> Nathan Srebro: Maybe we can talk about this. Other questions?
(Applause)