>> Sumit Gulwani: Hi everyone. I am very pleased to introduce Ankan Saha who is here visiting
us this summer from University of Chicago. He's been doing some cool internship work with
[inaudible] working on predictive web search now, but today he's going to talk about the usual
work that he's been doing for his PhD, optimization and I guess some Nesterov method stuff so
we are looking forward to an interesting talk.
>> Ankan Saha: Thanks. Thanks for the introduction Sumit. So this has been stuff that I have
been doing for like the last two years or so and I have been using some smoothing techniques
to basically optimize risk measures and look at various applications in machine learning where
this is applicable. At the heart of most machine learning problems you see regularized
minimization problems and more often than not this generally results in optimization problems
which are becoming increasingly larger in size, so just to give you a few examples, in the Hadron
Collider you'll have huge amounts of data, gigantic data sets of incredibly large dimensions. If
you look at the field of the internet, websites like Google, Facebook, Twitter, they are collecting gigantic amounts of data these days, and more often than not these can be very skewed data sets with very large dimensions and maybe very few data points, whereas the converse can also be true and it is often true actually. In MRI images as well as in most medical imaging problems, you will actually come across the large p, small n regime, where basically the data is very high dimensional but you don't actually get that many data points. With this deluge of big data it is very important to come up with efficient optimization schemes which are also faster, and more importantly, you want to come up with optimization methods which are suited to the specific problem and can actually give you faster rates of convergence or faster performance in real life. What we do in our work is incorporate this style of smoothing into various non-smooth objectives; I'll define all of these terms as I go along. What we do is basically we
give faster rates of convergence on a large number of existing machine learning problems and
to give you a few examples, our methods are applicable to simple binary SVMs, structured max-margin prediction, problems of finding minimum enclosing shapes, and something that I will be talking about in detail, efficient smoothing of multivariate scores, so problems like optimizing the ROC score and the precision-recall breakeven point and so on. Before I go into details I will give you a crash course, in one or two slides, of basically the convex analysis that I need for this talk. Most of you are aware that a convex function can be lower bounded by its linear approximation. In our talk we will mostly be concerned with a few classes of convex functions. Our universe will mostly consist of Lipschitz continuous functions, so basically the difference in function values is bounded by some constant times the distance between the points. Basically, points that are close will be close in terms of function value as well. Another particular class of
functions that we will basically be interested in are strongly convex functions. These are
functions that are actually lower bounded by a quadratic rather than the simple linear functions
that the convex functions are lower bounded by, so you can think of any standard quadratic
function as a strongly convex function, or as an example of a strongly convex function, but in
particular, the definition is that instead of just linear approximation, you now have quadratic
term in the lower bound as well. The analogous class is the class of functions which are upper bounded by such a quadratic approximation; these are exactly the functions that have a Lipschitz continuous gradient, and they are also known in the machine learning trade as smooth functions. These are two
particularly well behaved function classes for which there are much better rates of convergence
known in the optimization literature, which is what we will be looking at. To give you an
example, this on the left-hand side is an example of a function with Lipschitz continuous gradient, so this is the original function and at any point you can actually draw a quadratic which will upper bound the function. Similarly, for a strongly convex function, if you take the original function you can always form a quadratic which will lower bound the function, and then you can actually optimize that particular quadratic to give better rates of convergence. That corresponds to this figure, and to just give you an idea, this is the universe of Lipschitz continuous functions that we are looking at, and you can consider two classes of functions, the strongly convex functions and the smooth functions, and there will of course be functions in the intersection of the two, which can be optimized much better, so I'll give you some idea of the state-of-the-art rates. For standard Lipschitz continuous functions, mirror descent methods have already been around for the last 20 years or so and they give you order one by epsilon squared rates of convergence; basically, to get to an epsilon optimal solution, you would require order one by epsilon squared iterations. For strong…
>>: [inaudible] the concepts, sorry, so where it says epsilon, is that within epsilon of x, not of the F value?
>> Ankan Saha: No. Within epsilon of the F value, so this is actually not epsilon close to the optimal iterate, but epsilon close to the optimal function value. All of the epsilon optimal things that I'll talk about are in terms of function value. Basically, if you have functions which are both strongly convex as well as smooth, then you can show that simple projected gradient descent methods can actually converge in order log one by epsilon time,
whereas for general smooth functions, there have been these momentum-based methods that
have been pioneered by Nesterov for the last almost 30 years now that actually give you order
one by square root epsilon rates of convergence. But a lot of machine learning objectives are actually non-smooth, and therefore it remains a challenge how to bring some of these rates to machine learning problems in an effective way. I'll show you how we can--these
are generally black box rates, so you say if a function is strongly convex you can get these rates
of convergence. Yes?
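To make the rate comparison concrete, here is a small sketch that is not from the talk itself, comparing plain gradient descent with a momentum-based Nesterov-style update on a smooth convex least-squares problem; the data, problem sizes and the step size 1/L are illustrative assumptions only.

    import numpy as np

    # Illustrative sketch (not from the talk): plain gradient descent versus a
    # Nesterov-style accelerated update on a smooth convex least-squares objective.
    # The data, sizes and step size 1/L are assumptions chosen only to show the
    # roughly 1/k versus 1/k^2 decay behind the O(1/eps) and O(1/sqrt(eps)) counts.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))
    b = rng.standard_normal(200)
    L = np.linalg.norm(A, 2) ** 2                 # Lipschitz constant of the gradient
    f = lambda w: 0.5 * np.sum((A @ w - b) ** 2)
    grad = lambda w: A.T @ (A @ w - b)
    f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

    w_gd = np.zeros(50)                           # plain gradient descent iterate
    x = np.zeros(50)                              # accelerated iterate
    y = np.zeros(50)                              # look-ahead point (affine combination)
    for k in range(1, 501):
        w_gd = w_gd - grad(w_gd) / L
        x_new = y - grad(y) / L                   # gradient step at the look-ahead point
        y = x_new + (k - 1.0) / (k + 2.0) * (x_new - x)
        x = x_new
    print(f(w_gd) - f_star, f(x) - f_star)        # accelerated gap is much smaller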
>>: Will the second [inaudible]?
>> Ankan Saha: Yes. The second one requires both strong convexity and smoothness, so basically a well-conditioned problem. What you want is basically that if you have a particular structure defined in your
problems, you can actually get better rates of convergence than treating this as the black box case. That is what we will exploit in the various problems that we do. This is the structure
that I am actually talking about. So suppose you want to minimize non-differentiable objectives which look like this form. It is the sum of a strongly convex function F and the dual of a smooth function G; I'll use the terms function with Lipschitz continuous gradient and smooth function interchangeably, since they basically mean the same thing. This objective is actually the
sum of a strongly convex function and the dual of a smooth function, so to ease it out I will give
you various examples: most regularized risk minimization objectives that you see in machine learning can actually be expressed in this form. In particular, think of the standard SVM objective. This is your L2 regularizer, which is the norm squared, which is a strongly convex term, and the hinge loss can actually be expressed in this form; I will show you on one of the later slides how it can be expressed. Most regularized objectives, you can see, can actually be expressed in this form, and the dual can actually be written out like a max term, so if you
have an objective in this form, I will show you how you can actually smooth it to come up with better rates of convergence than what exists in the literature. If you look at this particular objective, you will see that this argmax might not be unique, and that actually results in this function being non-differentiable, causing problems in differentiating the objective and getting a well-defined gradient and so on. The key difficulty, as I mentioned, is the non-differentiability of the dual function G*, and here I was trying to minimize with respect to W. What you can do is basically push this max over the variables u outside and push the min over the W inside to get what would be the dual, basically the dual objective, and that actually looks like this. And it is interesting to note that if you have such a well-defined primal objective, the dual over here is smooth, so basically this dual is differentiable, so basically it has a Lipschitz continuous gradient. Throughout the talk, W will refer to the set of primal variables, so basically the variables of the primal objective, whereas u will refer to the set of dual variables, which is what I represented here. This is the, this is how the
problem looks geometrically. You have this function J that you want to minimize. It is non-smooth, so you have all of these kinks, and there is this smooth function D which is its dual, and you want to find its optimum. Now convex duality simply gives you that the primal will always lie above the dual, and so you are trying to get to this red star point. Now one of the ways of smoothing, of basically optimizing this primal, is to come up with a smooth surrogate of it, and you can smooth the primal in several different ways. I will give you one particular
example that actually leads to faster rates of convergence. The idea is you want to add a strongly convex function in the dual space, capital D, so what I do here is, previously you had this objective where you had a maximum over this part. Now I subtract out a strongly convex function in this space, so the idea is if you subtract a strongly convex function, now your argmax becomes uniquely defined, and once your argmax is uniquely defined you can take gradients, so this function J mu now is a smooth function. You can control how much strong convexity you are adding by this parameter mu, and if you send mu to zero, basically you're getting back the actual objective. Yep?
>>: I'm sorry. I am just trying to follow your logic here. I thought you just said that the primal problem is difficult but the dual is smooth, so you can solve the primal by solving the dual, since the dual is smooth?
>> Ankan Saha: The dual is smooth, but…
>>: So why then are you now smoothing the primal?
>> Ankan Saha: I want to solve the entire thing in the primal itself. I don't want to solve the
optimization problem in the dual.
>>: Why not?
>> Ankan Saha: There are two reasons for this. One is that if I solve the dual I will give you an iterative method to converge to the optimum of the dual, but that inherently does not give you a rate of how close, how fast you're going to the primal optimum. So there are a lot of algorithms where you would say, I am going at an order one by epsilon rate close to the dual optimum, but suppose I give you an intermediate primal iterate corresponding to any particular dual iterate, so suppose you have an alpha K; corresponding to that I give you a w K which is in the primal. This has no guarantee of how close w K is to w star, or even how close the primal value J of w K is to J of w star. Whereas what I will end up bounding is the duality gap, so I will bound the primal at the primal iterate against the dual at the dual iterate, so the distance from the optimum is sandwiched in between, so of course you will be going to the optimum in that case.
>>: Isn't that because you have a specific preference towards these iterative methods that only
get close? I mean if you would really solve the dual, you would be done, right?
>> Ankan Saha: I don't know what actual bounds they can give you. Suppose I say, okay, I get a particular alpha K such that D of alpha star minus D of alpha K is less than or equal to epsilon. I don't know how to translate that to a bound on J of w K minus J of w star. That is what is often done in practice and it tends to give good results in practice, but I don't know of any theoretical bounds correspondingly, whereas if you smooth the primal I can show you that what you end up getting is a nicely sandwiched bound between the primal and the dual. This kind of defines the smooth approximation. This is how
[inaudible] looks to you. You have this non-smooth primal J of w. What I just showed you is that I am subtracting out a non-negative strongly convex function, small d, which is also bounded by a script D, so what we have here is that small d is non-negative and is bounded by a script D. What we are getting over here is that J mu will always lie below J of w, and if you add mu times the upper bound on small d over here, it will always form an upper bound on the [inaudible] function, so what we end up getting is we are creating a smooth envelope of our non-smooth function.
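As a concrete instance of this envelope, here is a tiny sketch, my own example rather than one from the talk, using the hinge loss written as a max over u in [0, 1] with a quadratic prox; the value of mu and the grid are arbitrary.

    import numpy as np

    # Sketch (illustrative example, not the speaker's code): the hinge loss
    # max(0, 1 - z) equals max over u in [0, 1] of u * (1 - z), a max over a bounded
    # dual set. Subtracting the strongly convex prox (mu/2) * u^2 inside the max makes
    # the argmax unique and gives a smooth surrogate sandwiched as described above.
    def hinge(z):
        return np.maximum(0.0, 1.0 - z)

    def smoothed_hinge(z, mu):
        u = np.clip((1.0 - z) / mu, 0.0, 1.0)     # the now-unique argmax
        return u * (1.0 - z) - 0.5 * mu * u ** 2

    mu, D = 0.1, 0.5                              # D = max of (1/2) u^2 over [0, 1]
    z = np.linspace(-2.0, 3.0, 1001)
    gap = hinge(z) - smoothed_hinge(z, mu)
    # J_mu <= J <= J_mu + mu * D: a smooth envelope whose tightness is set by mu
    assert np.all(gap >= -1e-12) and np.all(gap <= mu * D + 1e-12)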
>>: These [inaudible]?
>> Ankan Saha: The small d is strongly convex.
>>: How can it be bounded if it is lower bounded by a [inaudible]?
>>: [inaudible] has to be a bounded [inaudible].
>> Ankan Saha: It has to be a bounded set.
[multiple speakers] [inaudible].
>> Ankan Saha: There is a [inaudible] over here so basically, oh, I forgot to mention that. You are working in a bounded domain in the dual space basically, which is often the case. For example, if you look at the SVM, the dual space is basically a box; I mean for most machine learning problems that I know of, the dual is generally bounded or [inaudible] or something like
that. So you end up getting an upper bound as well, so the idea is, as you basically optimize this smooth function J mu and try to send mu to zero at the same time, you will end up optimizing the non-smooth objective. It will actually depend upon a trade-off between how well you are optimizing J mu and how fast you are sending mu to zero. The duality gap is basically just the value of the primal minus the dual, so the difference between the two. This can be upper bounded by this quantity just because J of w K is upper bounded by this term, and now you can basically say, okay, these are two smooth functions, I have a handle on this, so I bound this by epsilon by two and I choose mu K such that this term is less than or equal to epsilon by two; then the entire thing is less than or equal to epsilon, and that is basically the entire trick behind smoothing techniques. And what Nesterov's accelerated
gradient descent methods do is basically try to optimize the smooth objective J mu in order one by square root of epsilon iterations. What we found out in the course of experiments is that if you actually smooth out the objective in this way and solve the problem in the primal, you can actually throw state-of-the-art very fast solvers at it, so in our experiments we actually threw L-BFGS at the problem after smoothing it and it turns out it performs amazingly
fast compared to existing state-of-the-art methods of today. I will show you a slide for that as
well.
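The "smooth it and hand it to a fast solver" recipe might look roughly like the sketch below, which reuses the smoothed hinge idea inside a toy regularized SVM and calls scipy's L-BFGS-B; the synthetic data, lambda and mu are made-up values, and this shows only the general pattern, not the authors' code.

    import numpy as np
    from scipy.optimize import minimize

    # Sketch of the general pattern only: smooth the non-smooth loss, then hand the
    # smooth primal to an off-the-shelf quasi-Newton solver. Data, lambda and mu are
    # arbitrary; the talk picks mu on the order of epsilon over D.
    rng = np.random.default_rng(1)
    n, d = 500, 20
    X = rng.standard_normal((n, d))
    y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))
    lam, mu = 0.01, 0.01

    def obj_and_grad(w):
        z = y * (X @ w)                           # margins
        u = np.clip((1.0 - z) / mu, 0.0, 1.0)     # unique dual argmax per point
        loss = u * (1.0 - z) - 0.5 * mu * u ** 2  # smoothed hinge values
        val = 0.5 * lam * w @ w + loss.mean()
        g = lam * w - (X * (u * y)[:, None]).mean(axis=0)
        return val, g

    res = minimize(obj_and_grad, np.zeros(d), jac=True, method="L-BFGS-B")
    print(res.fun, res.nit)                       # smoothed primal value, iterations used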
>>: [inaudible].
>> Ankan Saha: So we basically smoothed it out. Then we tried various ways of optimizing the
smooth objective. We tried Nesterov’s original algorithm as well. We tried CVS option as well.
We tried L-BFGS as well and we saw that basically all the advantages of fast solvers are actually
captured by this method. To go into some applications I will basically talk about smoothing out
multivariate measures. In particular I will just talk about the ROC area here today. The idea is you have all of these measures like the ROC area and the precision-recall breakeven point which are very important in areas like natural language processing, speech recognition and sometimes vision, especially in the cases where the label classes are very skewed. Suppose you have a very large number of positive labels as compared to negative labels; it does not always make sense to look at accuracy, so people often try to capture the ROC area. One of the problems with these kinds of complicated measures is that they generally combine measures over the entire data set, and so more often than not they are not additive over individual data points, so it's not like the hinge loss that you calculate over individual data points. More often than not it is very difficult to apply online learning algorithms to them. This is kind of a misleading statement. You can actually use online learning algorithms if you define measures over pairs of data points for the ROC area score, but I'm right now talking about individual data points.
>>: [inaudible].
>> Ankan Saha: Pardon?
>>: You can transform the [inaudible] and use online while using your [inaudible] other point?
>> Ankan Saha: The one way I know of doing it is basically consider xi minus xj, where xi is a positive point and xj is a negative point and…
>>: [inaudible] like [inaudible]. But you can't seem to do that [inaudible].
>> Ankan Saha: Okay. Okay.
>>: [inaudible] can you do it [inaudible], so it's true to say that way.
>> Ankan Saha: Yeah, probably that's, I mean that was one of the reasons a lot of people were
initially asking well how do you justify your methods given that there are a lot of very fast
stochastically efficient methods around right now.
>>: [inaudible] if you doing [inaudible] you can do stochastic [inaudible] with [inaudible].
>> Ankan Saha: So basically end up going to a local minimum?
>>: [inaudible].
[laughter].
>>: [inaudible] you can do that [inaudible].
>> Ankan Saha: All right, so to go ahead with that, I will just explain briefly what the ROC score looks at. The idea is you define the concept of misclassified pairs. A misclassified pair is a pair i, j such that the scores that your model gives to xi and xj are reversed relative to the actual labels yi and yj; so suppose yj is a negatively labeled point and yi is a positively labeled point, but your model is actually giving a lower score to xi as compared to xj. The ROC area is basically defined as one minus the fraction of such misclassified pairs, where script P is basically the number of positive points and script N is the number of negative points. And Joachims [phonetic] had a very famous paper that was looking at a max-margin formulation for optimizing such multivariate scores, and what he did was he defined the loss basically in the product space of the xi's and xj's. The idea was you introduce these auxiliary variables z in this product space and you define a margin-based empirical risk. What this empirical risk is capturing is very simple: i will always index the positively labeled points and j will always index the negatively labeled points.
What you want is that your score over the positively labeled points should be greater than the score over the negatively labeled points by some margin delta; for this particular case, the margin is just one. But if that is not the case, so if you incur a loss, you want to punish it, so you have these auxiliary variables z, which are binary zero-one variables. If this term is actually positive, then your zij will be one, whereas in the other case your zij will be zero; so if you incur a loss your zij will be one, and if you don't incur a loss your zij will be zero. This is just trying to capture what the loss is, and then you will be minimizing over the particular model that you have. The goal is basically to find an epsilon accurate solution of such an empirical risk; you add a regularizer to that and you try to optimize it for the data set. The state-of-the-art algorithms that existed before were due to general cutting plane methods, so something like bundle methods for regularized risk minimization. They used to solve this problem in order one by lambda epsilon time, where lambda is basically the regularization parameter. If you are wondering, SVM [inaudible] is exactly a cutting plane method, which is basically often considered the state-of-the-art for optimizing this.
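For reference, the quantities just described can be computed by brute force over all positive-negative pairs as in the sketch below; this is my own illustration of the definitions, with the margin fixed at one, not the cutting plane solver itself.

    import numpy as np

    # Illustrative brute-force computation of the ROC area and the margin-based
    # pairwise empirical risk described above. Setting z_ij = 1 exactly when the
    # margin condition is violated attains the max over the binary z variables,
    # which is why the risk reduces to a pairwise hinge. Cost is O(P * N).
    def roc_area_and_pairwise_risk(scores, labels, margin=1.0):
        pos = scores[labels == +1]                # scores on positively labeled points
        neg = scores[labels == -1]                # scores on negatively labeled points
        diff = pos[:, None] - neg[None, :]        # P x N matrix of score differences
        misclassified = np.sum(diff < 0)          # a negative outranks a positive
        roc_area = 1.0 - misclassified / (len(pos) * len(neg))
        risk = np.sum(np.maximum(0.0, margin - diff)) / (len(pos) * len(neg))
        return roc_area, risk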
What we get is that our smoothing techniques help us get order one by square root lambda epsilon rates of convergence for the same objective. We show that we are faster in practice as well. We write this empirical loss as a maximum over these binary variables in the m dimensional space, and we note that this is actually equivalent to optimizing over fractional variables beta which belong to the entire hypercube, zero to one to the m, and you replace these zij's by the beta ij's over here. Then we see that if you look at the entire regularized objective, where this Omega can be the two-norm squared or something like that, this term can now be rewritten in the form that we wanted it to be, so we can actually write this as a sum of a strongly convex function F, which is just the regularizer, and this empirical risk can now be rewritten as the dual of a smooth function, where the dual is actually as simple as this. This is just some convex analysis. The dual can be written as a simple sum over the dual variables when they lie in the interval minus one to one, and it is infinity otherwise. And this nice transform comes out. This A is basically a transform on the primal variables. It is actually just a matrix such that the ij-th column is the difference between xi and xj. After we observe this we basically just apply Nesterov's accelerated gradient scheme and use mu equal to epsilon by D; note that mu, which controls the rate at which we decrease the strongly convex content in the dual, is dependent on epsilon. That is how epsilon affects our solution. I won't go into the
details of the accelerated gradient scheme of Nesterov, but what I want to say is that it is analogous to gradient descent methods, except that you don't take the gradient at the previous iterate; rather, you take the gradient at some kind of a combination of the last two iterates. That is what is often referred to as momentum-based methods in the optimization literature. So you take some kind of affine combination of the last two iterates and take the gradient of the function at that point. It is a first-order method and is shown to converge in this many iterations, where script D is basically the upper bound on the strongly convex function that we use and norm of A is basically the norm of just this matrix A that we are talking about. It is interesting to note that if you just throw any smoothing function at it, you will mess up this term: this script D times norm of A is not necessarily a constant. It is not independent of the kind of smoothing that you use. In particular, in our smoothing we had actually used this combination xi minus xj, so our beta ij nicely couples up over the pairs of positive and negative points, whereas we could have introduced individual variables for the positive points and individual variables for the negative points. It turns out that if you do something like that, this component ends up becoming dependent on the number of data points, so you will actually end up incurring more dependence on the number of data points than what we get. Basically, in our case script D times norm of A is actually a constant, so our number of iterations is actually completely independent of the number of data points. Whatever dependence on the number of data points we get is basically due to the gradient computation at each iteration, so it turns out that the smoothing needs to be done in a pretty non-intuitive way and you cannot just say, okay, we can throw any particular smoothing at it. To give you some intuition as to
how we look at the gradient evaluations: since it is a first order method it will of course go via calculating a gradient at every step, and the gradient at every step looks like this very complicated form. What it is doing is basically summing over pairs; I mean, over all positive points, and inside that sum there is another sum over the negative points, and these alpha hat ij's are actually medians of 1, 0 and this complicated term, where aj comes from the negative points and ai comes from the positive points. This basically comes up if you look at how to optimize SVM [inaudible] as well, so basically it is the standard gradient scheme that comes up in optimizing the ROC score in general. If you actually tried to calculate this gradient naïvely, what you will see is that you might end up getting order n squared dependence, because you're summing up over all the positive points and summing up individually over all of the negative points, and you have to do the same thing over here as well. So if you try to calculate the gradients naïvely you end up getting order n squared dependence, so there was this famous algorithm by Joachims himself. We actually see that it applies in our case as well. What it ends up doing is calculating this gradient in order nlogn time. The idea is you sort these terms, the aj's and ai's, in increasing order and then you basically calculate these internal sums in linear time, so it is a pretty simple bookkeeping algorithm which sorts these particular aj's and ai's, and you can see that it is very easy to keep track of them so that you can calculate these individual sums in order n time once you have sorted them; so basically the entire complexity is due to the sorting, which gives you order nlogn dependence to calculate the gradient. Yeah?
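The sort-then-sweep bookkeeping can be illustrated on the simpler job of counting misclassified pairs, as in the sketch below; the actual gradient computation in the talk additionally clips each term with a median, but the sorting idea that removes the order n squared double sum is the same.

    import numpy as np

    # Sketch of the sort-then-sweep bookkeeping idea on the simpler task of counting
    # misclassified (positive, negative) pairs; the O(n log n) cost comes from the sort,
    # and the sweep itself is linear. Ties are broken so that equal scores do not count.
    def misclassified_pairs(scores, labels):
        order = np.lexsort((labels, scores))      # by score; negatives first on ties
        count, positives_seen = 0, 0
        for idx in order:                         # sweep in increasing score order
            if labels[idx] == 1:
                positives_seen += 1
            else:
                count += positives_seen           # each positive seen so far scores below this negative
        return count

    rng = np.random.default_rng(2)
    s, lab = rng.standard_normal(1000), rng.choice([-1, 1], size=1000)
    pos, neg = s[lab == 1], s[lab == -1]
    assert misclassified_pairs(s, lab) == np.sum(pos[:, None] < neg[None, :])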
>>: [inaudible] example of the [inaudible] will you end up with pretty bad concentration or was
there a reason…
>> Ankan Saha: Theoretically you can sub sample, but depending upon the amount of
skewness in the labels that you have it can be a pretty bad estimate.
>>: So can you sample in a weighted way, where you take all the positives and take a sample of the negatives and then properly reweight it?
>> Ankan Saha: We tried simple scaling methods but that was not working well, and especially we tried doing experiments on the DNA data set, which is actually a pretty skewed one, and in that case the results were not that good if you were trying to calculate this that way.
>>: Did you ever consider any guarantee of like I mean if you use that sample [inaudible] were
you [inaudible] which means that these gains could be better than by changing the objective?
>> Ankan Saha: Yeah. We did not go to the…
>>: I'm just wondering if you were to sample how tightly [inaudible] converge the to the true
[inaudible]?
[inaudible]. [multiple speakers].
>> Ankan Saha: So do you do an independent subsampling before every iteration, or do you just subsample at the beginning and…
>>: [inaudible] you just take the sample you get. You just draw a sample instead of taking the
full O nlogn for [inaudible].
>>: [inaudible] optimization process [inaudible] because of prime changes that [inaudible] you
end up [inaudible] gradients which is [inaudible] solution.
>> Ankan Saha: This is the gradient method that I was talking about. You end up getting an order nlogn complexity [inaudible] for calculating the gradient, and interestingly the same bookkeeping method actually allows you to calculate the smooth function value as well at every iteration. The entire complexity [inaudible] turns out to be order nlogn. I just included this slide to show you the variety of the data sets that we were looking at. In particular, this DNA string data set is a monster data set and it has a very skewed ratio, and there are some other very large data sets that we looked at as well. For example, kddb was really large and OCR was very large as well. What we did is we compared with the standard cutting plane algorithms, in particular bundle methods for regularized risk minimization, and we tried various possible standard algorithms after smoothing, but in particular the result that I'll show you was from applying L-BFGS after smoothing. We basically calculated our smooth loss and gradient evaluations using the PETSc and TAO libraries from Argonne National Labs. It helps in faster matrix vector multiplication and stuff like that. These
are some of the curves that we have. The blue curve corresponds to our algorithm and the red curve corresponds to bundle methods, and you can clearly see that the top curves correspond to basically how fast the regularized risk is decreasing and this corresponds to generalization performance. You can see that in terms of both, the blue curve is way ahead of the red curve in terms of convergence. We actually used like 80% of the data for training and 20% for testing, so the generalization performance is significantly better than the bundle method. That's mostly about the ROC area problem. In the remainder of the talk I will try to show you how this particular smoothing technique can be applied to various other problems in… Yes, you have a question?
>>: You said, you compare with L-BFGS [inaudible] with this [inaudible]?
>> Ankan Saha: No. We actually smoothed the objective and ran L-BFGS as our optimization algorithm.
>>: [inaudible] the blue is…
>> Ankan Saha: Yeah, the blue is basically L-BFGS after applying smoothing [inaudible]
smoothing.
>>: Do you try [inaudible]?
>> Ankan Saha: Yeah [inaudible], we tried that but that was not as fast as [inaudible]. It was still faster than the red line. It was still faster than [inaudible], other cutting plane methods in general, which is kind of surprising because in various real-life experiments bundle methods actually attain curves which are closer to order log one by epsilon as compared to order one by epsilon. So their theoretical upper bound is order one by epsilon, but in most data sets their performance is actually pretty close to log one by epsilon; initially [inaudible] was conjecturing that probably it was a weakness in their bounds, but then we ended up showing lower bounds for bundle methods as well. That makes me wonder whether this is an artifact, I mean whether it is some kind of weakness in the smoothing analysis or something like that. These order one by square root of epsilon rates are optimal in terms
of first-order methods. That has been theoretically shown by Nemirovski, but on many real-life data sets they actually perform amazingly better compared to order one by square root epsilon; if you look at the curves it is also close to log one by epsilon, so I don't know if it's an
artifact of--or whether there is any stronger analysis and what that analysis would say.
>>: [inaudible] chosen?
>>: It suggests the upper bound.
>> Ankan Saha: Yes.
>>: You have [inaudible] particular problem, this upper bound [inaudible].
>> Ankan Saha: Yes, but then of course the lower bound is also for specific problems that you handpick; [inaudible] of course, in general cases it might be better. I'll talk about one other problem that on the surface looks completely different, so this is the problem of finding minimum enclosing convex shapes, in particular the minimum enclosing ball problem. The problem is very simple. You have been given n points in d dimensional space and you want to find the smallest ball which encloses all of these points. There are a lot of applications in data mining, machine learning, even statistics, and I was surprised to see--I reached this problem
by looking at SVM solvers, so there is something called the core vector machine or ball vector machine which came out around 2007. It was considered to be a very fast SVM solver; what it did was basically solve the dual of the minimum enclosing ball problem. The previous best algorithms for this are from the computational geometry community and they have this scheme of finding core sets, which is--I won't go into the details of that, but it is a pretty constructive approximation scheme that gives you one plus epsilon multiplicative approximation guarantees in order nd by epsilon time, where n is the number of points, d is the dimension, and epsilon is the accuracy.
What we end up showing is that we can get to an epsilon approximation in order nd by square root of epsilon time, using the same kind of smoothing techniques. To look at the particular problem, the minimum enclosing ball problem can be formulated in this simple way. You have this unknown radius that you want to minimize and you don't know the center of the ball either, so these are your variables, and what you want is that the distance of every point from the center should be less than or equal to the radius, so it is as simple as that. If you want to minimize R squared subject to this constraint for all xi's, you might as well write it down as the maximum over all the xi's of this term. If you open that out, it actually looks like this. This is a maximum over the xi's, so basically a maximum over a finite set, so you can equivalently replace it by a variable lying on the simplex, because the optimum over the simplex will actually lie at the corners, so you end up getting something like this. You can clearly see that this is the sum of a strongly convex function and the exact dual formulation that I was talking about. Now your dual variable actually lies in the simplex, which is again a closed, compact space. Once you have this formulation you can basically smooth it out and apply Nesterov's accelerated gradient scheme.
In particular we use a variant of Nesterov's accelerated gradient scheme; we just call it the accelerated gradient scheme for this problem. It is pretty similar, and the corresponding dual function looks like this form. Notice that the dual function is smooth over here, and in this case the duality gap is again given by this quantity, and the entire rate is basically obtained from how fast you are shrinking to zero the strong convexity parameter mu, the amount of strong convexity that you are adding. Basically mu K is going to zero at a quadratic rate over here, so that is giving you the rate of convergence. In this case sigma is basically the strong convexity parameter of the strongly convex function that you are adding, script D is the upper bound on that small d function, and you can just think of L as the norm of A transpose. These are the various parameters that come in. The reason I kept all of these parameters in the rate is that these parameters are important. If you don't smooth in the proper way, these parameters will bring in extra dependence on the number of data points or the dimension or something like that, and will result in suboptimal rates. It's important to do this very carefully so that you end up getting the best possible results. The reason I'm mentioning this is that I'll give you an example of it.
>>: [inaudible] bigger sigma than this [inaudible]? So you should gain by getting a larger sigma.
>> Ankan Saha: Yeah, but if you have such a d which has a very large strong convexity parameter, then this [inaudible] might increase as well. We end up getting a bound on script D [inaudible].
So basically I'll give you two examples to illustrate this point. This small d function, which is a strongly convex function, is often referred to as a prox function in the literature, so depending on what prox function you use you might get different kinds of rates. Of course we want to add a strongly convex function, but the definition of a strongly convex function depends on the norm on the domain that you are working with. Suppose that currently my domain is just the simplex, but suppose I endowed the simplex with the L2 norm; in that case the natural strongly convex function to add is the 2-norm squared, so suppose I had added the squared 2-norm distance from the center of the simplex. In that case the upper bound script D would have been sigma by 2, but this Lipschitz constant of the dual will actually now have a dependence on the maximum eigenvalue of the A A-transpose matrix, and in general, if your A is an n by d matrix, this thing can have a dependence on the number of data points. In particular, the convergence bound that you will get will be dependent upon this script L, which actually will have a dependence on n in the worst case. Whereas, if you actually endow the simplex with the L1 norm and choose your prox as the entropy function, which is strongly convex with respect to the L1 norm on the simplex, basically the upper bound now incurs a logarithmic dependence on the number of data points, but this Lipschitz constant now becomes independent of the number of data points, so it is just upper bounded by the maximum norm of the individual data points. In this case you basically end up getting just a logarithmic dependence on the number of data points, which gives us the order one by square root epsilon rates, so this is exactly the bound that we get for our [inaudible]. This will actually converge to epsilon accuracy in order one by square root epsilon time.
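To tie the entropy prox back to the sandwich from earlier, here is a small sketch, my own check rather than the speaker's code, using the standard closed form that the entropy-smoothed max over the simplex is a scaled log-sum-exp; the data and mu are arbitrary.

    import numpy as np
    from scipy.special import logsumexp

    # Sketch: for the minimum enclosing ball objective J(c) = max_i ||x_i - c||^2,
    # smoothing the max over the simplex with the (shifted, non-negative) entropy prox
    # gives J_mu(c) = mu * logsumexp(||x_i - c||^2 / mu) - mu * log(n). Since the prox
    # is bounded by D = log(n), we get J_mu <= J <= J_mu + mu * D, as discussed above.
    rng = np.random.default_rng(4)
    X = rng.standard_normal((300, 5))
    mu = 0.05

    def smoothed(c):
        sq = np.sum((X - c) ** 2, axis=1)
        return mu * logsumexp(sq / mu) - mu * np.log(len(X))

    def exact(c):
        return np.max(np.sum((X - c) ** 2, axis=1))

    c = rng.standard_normal(5)
    assert smoothed(c) <= exact(c) <= smoothed(c) + mu * np.log(len(X)) + 1e-9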
As I mentioned, the problem of minimum enclosing ball can actually be applied to various other problems; in particular, it can be applied to the problem of optimizing support vector machines, so in particular, this is the simple objective of support vector machines. You have the hinge loss plus a regularizer. SVMs are often solved in the dual in the existing literature, and what was done around 2005 and 2007 was people noticed that if you take the dual of the L2 SVM, where you are actually working with the hinge loss squared, the dual of that particular problem actually looks exactly like the dual of the minimum enclosing ball problem. In that case people actually used the existing algorithms for minimum enclosing ball, and they just used them to solve the SVM dual, and they noticed that it actually gives you fast convergence in real life when you have large data sets. I think at one point in time core vector machines were the fastest SVM solvers in practice; that was around ’07 or something. They were using the core-set algorithm, so they were getting order one by epsilon rates of convergence. If you use our scheme and just plug it in as a subroutine for optimizing this, you will end up getting order one by square root epsilon rates of convergence. That is for core vector machines, and one thing that might be of concern is that
what we are doing here corresponds to the L2 SVM, so basically the hinge loss squared. It turns out that you can even solve the standard SVM, so basically the L1 SVM, where you just have the hinge loss, using this Nesterov style smoothing. Just to give you some background, previous state-of-the-art batch solvers, so basically bundle methods, also solve the L1 SVM optimization problem and they give you order nd by lambda epsilon rates. You can also use stochastic methods like stochastic gradient descent, for example Pegasos, to solve the L1 SVM problem. They actually get independence of the number of data points, but they will give you order d by lambda epsilon rates of convergence. What we did was we wrote down the L1 SVM objective again as a strongly convex function plus the hinge loss as the dual of a smooth function. It turns out that the original function g is as simple as g of alpha equals the summation of the alpha i's, and if you take the dual of that you end up getting this [inaudible]. In this case the transform A is basically the product of the Y times X matrix, where X is the data points stacked up in a matrix. Basically, since your SVM can be written down in this formulation, you can just use Nesterov's smoothing and then use any accelerated gradient descent scheme or even a proximal gradient descent scheme, which is another variant of Nesterov's methods. They will both give you order n [inaudible] square root epsilon rates of convergence.
Yes?
>>: [inaudible] generalization [inaudible] epsilon yet then n the chosen [inaudible] something
about [inaudible]. [inaudible] and then compute the running time that you need and get the
same for [inaudible] and what happens there and that will be the correct [inaudible].
>> Ankan Saha: Yeah, so I mean we actually had this around 2009, but it is hard to put this into perspective with Pegasos, so what we did was compare with a batch version of Pegasos, and we beat the batch version of Pegasos straightaway, but then everybody asked why we were comparing with a batch version of Pegasos. Pegasos is supposed to be an online algorithm and our method is strictly a batch algorithm, so comparing it with the Pegasos algorithm, which chooses one point at every iteration, was like comparing apples and oranges. So the point that you made was exactly what we did after that, and it turns out that in terms of experiments our method is actually kind of in the same league as Pegasos. I'll show you quite a few algorithms in the next slide that we compared it
with. These are a bunch of algorithms that we compared it with. Batch Pegasos is basically the red line that we have. This is showing the primal function versus the number of iterations that it required to converge. Pegasos is basically the red line and ours is the blue line, so we did considerably better than the batch version of Pegasos, but there is this algorithm LIBLINEAR which you cannot even see over here; it is somewhere over here, so that is like way faster than anything else, way faster than our method, way faster than any of the existing batch solvers that we tried to compare it with. It basically uses something like an implicit method in the dual, so this is the LIBLINEAR algorithm by [inaudible]. What they do is they look at the dual and update one coordinate at a time, so basically dual coordinate [inaudible] in the dual, and they make the corresponding updates in the primal.
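A rough sketch of that dual coordinate descent idea is below; it follows the standard L1-SVM dual update, one coordinate at a time mirrored into the primal w, and is only meant to illustrate the scheme, not LIBLINEAR's optimized implementation.

    import numpy as np

    # Rough sketch of dual coordinate descent for the L1-SVM: keep w = sum_i alpha_i y_i x_i
    # in sync, and update one dual coordinate at a time with its closed-form clipped step.
    # This illustrates the idea only; it is not LIBLINEAR's actual implementation.
    def dual_cd_svm(X, y, C=1.0, epochs=20, seed=0):
        n, d = X.shape
        alpha, w = np.zeros(n), np.zeros(d)
        Qii = np.sum(X ** 2, axis=1)
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(n):
                if Qii[i] == 0.0:                 # skip degenerate all-zero rows
                    continue
                g = y[i] * (w @ X[i]) - 1.0       # partial derivative of the dual objective
                a_new = np.clip(alpha[i] - g / Qii[i], 0.0, C)
                w += (a_new - alpha[i]) * y[i] * X[i]
                alpha[i] = a_new
        return w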
It turns out they are way faster than any of the other algorithms, but our objective over here was that we actually get faster rates than all of the standard order one by epsilon algorithms that exist in the literature. So yes, that is as far as this is concerned, and these are some other algorithms, these are some other plots; we're just basically showing how fast the primal function decreases in terms of the time in seconds. One thing that should be noted is that our method over here is bounding the duality gap, whereas all of the other algorithms that we compare with are considered dual algorithms, so they will give you a bound on D of alpha star minus D of alpha K, where alpha star is chosen according to some ground truth, and then they just take that alpha K and get a w K corresponding to it using the formula that is true only at the optimum. Theoretically, there is no bound to say how far that w K actually is from w star. In experiments they perform reasonably well, pretty well, so that is what is generally taken as ground truth, but our approach is actually better justified both theoretically as well as practically, because you are bounding the duality gap and the distance to the optimum is sandwiched in there. I am pretty close to the end of my talk. Some other
applications where I actually applied these things include the problem of finding the minimum enclosing convex polytope, which is again a computational geometry problem. Basically the problem is you have a set of points, and you have been given a fixed polytope, and you want to find the minimum magnification of the polytope that is required to enclose those points. It turns out this problem can also be applied to certain active learning problems, and it has actually been used by Dan Roth in finding active learning algorithms for basically active learning SVM problems, and you can convert it into the convex [inaudible] problem. And our methods can also be applied to other max-margin based problems; in particular, if you look at structured SVMs, then our methods go through over there as well, basically calculating all of the marginals and sub-marginals can be done efficiently in our setting as well, and we end up getting better rates than the state-of-the-art methods for structured SVMs, which are again order one by epsilon, due to Michael Collins and some of his co-authors. We beat that and got order one by square root epsilon rates. And there is the problem of finding the polytope distance, which is basically looking at two polytopes and finding the distance between them, and that is also a place where we can apply our algorithms; so basically all of these algorithms previously had rates of order one by epsilon to get epsilon close to the optimum. We improve all of them to get order one by square root epsilon rates. In
conclusion, what we showed is that we can get improved rates for various machine learning objectives by using this particular smoothing style due to Nesterov. The key message is that you need not necessarily use Nesterov's algorithm itself; the key thing is the particular kind of smoothing that you use. Once you smooth an objective, you can throw any good solver at it after that and get very good experimental convergence in practice, and this is true for very large data sets as well, so it is often very relevant. As part of future work we want to see if appropriate smoothing can mimic the performance of second-order methods, since we are already using L-BFGS; we basically say, okay, if we smooth in this particular way we can actually get second-order behavior and not incur the cost of calculating the Hessian, so basically we get performance as good as a second-order solver just by doing the smoothing and then applying something which is cheaper to calculate at every iteration. Another thing we want to do is apply these smoothing techniques to more complicated measures, like ranking measures such as NDCG and precision@k, or even the F1 score; the smoothing methods that I talked about we can currently do efficiently only for the ROC area and the precision-recall breakeven point. It turns out you have to handpick the smoothing technique for each of them; there is no general smoothing technique that will apply across the board. So far, even for something as simple as the F1 score, we don't have a smoothing technique that gives you better rates. So that is also potential future work right now. Yeah, so that's the end of the talk.
[applause].
>> Ankan Saha: Any questions?
>>: On the last experiment you showed that where the [inaudible] fast [inaudible] so since you
already…
>> Ankan Saha: [inaudible] linear.
>>: Since you already smoothed [inaudible] here lower batch, did you try the L-BFGS on that
one?
>> Ankan Saha: We tried L-BFGS on that. It was much faster than our algorithm. So this graph that I showed, the blue line, is due to Nesterov's algorithm; it is not due to L-BFGS. It comes closer to LIBLINEAR, but LIBLINEAR is still faster, and that is because the LIBLINEAR code that is in the repositories is very heavily optimized code. I mean it uses a lot of caches and stuff like that, and it has like a separate cache to do something and so on. I'm not exactly sure why LIBLINEAR is so fast. It says that theoretically it should be log one by epsilon.
>>: [inaudible]. [inaudible] gradient just to blend the volume [inaudible].
>>: [inaudible] goes faster than the [inaudible].
>>: And the average [inaudible].
>>: Yes, yes [inaudible].
>>: [inaudible] John performed this.
>>: [inaudible] call him John [inaudible]. [laughter].
>>: [inaudible] after he run the L-BFGS [inaudible].
>>: Are you doing something different from Pegasos or…
>> Ankan Saha: I don't waste my time looking for [inaudible]. [inaudible] each [inaudible].
>>: [inaudible].
>>: What's the power [inaudible] and learning [inaudible]?
>>: [inaudible] the same. It's just a starting point for those.
>> Ankan Saha: Yeah. We did not compare with [inaudible].
>>: I hope you have [inaudible] here. There is still epsilon where the…
>>: [inaudible] if you want. It still [inaudible] direction [inaudible].
>>: So I think it's square root epsilon [inaudible].
>>: Square root of epsilon [inaudible].
>>: Sorry, epsilon squared.
>>: [inaudible].
>> Ankan Saha: [inaudible] for experiments we tried but it is [inaudible] impossible [inaudible]
and so basically we would start with [inaudible] epsilon by [inaudible] and then take 10 times
that 10 squared times that times one [inaudible].
>>: I think the reason why [inaudible] to be epsilon [inaudible] is you want to prove the rate
and [inaudible]. When you apply in the L-BFGS, I suspect that if you are starting over
[inaudible] mu, then gradually you increase that.
>> Ankan Saha: But there is a sweet spot. If you keep [inaudible] increasing it, at some point of time it becomes worse, so basically I think it's an artifact of the epsilon that you want to go to, of how close you want to go to the optimal solution as well, and it's not necessarily a well-defined thing. If you work with 500 points and you work with a particular epsilon, the mu that
you need to set it for is often different than when you are working with 1 million points. You
know what I mean?
>> Sumit Gulwani: Any more questions?
>> Ankan Saha: Thanks. [applause].