>> Sumit Gulwani: Hi everyone. I am very pleased to introduce Ankan Saha, who is here visiting us this summer from the University of Chicago. He's been doing some cool internship work with [inaudible] working on predictive web search, but today he's going to talk about the work that he's been doing for his PhD, optimization and I guess some Nesterov method stuff, so we are looking forward to an interesting talk. >> Ankan Saha: Thanks. Thanks for the introduction, Sumit. This is work that I have been doing for the last two years or so; I have been using smoothing techniques to optimize risk measures and looking at various applications in machine learning where this is applicable. At the heart of most machine learning problems you see regularized risk minimization problems, and more often than not these result in optimization problems which are becoming increasingly large in size. Just to give you a few examples, at the Hadron Collider you have huge amounts of data, gigantic data sets of incredibly large dimension. If you look at the internet, websites like Google, Facebook, and Twitter are collecting gigantic amounts of data these days, and more often than not these can be very skewed data sets with very large dimensions and maybe very few data points, whereas the converse can also be true, and it often is. In MRI images, as well as in most medical imaging problems, you actually come across the large-p, small-n regime, where basically the data is very high dimensional but you don't get that many data points. With this deluge of big data it is very important to come up with efficient optimization schemes which are faster and, more importantly, which are suited to the specific problem and can actually give you faster rates of convergence or faster performance in real life. What we do in our work is incorporate this style of smoothing into various nonsmooth objectives. I'll define all of these terms as I go along. What we do is basically give faster rates of convergence for a large number of existing machine learning problems. To give you a few examples, our methods are applicable to simple binary SVMs, structured max-margin prediction, problems of finding minimum enclosing shapes, and something that I will be talking about in detail, efficient smoothing of multivariate scores, so problems like optimizing the ROC area and the precision-recall breakeven point and so on. Before I go into details I will give you a crash course, in one or two slides, of the convex analysis that I need for this talk. As most of you are aware, a convex function can be lower bounded by its linear approximation. In this talk we will mostly be concerned with a few classes of convex functions. Our universe will mostly consist of Lipschitz continuous functions, so basically the difference in function values is bounded by some constant times the distance between the points. Basically, points that are close will also be close in terms of function value. Another particular class of functions that we will be interested in are strongly convex functions.
These are functions that are lower bounded by a quadratic rather than the simple linear approximation that convex functions are lower bounded by, so you can think of any standard quadratic function as an example of a strongly convex function, but in general the definition is that instead of just the linear approximation, you now have a quadratic term in the lower bound as well. To form an analog, consider the class of functions which are upper bounded by such a quadratic approximation; these are exactly the functions that have a Lipschitz continuous gradient, and they are also known in the machine learning trade as smooth functions. These are two particularly well behaved function classes for which much better rates of convergence are known in the optimization literature, which is what we will be looking at. To give you an example, on the left-hand side is an example of a function with Lipschitz continuous gradient, so this is the original function, and at any point you can draw a quadratic which will upper bound the function. Similarly, for a strongly convex function, if you take the original function you can always form a quadratic which will lower bound the function, and then you can actually optimize that particular quadratic to give better rates of convergence. That corresponds to this figure, and to give you an idea, this is the universe of Lipschitz continuous functions that we are looking at, and you can consider two classes of functions, the strongly convex functions and the smooth functions, and there will of course be functions in the intersection which can be optimized even better, so I'll give you some idea of the state-of-the-art rates. For standard Lipschitz continuous functions, mirror descent methods have existed for the last 20 years or so and they give you order one by epsilon squared rates of convergence; basically, to get to an epsilon optimal solution you would require order one by epsilon squared iterations. For strong… >>: [inaudible] the concepts, sorry, so where it says epsilon, within epsilon of X, or in the F value? >> Ankan Saha: No. Within epsilon in the F value, so this is actually not epsilon close to the optimal iterate, but epsilon close to the optimal function value. All of the epsilon optimal things that I'll talk about are in terms of function value. Basically, if the function is both strongly convex as well as smooth, then you can show that simple projected gradient descent methods can actually converge in order log one by epsilon time, whereas for general smooth functions there have been these momentum-based methods, pioneered by Nesterov for almost 30 years now, that actually give you order one by square root epsilon rates of convergence. But a lot of machine learning objectives are actually non-smooth, and therefore it remains a challenge how to apply some of these rates to machine learning problems in an effective way. I'll show you how we can--these are generally black box rates, so you say if a function is strongly convex you can get these rates of convergence. Yes? >>: Will the second [inaudible]? >> Ankan Saha: Yes. The second one expects both strong convexity and smoothness, so basically a well-conditioned function.
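For reference, the three conditions being described can be written out as follows; this is a summary in standard textbook notation (L the Lipschitz constant, sigma the strong convexity parameter), not a verbatim slide from the talk:

\[
\begin{aligned}
&\text{Lipschitz continuity:} && |f(x) - f(y)| \le L\,\lVert x - y\rVert,\\
&\text{strong convexity:} && f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{\sigma}{2}\lVert y - x\rVert^2,\\
&\text{Lipschitz continuous gradient (smoothness):} && f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \tfrac{L}{2}\lVert y - x\rVert^2,
\end{aligned}
\]

with the black-box iteration counts quoted above: \(O(1/\epsilon^2)\) for merely Lipschitz functions (mirror or subgradient descent), \(O(\log(1/\epsilon))\) for functions that are both smooth and strongly convex (projected gradient descent), and \(O(1/\sqrt{\epsilon})\) for general smooth functions (Nesterov's accelerated methods).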
What you want is basically that if you have a particular structure in your problem, you can actually get better rates of convergence than treating it as a black box. That is what we exploit in the various problems that we work on. This is the structure that I am talking about. Suppose you want to minimize a non-differentiable objective which looks like this. It is the sum of a strongly convex function F and the dual of a smooth function G; I'll use the terms "function with Lipschitz continuous gradient" and "smooth function" interchangeably, since they mean the same thing. This objective is the sum of a strongly convex function and the dual of a smooth function, and to flesh it out I will give you various examples: most regularized risk minimization objectives that you see in machine learning can actually be expressed in this form. In particular, think of the standard SVM objective. This is your L2 regularizer, which is the norm squared, a strongly convex term, and the hinge loss can actually be expressed in this form; I will show you on one of the later slides how. Most regularized objectives can actually be expressed in this form, and the dual term can be written out as a max, so if you have an objective in this form I will show you how you can smooth it to come up with better rates of convergence than what exists in the literature. If you look at this particular objective, you will see that this argmax might not be unique, and that actually results in the function being non-differentiable, causing problems in differentiating the objective and getting a well-defined gradient and so on. The key difficulty, as I mentioned, is the non-differentiability of the dual function G*, and here I was trying to minimize with respect to W. What you can do is push the max over the variables u outside and push the min over W inside to get what would be the dual objective, and that looks like this. It is interesting to note that if you have such a well-defined primal objective, the dual over here is smooth, so basically this dual is differentiable, it has a Lipschitz continuous gradient. Throughout the talk W will refer to the set of primal variables, so basically the primal objective, whereas u will refer to the set of dual variables, which is what I represented here. This is how the problem looks geometrically. You have this function J that you want to minimize. It is nonsmooth, so you have all of these kinks, and there is this smooth function D which is its dual, and you want to find its optimum. Now weak duality simply gives you that the dual will always lie below the primal, and so you are trying to get to this red star point. Now one of the ways of smoothing, of basically optimizing this primal, is to come up with a smooth surrogate of it, and you can smooth the primal in several different ways. I will give you one particular example that actually leads to faster rates of convergence. The idea is you want to add a strongly convex function in the dual space, so what I do here is: previously you had this objective where you had a maximum over this part.
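Written out explicitly, the composite structure being referred to here (with F strongly convex, G smooth, G* its Fenchel dual, and A a linear map; the symbols are shorthand consistent with the talk's description rather than a verbatim slide) is

\[
J(w) \;=\; F(w) + G^{\star}(Aw) \;=\; F(w) + \max_{u}\,\big[\langle Aw,\, u\rangle - G(u)\big],
\]

and pushing the max outside and the min inside gives the smooth dual

\[
D(u) \;=\; -\,G(u) + \min_{w}\,\big[F(w) + \langle Aw,\, u\rangle\big], \qquad D(u) \le J(w)\ \text{for all } w, u.
\]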
Now I subtract out a strongly convex function in this space, so the idea is that if you subtract a strongly convex function, your argmax becomes uniquely defined, and once your argmax is uniquely defined you can take gradients, so this function J_mu is now a smooth function. You can control how much strong convexity you are adding through this parameter mu, and if you send mu to zero you basically get back the actual objective J. Yep? >>: I'm sorry. I am just trying to follow your logic here. I thought you just said that the primal problem is difficult but the dual is smooth, so you can solve the primal by solving the dual, or is the dual smooth? >> Ankan Saha: The dual is smooth, but… >>: So why then are you now smoothing the primal? >> Ankan Saha: I want to solve the entire thing in the primal itself. I don't want to solve the optimization problem in the dual. >>: Why not? >> Ankan Saha: There are two reasons for this. One is that if I solve the dual I will give you an iterative method that converges to the optimum of the dual, but that inherently does not give you a rate for how fast you are getting to the primal optimum. There are a lot of algorithms where you would say I am going at order one by epsilon rate close to the dual optimum, but suppose I give you an intermediate iterate corresponding to any particular dual iterate, so suppose you have an alpha_k; corresponding to that I give you a w_k which is in the primal. This has no guarantee of how close w_k is to w*, or even how close the primal value J(w_k) is to J(w*). Whereas, what I will end up bounding is the duality gap, so I will end up bounding the primal value at the primal iterate minus the dual value at the dual iterate, so the distance from the optimum is sandwiched in between, and of course you will be going to the optimum in that case. >>: Isn't that because you have a specific preference towards these iterative methods that only get close? I mean if you would really solve the dual, you would be done, right? >> Ankan Saha: I don't know what bounds they can actually give you. Suppose I say, okay, I get a particular alpha_k such that D(alpha*) minus D(alpha_k) is less than or equal to epsilon. I don't know how to translate that into a bound from J(w_k) to J(w*). That is what is often done in practice and it tends to give good results in practice, but I don't know of any theoretical bounds corresponding to it, whereas if you smooth the primal, I can show you that what you end up getting is a nicely sandwiched bound between the primal and the dual. This kind of defines the smooth approximation. This is how [inaudible] looks to you. You have this non-smooth primal J(w). What I just showed you is that I am subtracting out a non-negative strongly convex function, small d, which is also bounded by a script D, so what we have here is that small d is non-negative and is bounded by script D. What we get over here is that J_mu will always lie below J(w), and if you add mu times the upper bound on small d over here, it will always form an upper bound [inaudible] function, so what we end up getting is a smooth envelope of our non-smooth function. >>: These [inaudible]? >> Ankan Saha: The small d is strongly convex. >>: How can it be bounded if it is lower bounded by a [inaudible]? >>: [inaudible] has to be a bounded [inaudible]. >> Ankan Saha: It has to be a bounded set. [multiple speakers] [inaudible].
>> Ankan Saha: There is a [inaudible] over here, so basically, oh, I forgot to mention that. You are working in a bounded domain in the dual space, which is often the case. For example, if you look at the SVM, the dual space is basically a box; I mean, for most machine learning problems that I know of the dual is generally bounded, or [inaudible] or something like that. So you end up getting an upper bound as well, and the idea is that as you optimize this smooth function J_mu and try to send mu to zero at the same time, you will end up optimizing the non-smooth objective. It will actually be a trade-off between how well you are optimizing J_mu and how fast you are sending mu to zero. The duality gap is basically just the value of the primal minus the dual, the difference between the two. This can be upper bounded by this quantity, just because J(w_k) is upper bounded by this term, and now you can basically say, okay, these are two smooth functions, I have a handle on this, so I bound this by epsilon by two and I choose mu_k such that this term is less than or equal to epsilon by two; then the entire thing is less than or equal to epsilon, and that is basically the entire trick behind smoothing techniques. What Nesterov's accelerated gradient descent methods do is basically optimize the smooth objective J_mu in order one by square root epsilon iterations. What we found out in the course of experiments is that if you smooth out the objective in this way and solve the problem in the primal, you can actually throw state-of-the-art very fast solvers at it, so in our experiments we actually threw L-BFGS at the problem after smoothing it, and it turns out it performs amazingly fast compared to existing state-of-the-art methods. I will show you a slide for that as well. >>: [inaudible]. >> Ankan Saha: So we basically smoothed it out. Then we tried various ways of optimizing the smooth objective. We tried Nesterov's original algorithm as well. We tried CVS option as well. We tried L-BFGS as well, and we saw that basically all the advantages of fast solvers are captured by this method. To go into some applications, I will talk about smoothing out multivariate measures. In particular I will just talk about the ROC area here today. The idea is you have all of these measures, like the ROC area and the precision-recall breakeven point, which are very important in areas like natural language processing, speech recognition, and sometimes vision, especially in cases where the label classes are very skewed. Suppose you have a very large number of positive labels compared to negative labels; it does not always make sense to measure accuracy, so people often try to optimize the ROC area instead. One of the problems with these kinds of complicated measures is that they generally combine measures over the entire data set, and so more often than not they are not additive over individual data points; it's not like the hinge loss that you calculate over individual data points. More often than not it is very difficult to apply online learning algorithms to them. This is kind of a misleading statement: you can actually use online learning algorithms if you define measures over pairs of data points for the ROC area, but I'm right now talking about individual data points. >>: [inaudible]. >> Ankan Saha: Pardon?
>>: You can transform the [inaudible] and use online while using your [inaudible] other point? >> Ankan Saha: The one way I know of doing it is basically to consider x_i [inaudible] where x_i is a positive point and x_j is a negative point and… >>: [inaudible] like [inaudible]. But you can't seem to do that [inaudible]. >> Ankan Saha: Okay. Okay. >>: [inaudible] can you do it [inaudible], so it's true to say that way. >> Ankan Saha: Yeah, probably that's, I mean, that was one of the reasons a lot of people were initially asking how do you justify your methods, given that there are a lot of very fast stochastic methods around right now. >>: [inaudible] if you doing [inaudible] you can do stochastic [inaudible] with [inaudible]. >> Ankan Saha: So basically end up going to a local minimum? >>: [inaudible]. [laughter]. >>: [inaudible] you can do that [inaudible]. >> Ankan Saha: All right, so to go ahead with that, I will just explain briefly what the ROC score looks at. The idea is you define the concept of misclassified pairs. A misclassified pair is a pair (i, j) such that the scores your model gives to x_i and x_j are reversed relative to the actual labels y_i and y_j, so suppose y_j is a negatively labeled point and y_i is a positively labeled point, but your model gives a lower score to x_i than to x_j. The ROC area is basically defined as one minus the fraction of such misclassified pairs, so here script P is basically the set of positive points and script N is the set of negative points. Joachims had a very famous paper that looked at a max-margin formulation for optimizing such multivariate scores, and what he did was define measures in the product space of the x_i's and x_j's. The idea is you introduce these auxiliary variables z in this product space and define a margin-based empirical risk. What this empirical risk is capturing is very simple. I will always use i as an index over the positively labeled points and j as an index over the negatively labeled points. What you want is that your score over the positively labeled points should be greater than the score over the negatively labeled points by some margin delta; for this particular case the margin is just one. If that is not the case, so if you incur a loss, you want to punish it, so you have these auxiliary variables z_{ij}, which are binary variables in {0, 1}. If this term is positive, then your z_{ij} will be one, whereas in the other case your z_{ij} will be zero; so if you incur a loss your z_{ij} will be one and if you don't incur a loss your z_{ij} will be zero. This is just trying to capture what the loss is, and then you minimize over the particular model that you have. The goal is basically to find an epsilon accurate solution of such an empirical risk; you add a regularizer to it and try to optimize it over the data set. The state-of-the-art algorithms that existed before were general cutting plane methods, something like bundle methods for regularized risk minimization. They solve this problem in order one by lambda epsilon time, where lambda is basically the regularization parameter.
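As a concrete reference for the pairwise definition just given, here is a minimal sketch in Python of the ROC area as one minus the fraction of misclassified pairs. The function name and the naive double loop over all positive-negative pairs are purely illustrative (the point made later in the talk is precisely that smarter bookkeeping avoids touching every pair), and whether ties count as errors is a convention I have assumed here.

import numpy as np

def roc_area(scores, labels):
    # ROC area = 1 - (# misclassified pairs) / (|P| * |N|), where a pair
    # (i, j) with labels (+1, -1) is misclassified when x_i is scored
    # no higher than x_j.
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]     # scores of the positively labeled points
    neg = scores[labels == -1]    # scores of the negatively labeled points
    misclassified = sum(1 for a in pos for b in neg if a <= b)
    return 1.0 - misclassified / (len(pos) * len(neg))

# Example: three positives, two negatives -> 1 - 1/6.
print(roc_area([2.0, 1.5, 0.2, 1.0, 0.1], [1, 1, 1, -1, -1]))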
If you are wondering, SVM-Perf is exactly such a cutting plane method, which is often considered the state of the art for optimizing this. What we get is that our smoothing techniques help us get order one by square root lambda epsilon rates of convergence for the same objective, and we show that we are faster in practice as well. We write this empirical loss as a maximum over these binary variables in the m-dimensional space, and we note that this is actually equivalent to optimizing over fractional variables beta which belong to the entire cube [0, 1]^m; you replace the z_{ij}'s by the beta_{ij}'s over here. Then we see that if you look at the entire regularized objective, where this Omega can be the two-norm squared or something like that, this term can now be rewritten in the form that we wanted: we can write it as the sum of a strongly convex function F, which is just the regularizer, and this empirical risk, which can now be rewritten as the dual of a smooth function, where the dual is actually as simple as this. This is just some convex analysis; the dual can be written as a simple sum over the dual variables when they lie in the interval [-1, 1], and it is infinity otherwise. And this nice transform comes out. This A is basically a transform on the primal variables; it is just a matrix such that the (i, j)-th column is the difference between x_i and x_j. After we observe this, we basically just apply Nesterov's accelerated gradient scheme and set mu equal to epsilon by D; note that mu, which controls the amount of strong convexity we add in the dual, depends on epsilon. That is how epsilon affects our solution. I won't go into the details of Nesterov's accelerated gradient scheme, but what I want to say is that it is analogous to gradient descent methods, except that you don't take the gradient at the previous iterate; rather, you take some kind of affine combination of the last two iterates and evaluate the gradient at that point. That is what is often referred to as a momentum-based method in the optimization literature. It is a first-order method and is shown to converge in this many iterations, where script D is basically the upper bound on the strongly convex function that we use and norm of A is basically the norm of this matrix A that we are talking about. It is interesting to note that if you just throw any smoothing function at it, you will mess up this contribution. This product, D times the norm of A, is not necessarily a constant; it is not independent of the kind of smoothing that you use. In particular, in our smoothing we actually used this combination x_i minus x_j, so our beta_{ij} nicely couples the pairs of positive and negative points, whereas we could have introduced individual variables for the positive points and individual variables for the negative points. It turns out that if you do something like that, this component ends up becoming dependent on the number of data points, so you will actually end up incurring more dependence on the number of data points than what we get. Basically, in our case script D times the norm of A is actually a constant, so our number of iterations is completely independent of the number of data points.
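To make the "affine combination of the last two iterates" concrete, here is a minimal sketch of a generic accelerated gradient loop in the common (k - 1)/(k + 2) momentum form. The function names, the fixed 1/L step size, and the toy quadratic are illustrative assumptions; this is not the exact variant or the constants used in the talk's experiments.

import numpy as np

def accelerated_gradient(grad, x0, lipschitz, num_iters):
    # Nesterov-style accelerated gradient descent on a smooth function.
    x_prev = x0.copy()
    x_curr = x0.copy()
    for k in range(1, num_iters + 1):
        # Momentum step: take the gradient at an affine combination of the
        # last two iterates rather than at the current iterate itself.
        y = x_curr + (k - 1) / (k + 2) * (x_curr - x_prev)
        x_next = y - grad(y) / lipschitz
        x_prev, x_curr = x_curr, x_next
    return x_curr

# Toy usage: minimize 0.5 * ||A x - b||^2, whose gradient is A^T (A x - b) and
# whose gradient-Lipschitz constant is the largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A).max()
x_opt = accelerated_gradient(lambda x: A.T @ (A @ x - b), np.zeros(2), L, 200)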
The only dependence on the number of data points that we do get is from the gradient computation at each iteration, so it turns out that the smoothing needs to be done in a pretty non-intuitive way; you cannot just say, okay, we can throw any particular smoothing at it. To give you some intuition about the gradient evaluations: since it is a first-order method it will of course proceed by calculating the gradient at every step, and the gradient at every step looks like this very complicated form. What it is doing is basically summing over pairs, I mean over all positive points, and inside that sum there is another sum over the negative points, and these alpha-hat_{ij}'s are actually medians of 1, 0, and this complicated term, where a_j comes from the negative points and a_i comes from the positive points. This basically comes up if you look at how to optimize SVM [inaudible] as well, so it is the standard gradient that comes up in optimizing the ROC score in general. If you tried to calculate this gradient naïvely, you might end up with order n squared dependence, because you're summing over all the positive points and, inside, over all of the negative points, and you have to do the same thing over here as well. So if you try to calculate the gradients naïvely you end up getting order n squared dependence, but there was this famous algorithm by Joachims himself, and we see that it applies in our case as well. What it ends up doing is calculating this gradient in order n log n time. The idea is you sort these terms, the a_j's and a_i's, in increasing order, and then you calculate these internal sums in linear time. It is a pretty simple bookkeeping algorithm which sorts these particular a_j's and a_i's, and you can see that it is very easy to keep track of them so that you can calculate these individual sums in order n time once you have sorted them; basically the entire complexity is due to the sorting, which gives you order n log n dependence for calculating the gradient. Yeah? >>: [inaudible] example of the [inaudible] will you end up with pretty bad concentration or was there a reason… >> Ankan Saha: Theoretically you can subsample, but depending upon the amount of skewness in the labels that you have it can be a pretty bad estimate. >>: So can you sample in a weighted way: if you take all the positives and take a sample of all the negatives and if it's negative and then properly reweight it? >> Ankan Saha: We tried simple scaling methods but that was not working well, especially when we tried experiments on the DNA data set, which is actually a pretty skewed one; in that case the results were not that good if you were trying to calculate this that way. >>: Did you ever consider any guarantee, like, I mean, if you use that sample [inaudible] were you [inaudible] which means that these gains could be better than by changing the objective? >> Ankan Saha: Yeah. We did not go to the… >>: I'm just wondering if you were to sample how tightly [inaudible] converge to the true [inaudible]? [inaudible]. [multiple speakers]. >> Ankan Saha: So do you do an independent subsampling before every iteration, or do you just subsample at the beginning and… >>: [inaudible] you just take the sample you get. You just draw a sample instead of taking the full O(n log n) for [inaudible].
>>: [inaudible] optimization process [inaudible] because of prime changes that [inaudible] you end up [inaudible] gradients which is [inaudible] solution. >> Ankan Saha: This is the gradient method that I was talking about. You end up getting an order n log n complexity [inaudible] for calculating the gradient, and interestingly the same bookkeeping method actually allows you to calculate the smooth function value as well at every iteration. The entire complexity [inaudible] turns out to be order n log n. I just included this slide to show you the variety of the data sets that we were looking at. In particular, this DNA string data set is a monster data set and it has a very skewed label ratio, and there are some other very large data sets that we looked at as well; for example, kddb was really large, and OCR was very large as well. What we did is compare with the standard cutting plane algorithms, in particular bundle methods for regularized risk minimization, and we tried various standard algorithms after smoothing, but in particular the result that I'll show you is from applying L-BFGS after smoothing. We calculated our smooth loss and gradient evaluations using the PETSc and TAO libraries from Argonne National Labs; they help with faster matrix-vector multiplication and things like that. These are some of the curves that we have. The blue curve corresponds to our algorithm and the red curve corresponds to bundle methods; the top curves show basically how fast the regularized risk is decreasing, and this one corresponds to generalization performance. You can see that in terms of both, the blue curve is way ahead of the red curve in terms of convergence. We actually used 80% of the data for training and 20% for testing, so the generalization performance is significantly better than the bundle methods. That's mostly it for the ROC area problem. In the remainder of the talk I will try to show you how this particular smoothing technique can be applied to various other problems in… Yes, you have a question? >>: You said, you compared with L-BFGS [inaudible] with this [inaudible]? >> Ankan Saha: No. We actually smoothed the objective and ran L-BFGS as our optimization algorithm. >>: [inaudible] the blue is… >> Ankan Saha: Yeah, the blue is basically L-BFGS after applying smoothing [inaudible] smoothing. >>: Did you try [inaudible]? >> Ankan Saha: Yeah [inaudible], we tried that but that was not as fast as [inaudible]. It was still faster than the red line. It was still faster than [inaudible], other cutting plane methods in general, which is kind of surprising, because in various real-life experiments the bundle methods actually attain curves which are closer to order log one by epsilon than to order one by epsilon. Their theoretical upper bound is order one by epsilon, but on most data sets their performance is actually pretty close to log one by epsilon. Initially [inaudible] was conjecturing that it was probably a weakness in their bounds, but then we ended up showing lower bounds for bundle methods as well. That makes me wonder whether this is an artifact, I mean whether there is some kind of weakness in the smoothing analysis or something like that. These order one by square root epsilon rates are optimal among first-order methods.
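The sort-then-scan bookkeeping described a moment ago can be illustrated on a simpler, closely related pairwise quantity. The sketch below computes the sum of max(0, 1 - a_i + a_j) over all positive-negative score pairs in O(n log n) by sorting the negative scores once and using prefix sums, instead of the naive O(|P| * |N|) double loop; it is meant only as an analogue of the bookkeeping idea, not the exact median-based smoothed-gradient formula from the talk.

import numpy as np

def pairwise_hinge_sum(pos_scores, neg_scores, margin=1.0):
    # Sum of max(0, margin - a_i + a_j) over all positive/negative pairs.
    neg = np.sort(np.asarray(neg_scores, dtype=float))    # ascending order
    prefix = np.concatenate(([0.0], np.cumsum(neg)))      # prefix[k] = sum of k smallest
    total = 0.0
    for a in pos_scores:
        # Only negatives with score strictly above (a - margin) contribute.
        k = np.searchsorted(neg, a - margin, side="right")
        count = len(neg) - k
        total += (prefix[-1] - prefix[k]) + count * (margin - a)
    return total

# Sanity check against the naive double loop.
pos, neg = [2.0, 0.3], [1.5, -0.2]
naive = sum(max(0.0, 1.0 - a + b) for a in pos for b in neg)
assert abs(pairwise_hinge_sum(pos, neg) - naive) < 1e-12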
That optimality has been shown theoretically by Nemirovski, but on many real-life data sets these methods actually perform amazingly better than order one by square root epsilon; if you look at the curves, they are also close to log one by epsilon, so I don't know if it's an artifact, or whether there is some stronger analysis and what that analysis would say. >>: [inaudible] chosen? >>: It suggests the upper bound. >> Ankan Saha: Yes. >>: You have [inaudible] particular problem, this upper bound [inaudible]. >> Ankan Saha: Yes, but then of course the lower bound is also for specific problems that you handpick [inaudible]; of course, in general cases it might be better. I'll talk about one other problem that on the surface looks completely different: the problem of finding minimum enclosing convex shapes, in particular the minimum enclosing ball problem. The problem is very simple. You are given n points in d-dimensional space and you want to find the smallest ball which encloses all of these points. There are a lot of applications in data mining, machine learning, even statistics, and I was surprised to see--I reached this problem by looking at SVM solvers. There is something called the core vector machine, or ball vector machine, which came out around 2007. It was considered to be a very fast SVM solver; what it did was basically solve the dual of the minimum enclosing ball problem. The previous best algorithms for this are from the computational geometry community, and they have this scheme based on finding core sets which is--I won't go into the details of that, but it is a pretty constructive approximation scheme that gives you one-plus-epsilon multiplicative approximation guarantees in order nd by epsilon time, where n is the number of points, d is the dimension, and epsilon is the accuracy parameter. What we end up showing is that we can get to an epsilon approximation in order nd by square root epsilon time, using the same kind of smoothing techniques. To look at the particular problem: the minimum enclosing ball problem can be formulated in this simple way. You have this unknown radius that you want to minimize, and you don't know the center of the ball either, so these are your variables, and you want the distance of every point from the center to be less than or equal to the radius; it is as simple as that. If you want to minimize R squared subject to this constraint for all the x_i's, you might as well write it down as minimizing the maximum over all the x_i's of this term. If you open that out, it looks like this. This is a maximum over the x_i's lying in a set, so you can equivalently replace it by a maximum over a variable lying on the simplex, because the optimum over the simplex will actually lie at a corner, and you end up getting something like this. You can clearly see that this is the sum of a strongly convex function and exactly the dual formulation that I was talking about. Now your dual variable actually lies in the simplex, which is again a closed, compact space. Once you have this formulation you can basically smooth it out and apply Nesterov's accelerated scheme. In particular we use a variant of Nesterov's accelerated gradient scheme; we just call it the accelerated gradient scheme for this problem.
It is pretty similar, and the corresponding dual function looks like this; notice that the dual function is smooth over here, and in this case the duality gap is again given by this quantity. The entire rate is basically determined by how fast you shrink to zero the strong convexity that you are adding. Basically mu_k is going to zero at a quadratic rate over here, and that gives you the rate of convergence. In this case sigma is basically the strong convexity parameter of the strongly convex function that you are adding, script D is the upper bound on that small d function, and you can just think of L as the norm of A transpose. These are the various parameters that come in. The reason I kept all of these parameters in the rate is that these parameters are important. If you don't smooth in the proper way, these parameters will throw in extraneous dependence on the number of data points or the dimension or something like that and will result in suboptimal rates. It's important to choose the smoothing very carefully so that you end up getting the best possible results. The reason I mention this is that I'll give you an example of it. >>: [inaudible] bigger sigma than this [inaudible]? So you should gain by getting a larger sigma. >> Ankan Saha: Yeah, but if you have such a d which has a very large strong convexity parameter, then this [inaudible] might increase as well. We end up getting a bound on script D [inaudible]. So basically I'll give you two examples to illustrate this point. This small d function, which is a strongly convex function, is often referred to as a prox function in the literature, and depending on what prox function you use you might get different kinds of rates. Of course we want to add a strongly convex function, but the definition of a strongly convex function depends on the norm you endow the domain with. Suppose that currently my domain is just the simplex, and suppose I endow the simplex with the L2 norm; in that case the natural strongly convex function to add is the 2-norm squared, so suppose I had added the squared 2-norm distance from the center of the simplex. In that case the upper bound, script D, would have been sigma by 2, but the Lipschitz constant of the dual will now have a dependence on the maximum eigenvalue of the A A-transpose matrix, and in general, if your A is an n by d matrix, this can have a dependence on the number of data points. In particular, the convergence bound that you get will depend on this script L, which will have a dependence on n in the worst case. Whereas, if you endow the simplex with the L1 norm and choose your prox as the entropy function, which is strongly convex with respect to the L1 norm on the simplex, the upper bound now incurs only a logarithmic dependence on the number of data points, and this Lipschitz constant becomes independent of the number of data points; it is just upper bounded by the maximum norm of the individual data points. In this case you basically end up with just a logarithmic dependence on the number of data points, which gives us the order one by square root epsilon rates, so this is exactly the bound that we get for our L [inaudible]. This will actually converge to epsilon accuracy in order one by square root epsilon time.
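To make the entropy-prox choice concrete, here is a minimal sketch of what that smoothing does to a plain max over the simplex: subtracting mu times the entropy prox inside the max yields a mu-scaled log-sum-exp, and the "script D equals log n" bound shows up as the width of the sandwich. The function name and the numerical check are illustrative only.

import numpy as np
from scipy.special import logsumexp

def entropy_smoothed_max(z, mu):
    # Smooth surrogate for max_i z_i obtained by subtracting the entropy prox
    # d(u) = log(n) + sum_i u_i log u_i (which satisfies 0 <= d <= log n on the
    # simplex) inside the max:
    #     max_u <z, u> - mu * d(u)  =  mu * logsumexp(z / mu) - mu * log(n),
    # which is differentiable and sandwiched as
    #     max(z) - mu * log(n)  <=  smoothed  <=  max(z).
    z = np.asarray(z, dtype=float)
    return mu * logsumexp(z / mu) - mu * np.log(len(z))

z = np.array([0.3, 2.0, 1.9, -1.0])
for mu in (1.0, 0.1, 0.01):
    s = entropy_smoothed_max(z, mu)
    assert z.max() - mu * np.log(len(z)) <= s <= z.max() + 1e-12
    print(mu, s)    # approaches max(z) = 2.0 as mu goes to 0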
As I mentioned, the problem of minimum enclosing ball can actually be applied to various other problems; in particular, it can be applied to the problem of optimizing support vector machines. This is the simple objective of support vector machines: you have the hinge loss plus a regularizer. SVMs are often solved in the dual in the existing literature, and what was done around 2005 and 2007 was that people noticed that if you take the dual of the L2 SVM, where you are actually working with the hinge loss squared, the dual of that particular problem looks exactly like the dual of the minimum enclosing ball problem. In that case people actually took the existing algorithms for minimum enclosing ball and just used them to solve the SVM dual, and they noticed that this gives you fast convergence in real life when you have large data sets. I think at one point in time core vector machines were the fastest SVM solvers in practice; that was around '07 or so. They were using the core-set algorithm, so they were getting order one by epsilon rates of convergence. If you use our scheme and just plug it in as a subroutine for optimizing this, you end up getting order one by square root epsilon rates of convergence. That is for core vector machines, and one thing that might be of concern is that what we are doing here corresponds to the L2 SVM, so basically the hinge loss squared. It turns out that you can even solve the standard SVM, basically the L1 SVM, where you just have the hinge loss, using this Nesterov-style smoothing. To give you some background, previous state-of-the-art batch solvers, basically bundle methods, also solve the L1 SVM optimization problem, and they give you order nd by lambda epsilon rates. You can also use stochastic methods like stochastic gradient descent or Pegasos to solve the L1 SVM problem. They are actually independent of the number of data points, but they will give you order d by lambda epsilon rates of convergence. What we did was write down the L1 SVM objective, again as a strongly convex function plus the hinge loss written as the dual of a smooth function. It turns out that the original function g is as simple as g of alpha equals the sum of the alpha_i's, and if you take the dual of that you end up getting this [inaudible]. In this case the transform A is basically the product of Y times the X matrix, where X is the data points stacked up in a matrix. Since your SVM can be written down in this formulation, you can just use Nesterov's smoothing and then use any of the accelerated gradient descent schemes, or even the proximal gradient descent scheme, which is another variant of Nesterov's methods. They will both give you order n [inaudible] square root epsilon rates of convergence. Yes? >>: [inaudible] generalization [inaudible] epsilon yet then n the chosen [inaudible] something about [inaudible]. [inaudible] and then compute the running time that you need and get the same for [inaudible] and what happens there and that will be the correct [inaudible]. >> Ankan Saha: Yeah, so we actually had this around 2009, but it is hard to put this into perspective with Pegasos, so what we did was compare with a batch version of Pegasos, and we beat the batch version of Pegasos straightaway, but then everybody was asking why we were comparing with a batch version of Pegasos.
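Since the hinge loss can be written as max(0, m) = max over alpha in [0, 1] of alpha * m, with m = 1 - y <w, x>, a standard way to see this smoothing is to subtract a quadratic prox mu * alpha^2 / 2 inside that max. The closed form below (often called a Huberized or smoothed hinge) is a common textbook instance of that construction, shown here as an assumed illustration rather than the exact prox used in the talk's experiments.

import numpy as np

def smoothed_hinge(margin, mu=0.1):
    # Nesterov-style smoothing of hinge(m) = max_{a in [0,1]} a * m:
    #     0                if m <= 0
    #     m^2 / (2 * mu)   if 0 < m < mu
    #     m - mu / 2       if m >= mu
    # It is differentiable and satisfies
    #     smoothed(m) <= hinge(m) <= smoothed(m) + mu / 2.
    m = np.asarray(margin, dtype=float)
    return np.where(m <= 0, 0.0,
                    np.where(m >= mu, m - mu / 2.0, m * m / (2.0 * mu)))

def smoothed_hinge_grad(margin, mu=0.1):
    # Derivative with respect to the margin m; it equals the unique
    # maximizing alpha, clipped to [0, 1].
    m = np.asarray(margin, dtype=float)
    return np.clip(m / mu, 0.0, 1.0)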
Pegasos is supposed to be an online algorithm and our method is strictly a batch algorithm, so comparing it with the Pegasos algorithm, which chooses one point at every iteration, was like comparing apples and oranges. So the point that you made is exactly what we did after that, and it turns out that, in terms of experiments, our method is actually in the same league as Pegasos. I'll show you quite a few algorithms in the next slide that we compared with. These are a bunch of algorithms that we compared with. Batch Pegasos is basically the red line that we have. This is showing the primal function value versus the number of iterations required to converge. Pegasos is basically the red line and ours is the blue line, so we did considerably better than the batch version of Pegasos, but there is this algorithm, LIBLINEAR, which you cannot even see over here; it is somewhere over here, and it is way faster than anything else, way faster than our method, way faster than any of the existing batch solvers that we tried to compare with. It basically uses something like an implicit method in the dual; this is the LIBLINEAR algorithm by [inaudible]. What they do is look at the dual and update one coordinate at a time, so basically they do coordinate descent in the dual and make the corresponding updates in the primal. It turns out they are way faster than any of the other algorithms, but our objective here was to get faster rates than all of the standard one by epsilon algorithms existing in the literature. So yes, that is as far as this is concerned, and these are some other plots, basically showing how fast the primal function decreases in terms of time, in seconds. One thing that should be noted is that our method over here is bounding the duality gap, whereas all of the other algorithms that we compare with are considered dual algorithms, so they will give you a bound on D(alpha*) minus D(alpha_k), where alpha* is chosen according to some ground truth, and then they just take that alpha_k and get a w_k corresponding to it using the formula that is true only at the optimum. Theoretically, there is no bound to say how far that w_k actually is from w*. In experiments they perform reasonably well, pretty well, so that is what is generally taken as ground truth, but bounding the duality gap is actually an advantage we have, both theoretically and practically, because the distance to the optimum is sandwiched in there. I am pretty close to the end of my talk. Another application where I actually applied these things is the problem of finding the minimum enclosing convex polytope, which is again a computational geometry problem: you have a set of points and you have been given a fixed polytope, and you want to find the minimum magnification of the polytope that is required to enclose those points. It turns out this problem also shows up in certain active learning problems; it is actually used by Dan Roth in active learning algorithms for, basically, active-learning SVM problems, and you can convert it into this convex [inaudible] problem.
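For reference, the "update one dual coordinate at a time and mirror it in the primal" idea attributed to LIBLINEAR above can be sketched as follows for the L1-loss SVM. This follows the well-known dual coordinate descent update (a single-variable step projected back onto [0, C]) and is a simplified illustration, not LIBLINEAR's heavily optimized implementation, shrinking heuristics, or caching.

import numpy as np

def dual_coordinate_descent_svm(X, y, C=1.0, epochs=10):
    # Dual coordinate descent for min_w 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>),
    # maintaining w = sum_i alpha_i * y_i * x_i while updating one alpha_i at a time.
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    q_diag = np.sum(X * X, axis=1)            # Q_ii = <x_i, x_i>
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if q_diag[i] == 0.0:
                continue
            grad = y[i] * (w @ X[i]) - 1.0    # partial derivative of the dual objective
            new_alpha = min(max(alpha[i] - grad / q_diag[i], 0.0), C)
            w += (new_alpha - alpha[i]) * y[i] * X[i]   # keep the primal w in sync
            alpha[i] = new_alpha
    return w

# Toy usage on a nearly separable problem.
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = np.sign(X[:, 0] + 0.1 * rng.randn(100))
w = dual_coordinate_descent_svm(X, y)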
And our methods can also be applied to other max-margin problems. In particular, if you look at structured SVMs, our methods go through there as well--calculating all of the marginals and sub-marginals can be done efficiently in our setting--and we end up getting better rates: the state-of-the-art methods for structured SVMs, due to Michael Collins and some of his coauthors, are again order one by epsilon, and we beat that and got order one by square root epsilon rates. There is also the problem of finding the polytope distance, which is basically looking at two polytopes and finding the distance between them, and that is another place where we can apply our algorithms. Basically, all of these algorithms previously had rates of order one by epsilon to get epsilon close to the optimum; we improve all of them to order one by square root epsilon rates. In conclusion, what we showed is that we can get improved rates for various machine learning objectives by using this particular style of smoothing due to Nesterov. The key message is that you need not necessarily use Nesterov's algorithm itself; the key thing is the particular kind of smoothing that you use. Once you smooth an objective, you can throw any good solver at it and get very good experimental convergence in practice, and this is true for very large data sets as well, so it is often very relevant. As part of future work we want to see if appropriate smoothing can mimic the performance of second-order methods. Since we are already using L-BFGS, we basically ask: if we smooth in this particular way, can we get performance as good as a second-order solver without incurring the cost of calculating the Hessian, just by doing the smoothing and then applying something that is cheaper to compute at every iteration? Another direction is applying these smoothing techniques to more complicated measures, like ranking measures such as NDCG and precision@k, or even the F1 score. The smoothing methods that I talked about can currently be applied efficiently only to the ROC area and the precision-recall breakeven point. It turns out you have to handpick the smoothing technique for each of them; there is no general smoothing technique that applies across the board. So far, even for something as simple as the F1 score, we don't have a smoothing technique that gives you better rates. That is also potential future work right now. Yeah, so that's the end of the talk. [applause]. >> Ankan Saha: Any questions? >>: On the last experiment you showed that where the [inaudible] fast [inaudible] so since you already… >> Ankan Saha: [inaudible] linear. >>: Since you already smoothed [inaudible] here lower batch, did you try the L-BFGS on that one? >> Ankan Saha: We tried L-BFGS on that. It was much faster than our algorithm. So this graph that I showed, the blue line, is due to Nesterov's algorithm; it is not due to L-BFGS. It comes closer to LIBLINEAR, but LIBLINEAR is still faster, and that is because the LIBLINEAR code that is in the repositories is very heavily optimized code. I mean it uses a lot of caching and stuff like that; it has like a separate cache to do something and so on. I'm not exactly sure why LIBLINEAR is so fast.
It says that theoretically it should be log one by epsilon. >>: [inaudible]. [inaudible] gradient just to blend the volume [inaudible]. >>: [inaudible] goes faster than the [inaudible]. >>: And the average [inaudible]. >>: Yes, yes [inaudible]. >>: [inaudible] John performed this. >>: [inaudible] call him John [inaudible]. [laughter]. >>: [inaudible] after he ran the L-BFGS [inaudible]. >>: Are you doing something different from Pegasos or… >> Ankan Saha: I don't waste my time looking for [inaudible]. [inaudible] each [inaudible]. >>: [inaudible]. >>: What's the power [inaudible] and learning [inaudible]? >>: [inaudible] the same. It's just a starting point for those. >> Ankan Saha: Yeah. We did not compare with [inaudible]. >>: I hope you have [inaudible] here. There is still epsilon where the… >>: [inaudible] if you want. It still [inaudible] direction [inaudible]. >>: So I think it's square root epsilon [inaudible]. >>: Square root of epsilon [inaudible]. >>: Sorry, epsilon squared. >>: [inaudible]. >> Ankan Saha: [inaudible] for experiments we tried but it is [inaudible] impossible [inaudible], so basically we would start with [inaudible] epsilon by [inaudible] and then take 10 times that, 10 squared times that, times one [inaudible]. >>: I think the reason why [inaudible] to be epsilon [inaudible] is you want to prove the rate and [inaudible]. When you apply the L-BFGS, I suspect that if you are starting with [inaudible] mu, then gradually you increase that. >> Ankan Saha: But there is a sweet spot. If you keep [inaudible] increasing it, at some point it becomes worse, so basically I think it's an artifact of the epsilon that you want to go to, how close you want to go to the optimal solution, and it's not necessarily a well-defined thing. If you work with 500 points and you work with a particular epsilon, the mu that you need to set is often different than when you are working with 1 million points. You know what I mean? >> Sumit Gulwani: Any more questions? >> Ankan Saha: Thanks. [applause].