>> Lin Xiao: In this talk I'm going to present a numerical optimization algorithm for solving the sparse least-squares problem. It's joint work with Tong Zhang from Rutgers University. This is the sparse least-squares problem: the L1-regularized least-squares problem. It has the classical least-squares objective plus an L1 regularization term. Here A and b are the problem data, and lambda is the regularization parameter.

Let's look at a concrete example, sparse recovery. We have seen lots of talks on this, which is good for me; I don't have to explain too much. You have a signal x which is sparse, and the dimension N can be very big. You make M measurements, and here we focus on the case M < N. Each measurement is linear and corrupted by noise z. That's the basic observation model. After observing b and knowing A, you want to recover x. Knowing that you want a sparse x, the convex optimization approach is simply to use L1 regularization. There are lots of applications: in machine learning you can think of regression with feature selection, where you want a small number of features; in signal and image processing, compression and denoising; in statistics, problems like gene identification. Recently this problem has received a lot of attention because of compressed sensing theory.

A lot of work addresses the properties of the optimal solution to this problem: if you have the optimal solution, how well does it recover the signal, and under what conditions. But this is an optimization talk: how fast can you solve this problem? That's the focus.

Let's make two assumptions. First, M < N; this is the compressed-sensing setting, where you have fewer observations than unknowns. Second, the solution is sparse; otherwise there is no point in using L1 regularization, so we assume lambda is big enough that you get a sparse solution.

Now let's look at the current numerical optimization algorithms for solving this problem. Here are at least three state-of-the-art approaches. This column is the cost per iteration of each algorithm, and this is the number of iterations they need to make the objective of the sparse least-squares problem epsilon-close to the optimum. You can see that the second-order method is good in terms of iteration complexity, but each iteration costs a lot, especially when M and N are big. Actually, the bound shown is only what the theory can prove; in practice you can replace that iteration count with a constant, say 100, and you are done. So it is really good in that sense. Of course, for large-scale problems maybe the only viable approach is first-order gradient methods, where each iteration costs two matrix-vector multiplications. The iteration complexity of the classical proximal gradient method, which I'll explain in the next slides, is O(1/epsilon), and using acceleration techniques you get O(1/sqrt(epsilon)).

In this talk, for this particular problem, under some standard assumptions from sparse recovery, because that is the context in which we study it, our algorithm basically combines the low per-iteration cost of first-order methods with the iteration complexity of second-order methods, so it is uniformly better than the previous algorithms. Before getting into the details, I will first discuss some background on convex optimization. Hopefully it's not too boring.
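For reference before the background, the problem and the measurement model described above can be written as follows; the 1/2 scaling of the quadratic term is a conventional choice assumed here rather than taken from the slides.

```latex
\min_{x \in \mathbb{R}^N} \; \tfrac{1}{2}\,\|Ax - b\|_2^2 + \lambda \|x\|_1,
\qquad b = A\bar{x} + z, \qquad A \in \mathbb{R}^{M \times N},\ M < N,
```

where \bar{x} is the sparse signal to be recovered and z is the measurement noise.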
If I say a function is convex, its graph has a supporting hyperplane at every point, given by the gradient; of course, the function could be nondifferentiable, in which case there are multiple supporting hyperplanes. If I say a function is smooth, I mean the blue curve, a quadratic with parameter L, is an upper bound of the function. For example, you can take L to be the largest eigenvalue of the Hessian if your function is twice differentiable. That's smoothness. The other notion is strong convexity. I should mention why the first property is called smoothness: a smooth function cannot have nondifferentiable kinks like this one, because at a kink there is no quadratic upper bound at that point. For strong convexity you have a quadratic lower bound, with parameter mu. This is the basic terminology I will use.

Let's consider a more general setup, called composite optimization: the objective has two parts, both convex of course. One is the smooth part, and the other can be a nonsmooth part like the L1 regularization. For this kind of problem we use so-called proximal gradient methods. You start with some initial point and iterate. At each step you compute the gradient of the smooth part, which gives you a quadratic approximation, indeed a quadratic upper bound, of that part, and you keep the nonsmooth part without any change. This is just an approximation of your objective function, but the good thing is that if the nonsmooth part Psi is simple enough, you can minimize this approximation in closed form, so each step is very cheap. For example, if Psi = 0, this just gives you gradient descent. For the L1 case, the solution is given by so-called iterative soft-thresholding: first pretend you are going to do a gradient step with step size 1/L, then pass each component of the resulting vector through the soft-thresholding operator. If a component's magnitude is larger than the threshold alpha, you shrink it toward zero by alpha; if its magnitude is less than alpha, you set it to zero. So you can see that if lambda, and hence alpha, is big enough, you get lots of zeros in your solution; that is why the method produces sparse solutions. That's the intuition behind the method, though it is not the most important thing for this talk.

Let's look at the computational complexity. Under the smoothness assumption, the iteration complexity for finding an epsilon-optimal solution with the proximal gradient method I just described is O(1/epsilon), and for the accelerated version it is O(1/sqrt(epsilon)). Indeed the latter is called an optimal method, meaning no first-order method can perform better on this class of problems. That is under smoothness alone. Now let's see what happens if we add another assumption: f is not only smooth but also strongly convex, meaning it has not only a quadratic upper bound but also a quadratic lower bound, with mu strictly greater than zero. In this case the same algorithm, without any change, gives you a logarithmic iteration complexity, that is, a geometric or linear rate of convergence, which is much, much faster. The optimal method again makes the constant smaller: L/mu is what we call the condition number, and the optimal method depends only on its square root. Again, optimal means no other first-order method can do better; this is already the best you can get.
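To make the method concrete, here is a minimal sketch of the proximal gradient (iterative soft-thresholding) iteration for the L1-regularized least-squares problem, using a fixed step size 1/L with L the largest eigenvalue of A^T A as described above. The function names and the fixed iteration count are illustrative choices, not taken from the talk.

```python
import numpy as np

def soft_threshold(v, alpha):
    # Componentwise shrinkage: reduce each entry's magnitude by alpha,
    # and set entries with magnitude below alpha exactly to zero.
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def proximal_gradient(A, b, lam, num_iters=1000, x0=None):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by iterative soft-thresholding."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    L = np.linalg.norm(A, 2) ** 2        # largest eigenvalue of A^T A (upper-bound parameter)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)         # gradient of the smooth least-squares part
        x = soft_threshold(x - grad / L, lam / L)  # gradient step, then prox of lam*||.||_1
    return x
```

Note that with Psi = 0 the same scheme reduces to plain gradient descent, since soft-thresholding with alpha = 0 is the identity.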
Okay, now we have enough background. Let's look at this L1 least-squares problem. As I said before, we focus on the case M < N, so A is a short, fat matrix. f is the least-squares term, and you can check that its Hessian is just A^T A. The function is certainly smooth, since it is just a quadratic: the largest eigenvalue of A^T A gives you the parameter L for the quadratic upper bound, so the smoothness side is fine. But it is not strongly convex, because the minimum eigenvalue of A^T A is zero: the matrix is N by N but has rank at most M. So the theory only promises a sublinear rate, O(1/epsilon) or O(1/sqrt(epsilon)) depending on whether you accelerate or not. That's the verdict if you only look at the function class.

Now let's look at what happens in reality. Here is a randomly generated example with M = 1,000 and N = 5,000, and here are the proximal gradient method and two different versions of the accelerated gradient method. The vertical axis is the optimality gap, how far the objective is from the optimum, on a log scale, and the horizontal axis is the number of iterations. In the beginning the proximal gradient method is kind of slow; on this log scale that is a sublinear rate. The accelerated gradient method is faster in the beginning. If you stopped here you would say the accelerated gradient method is the best. But be patient and run longer: suddenly there is a dramatic change, very fast geometric convergence, once you get close to the optimal point. The accelerated method, on the other hand, doesn't have this kind of phase.

So you might wonder what happened. Let's look at the sparsity pattern, the number of nonzeros, of the iterates. The vertical axis is the number of nonzeros and the horizontal axis is the iteration count. As you can see, the fast convergence happens when the number of nonzeros is very small, around 100 or 200; that is when you get really fast convergence. Although the accelerated gradient method reaches that sparsity level much earlier, it doesn't benefit as much: for this red line you can see that when it hits the sparse region it actually slows down. I still don't quite understand that one.

But let's focus on the proximal gradient method. The observation is that it has slow global convergence but fast local convergence once the iterates are both sparse and close to the optimum. Sparsity alone is not enough: we start the algorithm from zero, which is very sparse but far from the optimum, and we don't get fast convergence there.

Let's look at the reason. Assume your solution is really sparse. Without loss of generality you can partition the vector into its nonzero part and its zero part, and partition the matrix A accordingly, so only the columns corresponding to the nonzero part matter, and the Hessian restricted to that block looks like this. If you restrict the problem to that subspace, two things happen. First, you get local strong convexity: the minimum eigenvalue of the restricted Hessian can be larger than zero. Second, the upper-bound parameter L gets much smaller, so the step size 1/L is larger and you can move much faster. Minimizing the restricted problem is equivalent to the original one, so if you had known the subspace ahead of time, you could solve the problem very fast.
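In symbols, the restriction argument above is roughly the following; this is a reconstruction rather than the slide itself, with S denoting the support of the sparse solution and A_S the corresponding columns of A.

```latex
\nabla^2 f(x) = A^\top A =
\begin{pmatrix}
A_S^\top A_S & A_S^\top A_{S^c} \\
A_{S^c}^\top A_S & A_{S^c}^\top A_{S^c}
\end{pmatrix},
\qquad
\mu_S = \lambda_{\min}(A_S^\top A_S) > 0,
\qquad
L_S = \lambda_{\max}(A_S^\top A_S) \le L,
```

so on vectors supported on S the objective is strongly convex, with a much smaller condition number L_S / \mu_S.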
But unfortunately you do not know the subspace ahead of time, so you need a way to adaptively home in on it. Here is the key idea, based on these two observations: homotopy continuation. Homotopy continuation is a general technique: you have a class of problems indexed by a single parameter, and the instance you actually want to solve is hard to attack directly. But at one end of the parameter range the problem is very easy. You solve that easy one, change the parameter a little, use the previous solution to initialize your numerical algorithm for the next instance, which is then fast to solve, and keep going; the overall complexity can be much smaller. The point here is that you want to always stay in the sparse regime with fast local convergence. For this L1 least-squares problem, if the regularization lambda is large enough, above a threshold that is easily computed from A and b, the solution is exactly zero. Then you reduce lambda a tiny bit; the solution stays very sparse, and starting from the previous solution the proximal gradient method gets there in maybe a few steps, so you stay in the fast regime of the algorithm. You keep reducing lambda a little at a time until you reach the target value and converge. The hope is that the overall complexity is much smaller.

This is not a new idea. There is existing work using similar techniques, and superior performance has been reported, but there was no global complexity analysis. To provide one, you need to answer two questions: how fast do you decrease lambda, and, for each lambda, how accurately do you solve the subproblem — to full precision, or just roughly?

Here is the proximal-gradient homotopy method. Based on that idea, it has two parameters, both between zero and one. You initialize lambda large enough that the solution is zero, and this is the number of stages you need to run. At each stage you reduce lambda by the constant factor eta, and you solve the subproblem only to a very low precision, proportional to the current regularization parameter. Only at the last stage do you require the high target precision epsilon, and by then you are already in the geometric-convergence zone, so you are fast.

Let's look at the same example as before. Here is the previous result for plain proximal gradient, and the black line is the homotopy method. As I said, each intermediate stage only needs very loose precision, so it takes maybe two or three iterations, and you enter the geometric phase much earlier; it happens here. So you get very, very fast performance in practice.
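Here is a minimal sketch of the homotopy continuation scheme just described, reusing the proximal_gradient routine from the earlier sketch. For simplicity each intermediate stage is run for a fixed, small iteration budget instead of the precision test proportional to lambda used in the analysis; the default eta and the other names are illustrative.

```python
import numpy as np

def proximal_gradient_homotopy(A, b, lam_target, eta=0.7,
                               inner_iters=50, final_iters=2000):
    """Solve 0.5*||Ax - b||^2 + lam_target*||x||_1 by decreasing lambda geometrically.

    Assumes the proximal_gradient routine sketched earlier is in scope.
    eta in (0, 1) is the factor by which lambda shrinks at each stage;
    intermediate stages are solved only roughly, the final stage accurately.
    """
    x = np.zeros(A.shape[1])
    lam = np.max(np.abs(A.T @ b))        # for lambda at or above this value, zero is optimal
    while lam * eta > lam_target:
        lam *= eta                       # reduce the regularization by a constant factor
        x = proximal_gradient(A, b, lam, num_iters=inner_iters, x0=x)  # warm start
    # Final stage: solve at the target lambda to the desired accuracy.
    return proximal_gradient(A, b, lam_target, num_iters=final_iters, x0=x)
```

The warm start is what keeps every stage in the sparse regime, so the iterates never stray far from the current sparse solution.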
Okay, now that we have seen all the ideas, I want to give a little bit of the technical reasoning behind them. We define the restricted eigenvalues: the largest and smallest eigenvalues of A^T A restricted to vectors whose support size is no larger than s, where s is less than N. The hope is that if s is small enough and the matrix A behaves well, the minimum restricted eigenvalue is larger than zero. Recall the previous picture: these are just the restricted eigenvalues, and the important thing is that the smallest one should be larger than zero, which gives you a form of strong convexity. Precisely, if the union of the supports of two vectors x and y is no larger than a sparsity level s satisfying the restricted eigenvalue condition, then you get restricted smoothness, a quadratic upper bound, and restricted strong convexity, a quadratic lower bound. If the lower-bound parameter is larger than zero, you have strong convexity and fast convergence, and the ratio gives the restricted condition number at that sparsity level.

Our convergence analysis basically assumes M < N and that the target lambda is sufficiently large, which means all the earlier lambdas are sufficiently large too, and then establishes local convergence: if you start from a sparse point close to the optimum, the proximal gradient method converges geometrically. The details are in the paper, so I'm not going to spell them out. For the overall complexity, delta and eta are both in (0, 1) and satisfy a simple condition; the delta-prime there actually comes from the restricted eigenvalue condition, and I won't go into the details. Then every stage of the proximal-gradient homotopy method converges geometrically, and the overall complexity is M times N per iteration, for the matrix-vector multiplications, times a number of iterations on the order of the restricted condition number times log(1/epsilon).

Let's summarize. These are the computational complexities for solving the sparse least-squares problem, the current state of the art we discussed before, and in terms of both cost per iteration and iteration complexity our method matches the best among the different numerical algorithms. Of course, we are exploiting some special structure of this problem: the least-squares form and the restricted eigenvalue conditions on the matrix A. I believe the numerical algorithm should work for more general functions, and we have experimented with that and it worked well, but it is much harder to state the analogue of the restricted eigenvalue conditions for general convex functions.

Here are some extensions. As we have seen before, using acceleration techniques it is possible to reduce this factor to the square root of the condition number. We just finished this work with a collaborator from CMU this summer. Unfortunately there is a hidden log factor, log of the condition number, because the accelerated gradient method needs to know mu, which in this case we just don't know; we need an adaptive process to estimate mu, and that costs you the log factor. There may also be extensions to fast low-rank matrix recovery, as Marian talked about; Marian and his students are currently working on this. Okay. Thank you.

[applause]

>> Ofer Dekel: Any questions for the speaker?

>>: Now, in addition to lambda, you have to choose eta as well.

>> Lin Xiao: Which one?

>>: You have to choose the parameter for how fast to decrease lambda, right? So you have to run multiple rounds.

>> Lin Xiao: Yeah, but let me put it this way. That is the theoretical analysis. In practice we ran experiments over a wide range of eta, and the choices that give the fastest performance often violate the condition from our restricted eigenvalue analysis. The reason is that for the homotopy algorithm to work, the most important thing is to make sure you enter the final, high-precision stage already in the locally restricted strongly convex regime. Before that, every stage is solved to very low precision, and in that regime it is not clear that a linear rate is faster than the sublinear rate of the plain gradient algorithm. So in practice the precise choice is not important at all; what matters is entering the final stage in the local regime. In practice the range of acceptable parameter choices is much, much wider than what the theory predicts.
>> Ofer Dekel: Any further questions? Okay. Thank you, Lin. [applause]