>> Lin Xiao: For this, I'm going to talk about a numerical optimization
algorithm for solving the sparse least-squares problem. It's joint work with
Tong Zhang from Rutgers University. So this is the sparse least-squares
problem, the L1-regularized least-squares problem: it has the classical
least-squares objective and also an L1 regularization term. Here A and b are
the problem data, and lambda is the regularization parameter. Let's look at a
concrete example, sparse recovery. We have seen lots of talks on this, which
is good for me; I don't have to explain too much.
So you have a signal x, which is sparse. The dimension n can be very big, and
you make m measurements. Here we focus on the case m less than n. Each
observation is a linear measurement corrupted by noise z. So this is the basic
observation model. After observing b and knowing A, you want to recover x.
Of course, knowing that you want a sparse x, the convex optimization approach
is just to use L1 regularization.
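To make the setup concrete, here is a minimal sketch of this observation model and objective in Python; the dimensions, noise level, and value of lambda are illustrative choices, not from the talk.

```python
# Sparse recovery setup: b = A x + z with sparse x, then the L1-regularized
# least-squares objective 0.5*||A x - b||^2 + lambda*||x||_1.
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 500, 10                      # m measurements, n-dim signal, k nonzeros (illustrative)
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
b = A @ x_true + 0.01 * rng.standard_normal(m)   # noisy linear measurements

lam = 0.1                                    # regularization parameter lambda

def objective(x):
    """L1-regularized least-squares objective."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
```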
And it has lots of applications. For example, in machine learning you can
think about regression with feature selection, where you want a small number of
features; in signal processing, image compression and denoising; in statistics,
things like gene identification. And recently it has received lots of
attention because of compressed sensing theory. So there is lots of work
addressing the properties of the optimal solution to this problem.
That is, if you have the optimal solution to this problem, how well does it
recover the signal, and under what conditions can you recover it? But this is
an optimization talk: I'm going to talk about how fast you can solve this
problem. That is the focus of this talk.
Okay. Let's state the assumptions. Of course m is less than n; this is
compressed sensing, so you have fewer observations than unknowns. And we
assume the solution is sparse; otherwise there's no point in using L1
regularization. We assume lambda is big enough that you get a sparse solution.
So let's look at the current numerical optimization algorithms for solving this
problem. Here are at least three state-of-the-art algorithms, with the cost
per iteration of each algorithm and the number of iterations they need to get
epsilon-optimal, meaning the objective of the sparse least-squares problem is
within epsilon of the optimum.
And you can see that the first method is very good in terms of iteration
complexity, but each iteration costs a lot, especially when m and n are big.
Actually, that iteration bound is only what the theory can prove; in practice
you can substitute this complexity with a constant, say 100, and that's done.
So this is really good.
Of course, for large-scale problems maybe the only viable approach is
first-order gradient methods, where each iteration costs a matrix-vector
multiplication. The iteration complexity of the classical proximal gradient
method, which I'll explain in the next slides, is one over epsilon, and using
acceleration techniques you get one over the square root of epsilon.
So in this talk, for this particular problem, under some standard assumptions
from sparse recovery, because that is the context in which we study this
problem, our algorithm basically combines the advantages: the low per-iteration
cost of first-order methods and the logarithmic iteration complexity of
second-order methods.
So it's uniformly better than the previous algorithms. Okay. Before getting to
the details, I will first discuss some background on convex optimization.
Hopefully it's not too boring. If I say a function is convex, its graph has a
supporting hyperplane given by the gradient. Of course, it could be
nondifferentiable at some points, with multiple supporting hyperplanes.
If I say a function is a smooth convex function, I mean that the blue curve
here is a quadratic upper bound of the function, with parameter L. For
example, you can take L to be the largest eigenvalue of the Hessian of your
function, if your function is twice differentiable. So this is smooth.
And another one is strongly convex. But first I should mention why the
previous one is called smooth: it's because the function cannot have
nondifferentiable kinks like this. If you imagine a kink, then at that point
no quadratic would be an upper bound.
So for strongly convex functions you have a quadratic lower bound, with
parameter mu. That is the basic terminology I will use.
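Written out, the two definitions just described are the standard inequalities for a differentiable convex f (a standard formulation, not copied from the slides):

```latex
% smooth with parameter L: quadratic upper bound
f(y) \le f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\,\|y - x\|_2^2
% strongly convex with parameter \mu > 0: quadratic lower bound
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{\mu}{2}\,\|y - x\|_2^2
```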
Now let's consider a more general setup, called composite optimization, where
the objective function has two parts, both convex of course: a smooth part and
a possibly nonsmooth part, like the L1 regularization. For this kind of
problem we use the so-called proximal gradient method. Basically, you start
with some initial point and run an iterative algorithm. At each step you
calculate the gradient of the smooth part, and this basically gives you a
quadratic approximation; indeed, it's a quadratic upper bound of the function.
And then you just keep the nonsmooth part here without any change. This is
just an approximation of your objective function, but the good thing is that if
the nonsmooth part is simple enough, you can solve this subproblem in closed
form, so it's very easy. For example, if the nonsmooth part equals 0, this
just gives you gradient descent. And for the L1 case, the solution is given by
so-called iterative soft-thresholding, or shrinkage. Basically, you first
pretend that you're going to do a gradient step with step size 1 over L, and
then you pass each component of the resulting vector through the
soft-thresholding operator: you shrink its magnitude by alpha if it's larger
than alpha, and if it's within, less than alpha, you set it to 0. So you can
see that if lambda is big, or alpha is big, you get lots of zeroes in your
solution; that's why it produces sparse solutions. So that's the intuition of
this method.
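As a rough sketch of what was just described (reusing the A, b, and lam defined in the earlier snippet; the helper names are mine), one proximal gradient step for the L1 problem looks like this:

```python
# One proximal gradient (iterative soft-thresholding) step for the L1 problem.
import numpy as np

def soft_threshold(v, alpha):
    """Shrink each component of v toward zero by alpha; components with |v_i| <= alpha become 0."""
    return np.sign(v) * np.maximum(np.abs(v) - alpha, 0.0)

def prox_grad_step(x, A, b, lam, L):
    """Gradient step on the smooth part with step size 1/L, then soft-threshold with alpha = lam/L."""
    grad = A.T @ (A @ x - b)                 # gradient of 0.5*||A x - b||^2
    return soft_threshold(x - grad / L, lam / L)

# Example usage (L is the smoothness constant, the largest eigenvalue of A^T A):
# L = np.linalg.norm(A, 2) ** 2
# x = prox_grad_step(np.zeros(A.shape[1]), A, b, lam, L)
```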
But this is not very important for this talk; let's just look at the
computational complexity.
With this smoothness assumption, the iteration complexity for finding an
epsilon-optimal solution with the proximal gradient method I just described is
on the order of one over epsilon, and for the accelerated version it is one
over the square root of epsilon.
Indeed, the latter is called an optimal method, meaning that no first-order
method can perform better on this class of problems. So that's under the
smoothness assumption alone. Now let's see what happens if we add another
assumption: f is not only smooth but also strongly convex, meaning it has not
only a quadratic upper bound but also a quadratic lower bound, and here mu
should be strictly larger than 0. In this case, the same algorithm, without
any change, gives you this logarithmic dependence on epsilon. So it's a
geometric rate, or you can call it a linear rate. It's much, much faster.
And the optimal method again makes this constant smaller: here L over mu is
what we call the condition number, and the optimal method depends on the square
root of it. Again, optimal means no other first-order method can be better;
this is already the best you can get.
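For reference, these are the standard first-order iteration counts being quoted, with L the smoothness parameter and mu the strong convexity parameter:

```latex
% smooth only (sublinear rates)
O\!\left(\tfrac{L}{\epsilon}\right) \text{ for proximal gradient}, \qquad
O\!\left(\sqrt{\tfrac{L}{\epsilon}}\right) \text{ for the accelerated method}
% smooth and strongly convex (geometric / linear rates)
O\!\left(\tfrac{L}{\mu}\log\tfrac{1}{\epsilon}\right), \qquad
O\!\left(\sqrt{\tfrac{L}{\mu}}\,\log\tfrac{1}{\epsilon}\right)
```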
Okay. Now we have enough background; let's look at this L1 least-squares
problem. As I said before, we focus on the case m less than n, so A is a wide,
fat matrix. f is the least-squares part, and you can check that its Hessian is
just A transpose A. You can see this function is smooth, because it's just a
quadratic function: you take the largest eigenvalue of A transpose A, and this
gives you the parameter L for your quadratic upper bound. So that's good; the
smooth part is simple.
But it's not strongly convex, because the minimal eigenvalue of A transpose A
is zero, since the matrix has rank at most m. So the theory says you can only
expect a sublinear rate, one over epsilon or one over the square root of
epsilon, depending on whether you accelerate or not. That's the verdict from
the theory, if you only look at these global properties of the function.
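A quick numerical check of that point, reusing the A generated in the earlier sketch (illustrative only):

```python
# For a wide A (m < n), the largest eigenvalue of A^T A gives the smoothness
# constant L, but the smallest eigenvalue is 0 (rank at most m), so the
# least-squares part is smooth but not strongly convex.
import numpy as np

eigvals = np.linalg.eigvalsh(A.T @ A)   # eigenvalues of the Hessian A^T A, ascending
L_smooth = eigvals[-1]                   # quadratic-upper-bound parameter L
print(L_smooth, eigvals[0])              # the smallest eigenvalue is ~0
```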
But now let's look at what happens in reality. Here is a randomly generated
example with m equal to 1,000 and n equal to 5,000, and here are the proximal
gradient method and two different versions of the accelerated gradient method.
By the way, the y axis is the optimality gap, meaning how far you are from the
optimal value, and the horizontal axis is the number of iterations. As you can
see, in the beginning the proximal gradient method is kind of slow; this is a
log scale on the vertical axis, so this is a sublinear rate. The accelerated
gradient method is faster in the beginning, so if you stopped here you would
say the accelerated gradient method is the best method. But be patient and run
longer: suddenly there's a dramatic change to very fast geometric convergence
when you're getting close to the optimal point. On the other hand, the
accelerated method doesn't have this kind of phase.
So you might wonder what happened. Let's look at the sparsity, the number of
nonzeros of the iterate you get at each iteration. Here is the number of
nonzeros, and this is the iteration count. As you can see, the fast
convergence happens when the number of nonzeros is very small, like one or two
hundred; that's when you get really fast convergence. And although the
accelerated gradient method reaches that sparsity level much earlier, it
doesn't benefit as much. For this red line you can see it actually slows down
when it hits the sparse solution; I still don't quite understand that one. But
let's focus on the proximal gradient method. The observation is that it has
slow global convergence but fast local convergence when the iterate is sparse
and close to the optimum. Being close matters too: we start every method from
zero, which is very sparse, but it's far from the optimum, and you don't get
fast convergence there.
Let's look at the reason. Assume your solution is really sparse. Without loss
of generality you can partition the vector into the nonzero part and the zero
part, and partition the matrix accordingly, so only the columns corresponding
to the nonzero part are useful, and the Hessian looks like this. If you
restrict your problem to that subspace, then first of all you can get local
strong convexity: the minimum eigenvalue of the restricted Hessian can be
larger than zero. At the same time your upper bound, the restricted L, gets
much smaller, so the step size 1 over L lets you go much faster. Minimizing
the restricted problem is actually equivalent, so if you had known the subspace
ahead of time, you could solve the problem very fast. But unfortunately you do
not know the subspace, so you have to think of a way to somehow adaptively get
close to finding it.
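Here is a small sketch of that restriction argument, reusing A and x_true from the earlier snippet (the variable names are mine): restricted to the support of the sparse solution, the Hessian has a strictly positive smallest eigenvalue and a much smaller largest eigenvalue.

```python
# Restrict the Hessian A^T A to the support S of the sparse solution.
import numpy as np

S = np.flatnonzero(x_true)               # indices of the nonzero entries
A_S = A[:, S]                            # keep only the corresponding columns of A
eig_S = np.linalg.eigvalsh(A_S.T @ A_S)  # eigenvalues of the restricted Hessian
mu_local, L_local = eig_S[0], eig_S[-1]  # local strong convexity and local smoothness parameters
print(mu_local, L_local)                 # mu_local > 0, and L_local is much smaller than the global L
```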
So here is the key idea, based on these two observations: homotopy
continuation. Homotopy is a general method: you have a class of problems
parameterized by a single parameter, and the instance you want to solve is very
hard to solve directly. But if you start the parameter at a value where the
problem is very easy and then gradually change it, using each solution to
initialize your numerical algorithm for the next instance, each solve is faster
and the overall complexity can be much smaller. Here, what you want is to
always keep the algorithm in the sparse mode with fast local convergence. So
the idea is: for this L1 least-squares problem, if you use a large enough
regularization lambda, and this threshold can be easily computed, you get the
zero solution without any computation. Then you reduce lambda a little bit;
the solution is still very, very sparse, and starting from the previous
solution the proximal gradient method can get there in maybe very few steps.
You always stay in the fast regime of your numerical algorithm. Then you
reduce lambda a little more, and so on, until you finally converge at the
target lambda. The hope is that the overall complexity is much smaller. This
is not a new idea: there is existing work using similar techniques, and
superior practical performance has been reported, but the problem is that there
is no global complexity analysis. In order to give one, you need to answer two
questions: how fast do you decrease lambda, and for each lambda, how accurately
do you solve the subproblem? Do you solve it to high precision, or just run a
few iterations?
Here is the proximal gradient homotopy method based on that idea. You have two
parameters, both between zero and one. You initialize with the lambda that
gives the zero solution, and then reduce lambda by a constant factor eta; this
determines the number of outer stages you need to run. At each stage you
reduce lambda by that constant factor and solve the subproblem to very low
precision, proportional to the current regularization parameter. For the last
stage you require the full target precision epsilon, but by then you're already
in the geometric zone, so you are fast.
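Here is a minimal sketch of that scheme as I read it from the description above; it is a simplified reading, not the exact method in the paper, it reuses prox_grad_step from the earlier snippet, and the stopping rule and default parameter values are illustrative assumptions.

```python
# Proximal-gradient homotopy sketch: start at the lambda that gives the zero
# solution, cut lambda by a constant factor eta at each stage, solve each
# intermediate stage only loosely, and solve the final stage to full precision.
import numpy as np

def homotopy_prox_grad(A, b, lam_target, eta=0.7, stage_tol_factor=0.2,
                       final_tol=1e-8, max_inner=1000):
    L = np.linalg.norm(A, 2) ** 2          # smoothness constant of the least-squares part
    lam = np.max(np.abs(A.T @ b))          # at this lambda the zero vector is already optimal
    x = np.zeros(A.shape[1])
    while True:
        lam = max(eta * lam, lam_target)   # reduce lambda by the constant factor eta
        # loose tolerance for intermediate stages, tight only for the target lambda
        tol = final_tol if lam == lam_target else stage_tol_factor * lam
        for _ in range(max_inner):
            x_new = prox_grad_step(x, A, b, lam, L)
            done = np.linalg.norm(x_new - x) <= tol
            x = x_new
            if done:
                break
        if lam == lam_target:
            return x
```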
Let's look at an example, the same example as before. Here is the previous
result for the proximal gradient method, and now the black line is the homotopy
method. As I said, each intermediate stage only needs very loose precision,
maybe two or three iterations, and you actually enter the geometric phase much
earlier, right here. So you get very, very fast performance in practice.
Okay. Now we have seen all the ideas; I hope to give you a little bit of the
technical reasoning behind them. We define the restricted eigenvalues: the
largest and smallest eigenvalues of A transpose A restricted to vectors with
sparse support no larger than s, where s is less than n. The hope is that if s
is small enough and your matrix A behaves well, then this minimal restricted
eigenvalue is strictly larger than zero.
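For intuition, here is a brute-force computation of those restricted eigenvalues; it is only feasible for very small s and n, so the example matrix is tiny and purely illustrative.

```python
# Smallest and largest eigenvalues of A_S^T A_S over all supports S of size s.
import numpy as np
from itertools import combinations

def restricted_eigenvalues(A, s):
    lo, hi = np.inf, -np.inf
    for S in combinations(range(A.shape[1]), s):
        cols = A[:, list(S)]
        e = np.linalg.eigvalsh(cols.T @ cols)
        lo, hi = min(lo, e[0]), max(hi, e[-1])
    return lo, hi                            # (smallest, largest) restricted eigenvalues

rng = np.random.default_rng(0)
A_small = rng.standard_normal((8, 12)) / np.sqrt(8)
print(restricted_eigenvalues(A_small, 2))    # the smaller value should stay bounded away from zero
```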
Okay. Let's recall the previous picture; these are just the restricted
eigenvalues. The important thing is that the smallest one should be larger
than zero, so you get strong convexity in a restricted sense. Precisely: if
the union of the supports of your two vectors x and y is smaller than a certain
sparsity level s which satisfies the restricted eigenvalue condition, then you
get restricted smoothness, a quadratic upper bound, and restricted strong
convexity, a quadratic lower bound. If the lower bound parameter is larger
than zero, you've got strong convexity and you've got fast convergence. And
here is the restricted condition number on that sparse support.
So our convergence analysis basically assumes m less than n and that your
target lambda is sufficiently large, which means all your previous lambdas are
also sufficiently large, and then shows local convergence: if you start from a
sparse solution close to optimal, the proximal gradient method has geometric
convergence. The details are in the paper, so I'm not going to spell them out.
For the overall complexity, if the two parameters, both between zero and one,
satisfy this simple condition, where the delta prime in it actually comes from
the restricted eigenvalue condition, then every stage of the proximal gradient
homotopy method has geometric convergence, and the overall complexity is just m
times n, the per-iteration cost, times the number of iterations, which is of
order the restricted condition number times log of one over epsilon.
Okay, let's summarize the computational complexities for solving this sparse
least-squares problem. As we discussed before, this is the current state of
the art, and our method matches the best cost per iteration and the best
iteration complexity among these different numerical algorithms. Of course,
here we are using some special structure of the problem: it is a least-squares
objective and the matrix A satisfies a restricted eigenvalue condition.
I believe that for more general functions the numerical algorithm should still
work; in our experiments it worked well. But it's just much harder to state
the analogue of the restricted eigenvalue condition when you have a more
general convex function.
And here are some extensions. As we have seen before, using acceleration
techniques it should be possible to reduce this dependence to the square root
of the condition number. We just finished that work with [inaudible] from CMU
this summer, but unfortunately there is a hidden log factor there, a log of the
condition number. This is because the accelerated gradient method needs to
know mu, and in this case we just don't know it; we have to run an adaptive
process to estimate mu, and that costs you the extra log factor. And there are
possible extensions to fast low-rank matrix recovery, as Marian talked about;
Marian and his students are currently working on this.
Okay. Thank you.
[applause]
>> Ofer Dekel: Any questions for the speaker?

>>: Now, in addition to lambda, you have to choose the eta as well.
>> Lin Xiao: Which one?
>>: You have to choose the parameter for how fast to decrease lambda now,
right? So you have to make multiple rounds.
>> Lin Xiao: Yeah, but eta -- [inaudible] eta. But let me tell you this. That
is the theoretical analysis. In practice we did experiments with all kinds of
values of eta, and the settings with the fastest performance often did not
satisfy our restricted eigenvalue condition; they violated our condition. The
reason is that for the homotopy algorithm to work, the most important thing is
to make sure you reach the final stage, the one solved to high precision, in
the local restricted strongly convex mode. Before that, every stage is solved
to very low precision, and in that regime it's not clear whether a linear rate
would even be faster than the sublinear rate of the simple gradient algorithm.
So in practice the choice is not that important; the important thing is that
you bring the final stage into the local regime. So in practice the range of
acceptable parameter choices is much, much wider than what the theory predicts.
>> Ofer Dekel: Any further questions? Okay. Thank you, Lin.
[applause]