>> Ofer Dekel: And I'm pleased to welcome Maryam Fazel from UW, who
will talk about optimization and sparse recovery.
[applause]
>> Maryam Fazel: Thank you. Is the microphone on? Everybody hears okay? All right. So I'll be talking about recovery of simultaneously structured models using convex optimization. And this is joint work with my graduate student Amin Jalali, who is here. There's Amin. And we have collaborators at Caltech, Babak Hassibi and Samet Oymak, a graduate student at Caltech, as well as Yonina Eldar at the Technion.
So I'll also go ahead and introduce the rest of my group because they're also here. There's also Karthik Mohan. Dennis Meng. There's Dennis. Reza Eghbali, and Brian Hutchinson, who was here earlier.
But, okay, so I'm going to be talking about this general setup: we would like to fit a model to some given measurements or observations of that model.
And you have some prior information about the model, namely that this model is in some sense lower dimensional. You can think of this as actually the setting Harry was talking about, for example, a low dimensional manifold in your data, or it could be other models.
I'll be talking about models with low dimensional structure. Also you can think of it as low degrees of freedom relative to their ambient space. So they live in a high dimensional ambient space, but they have fewer degrees of freedom than the ambient dimension would suggest.
The goal is to recover such models given some kind of information or
observation or measurements of these models. And the applications of
this very broad idea come up in signal processing in the sensing and
recovery of signals, various kinds of signals in machine learning, and
in the topic of system identification and control, dynamical system
identification.
Typical questions that come up, things you're interested in, are: what kind of convex penalties, or convex regularizers, or regularizing functions can we construct to encourage our model to have the desired structure, and then how do we quantify the performance of such regularizers?
So what do we mean by these low dimensional structures or structured models? Here are some typical examples that have been studied a lot recently. A sparse vector: I have a vector in N dimensions where N is very large. It has mostly zeros in it. And let's say it has K nonzero entries.
This particular object, its measurement and its recovery, gave rise to the area of compressed sensing. It's a huge area by now. It's also the idea behind the lasso method by Tibshirani in '96, and it's been used in a lot of application areas, in particular in image denoising. And another extension of that idea, or another type of structure, is group sparse vectors. So vectors that also have a sparse structure, lots of zeros in them, but the nonzero entries come in groups.
And there's group lasso, for example, that uses that idea. Another kind of structure, a different kind -- it's not about zeros or nonzeros anymore -- is low rank matrices. A matrix that has small rank, a small number of nonzero singular values, and therefore its range space and [inaudible] space are low dimensional.
So low rank matrices come up in many areas as well, for example, collaborative filtering, an example of which is the movie recommendation system that, for example, Carlos mentioned this morning, Netflix: You want to recommend movies to users. A lot of methods for solving that problem are based on a low rank matrix model of the database matrix.
So without going into detail, I'll just quickly mention some of these. So collaborative filtering has been studied from this point of view by several people. It also comes up in control or identification of dynamical systems, system identification. Over there you look for low rank matrices.
Another structure that came kind of sequentially after this is a model that is sparse plus low rank. So it's a matrix. It has a sparse component plus a low rank component. And that came up in, for example, robust principal component analysis, where the low rank component corresponds to the principal components and the sparse component represents outliers: a sparse set of outliers that add to or modify your principal component matrix. It also comes up in graphical models and so on. There's one
more structure that's actually the focus of this talk. You can also have sparse and low rank appear differently: you can have a model that is simultaneously sparse and low rank. For example, I'm thinking of a matrix that has lots of zero rows and columns in it and is also low rank. So we will see much more about this and an application of it later.
So now the problem is this. Now that we saw what kind of structures
we're talking about, recovery of these structures, the problem is
usually set up like this. We have an unknown but structured model; let's say we model it as a vector in R^N. Again N is the ambient
dimension, can be a large number. We're given observations of that model, so we have a linear map acting on that N dimensional vector and giving us M measurements. So this is a map from R^N to R^M, where M is a lot less than N. We have an underdetermined set of linear equations on our unknown model.
Now, the problem is this, or the goal is this: Given the measurement map, which we can represent by an M by N matrix in the simple case, and given the measurements, so there are M of those, and knowing the structure type, we want to find x0, find the desired model.
And a lot of recent research has focused on the following questions for
different structures. One is how do we find the desired model given
these observations? It often requires setting up an optimization problem whose solution can be shown to be the desired model. And, by the way, we want to find it in a computationally efficient way.
And another question of a different kind: How many measurements M suffice, or how large should M be, for this to work? Recall M is the number of measurements. It's in some sense a measure of the complexity of your model, how many measurements you need to recover it.
Of course it depends on the type of measurement. So this question
really doesn't mean anything if I just say it like that. But in order
to quantify it, and for the purpose of analysis, we are going to assume generic measurements, which means that I'm going to take the measurement map G to be an M by N matrix whose entries are drawn from a Gaussian distribution, zero mean, unit variance, standard normal.
That makes the measurement matrix representative of all M by N matrices, in the sense of forming a dense open subset of matrices, and in
that sense it's generic, and a typical assumption for this kind of
analysis.
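As a rough illustration of this generic measurement setup, here is a minimal Python sketch (my own, not from the talk); the dimensions and sparsity level are arbitrary illustrative values.

```python
# A minimal sketch of the generic measurement setup: a K-sparse vector x0 in R^N
# observed through M << N measurements with an i.i.d. Gaussian measurement map G.
# Dimensions below are illustrative, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 200, 60, 5                     # ambient dimension, measurements, sparsity

x0 = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
x0[support] = rng.standard_normal(K)     # K-sparse model

G = rng.standard_normal((M, N))          # generic map: zero mean, unit variance entries
y = G @ x0                               # M linear measurements, y = G x0
```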
So what do existing results of this kind look like? You may have seen some of these. The first one, and the most famous, is recovery of sparse vectors. I have x0 that's K sparse, only K nonzeros; it lives in R^N. And I have M measurements of it. And one way I can try to recover that vector is to find the sparsest vector subject to my measurements.
So this is the L0 norm, which basically means the cardinality or the number of non-zero entries. If I could find, among all x's that satisfy the measurements, the one that is the sparsest, then I can hope to recover the K sparse one, if I have enough measurements.
But the problem is this is a nonconvex problem and I can't really solve it. But as a benchmark, let's actually ask the question: how many measurements do we need, how many M do we need, for this to work, to give us x0 as the unique solution? And the answer here, it's been studied before, and the answer is on the order of K. K is how many nonzeros are in the vector. In some sense that's the degrees of freedom, or the dimension of the low dimensional manifold, if you want to think of it that way, that this particular structured x lies on. So I only need order K observations for this program to give me the correct x0. In fact, it's 2K plus 1, but it's on the order of K. Now, this is not that useful of a result, but this one is extremely useful. I can't solve this, because this cardinality function is not convex.
However, if I relax it to this convex function, which is the L1 norm of x, the sum of absolute values of entries, now I can solve this problem because it's convex. And the same question has this answer: How many measurements do I need to recover x0 correctly? The number is on the order of K log(N/K).
So the important thing is that it depends on K much more than it depends on N. In fact, the dependence on N is very minor, very mild. It's logarithmic.
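A hedged sketch of this convex relaxation (basis pursuit), using the cvxpy modeling package; the function name and solver defaults are my own choices, not part of the talk.

```python
# min ||x||_1  subject to  G x = y  -- the L1 relaxation discussed above.
import cvxpy as cp

def recover_sparse(G, y):
    n = G.shape[1]
    x = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.norm1(x)), [G @ x == y])
    problem.solve()
    return x.value

# With M on the order of K log(N/K) Gaussian measurements, the minimizer matches
# x0 with high probability (precise constants are in the papers cited in the talk).
```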
So if you look at these two results, it's striking that the convex relaxation needs the same order of measurements, modulo the log factor, to recover the sparse vector. So when this result came out, it was actually a really big deal and it has impacted the whole field of compressed sensing. These are the main papers that initiated this line of study. Let me also say what this means, with high probability. So I'm actually making these statements when G is picked generically, and saying with high probability means that the probability of exact recovery goes to one exponentially with the number of measurements. So 1 minus e to the minus cM -- I didn't write it down, but that's what with high probability means throughout this talk.
Okay. So this is a famous result. There exists another parallel, similar type of result for another type of structure, and that's a low rank matrix. For low rank matrices we have this norm. It's called the trace norm or nuclear norm, the Schatten-1 norm, which is equal to the sum of the singular values of the matrix, and it behaves exactly like the L1 norm did for vectors.
So here I have the matrix. The rank is R, and R is much less than the dimension N. If I were to solve this nonconvex problem of finding the minimum rank matrix subject to the measurements, order of N times R observations are enough. If I actually solve the convex relaxation, which is this, it has been shown -- in fact this is from a paper by Candes and Plan '09 -- that order of NR observations are also enough for this problem to give you the same x0.
And again very striking: Same order. Even though we went from a nonconvex problem to a convex problem, order NR measurements are enough. And another thing to note is that N times R is on the order of the degrees of freedom of a rank R matrix, where by degrees of freedom I mean how many parameters, for example, you need to describe the matrix.
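In the same spirit, here is a hedged cvxpy sketch of the nuclear norm relaxation; representing the linear map as a list of measurement matrices A_i with y_i = <A_i, X> is an illustrative choice, not from the talk.

```python
# min ||X||_*  subject to  <A_i, X> = y_i  -- the nuclear norm relaxation.
import cvxpy as cp

def recover_low_rank(A_list, y, n):
    X = cp.Variable((n, n))
    constraints = [cp.trace(A.T @ X) == yi for A, yi in zip(A_list, y)]
    problem = cp.Problem(cp.Minimize(cp.normNuc(X)), constraints)
    problem.solve()
    return X.value

# Order N*R generic measurements suffice for exact recovery with high probability,
# matching (up to constants) what the nonconvex rank minimization needs.
```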
Okay. So these are two well-known results of this kind. And in this
talk we want to look at another kind of structure, simultaneous
structure.
And in many applications I know more than one piece of information
about my model. I know that the model is simultaneously structured in
more than one way. And one would hope that if I take into account all of these structures that I know at the same time, I should do better in terms of how many measurements I need for recovery, because I've reduced the degrees of freedom. The object now has several structures at the same time.
So for that, one problem to consider is the following convex relaxation. I want to consider regularizers. So these are some functions that we are going to identify whose linear combination, when minimized, will recover x0 subject to the constraints.
We are also going to allow extra information on this object, modeled by a convex cone C. Another way to look at this problem, which may be more familiar to a machine learning audience, is to think of it this way. The x is the unknown or the model I want to find. These are the measurements I have, and there's a loss function that penalizes how far I am from matching my measurements.
So it's a kind of fitting error or loss, and I have regularizers that try to encourage particular structures, and I have tau of them linearly combined, adding to each other, because I know tau things about this structure.
Think of it like this. And what we're trying to do, we're basically
trying to make a statement about sample complexity of such an estimator
to recover the correct solution.
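As a sketch of that estimator, assuming a least squares loss and, purely for illustration, two regularizers (the L1 norm and the nuclear norm) with hypothetical weights:

```python
# Loss on the measurements plus a weighted combination of structure-promoting norms.
# The least squares loss and the particular pair of norms are illustrative choices.
import cvxpy as cp

def simultaneous_estimator(G, y, n, lam1=1.0, lam2=1.0):
    X = cp.Variable((n, n))
    loss = cp.sum_squares(G @ cp.vec(X) - y)                     # fitting error
    penalty = lam1 * cp.norm1(cp.vec(X)) + lam2 * cp.normNuc(X)  # tau = 2 regularizers
    cp.Problem(cp.Minimize(loss + penalty)).solve()
    return X.value
```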
And an application of this comes up in signal processing, in particular in optics. This is a classic problem called phase retrieval. So in this problem you have a signal x0 that you make measurements of, and the measurements are linear except that after you take each linear measurement of x0 with the vectors a_i, you will lose the sign, or if it's a complex number, you will lose the phase of the measurement. So you have the absolute value of that number given to you.
So you have M measurements, but they're phaseless measurements, and the problem is we still want to recover x0. How do we do this? It can be reformulated as follows. This is not a linear constraint anymore. We can linearize it by defining a new variable.
I define a matrix that's a rank one matrix: it's the outer product of the signal x0 with itself, and then the measurements can be rewritten like this, a_i a_i transpose inner product with X, by simply squaring both sides of the equation. And now I have equations or measurements that are linear in this new object, which is a matrix.
This matrix is rank one. And it's positive semidefinite. And it has these linear constraints, which are the measurements. So this has been done in a recent paper by Candes and co-authors. And now what we are going to do: in a lot of applications we know the signal x0 we're trying to recover is also sparse.
So what does that mean about the matrix X? It's rank one and sparse at the same time. It's an application in which we have more than one structure for the same object. And one reason this comes up in optics is that when you're measuring, you're measuring intensity, and that's why we can't measure the phase. Measuring phases is very hard. It's a big problem in signal processing [inaudible] different solutions.
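A hedged sketch of this lifted formulation for sparse phase retrieval; minimizing the trace plus an elementwise L1 penalty is one natural combined relaxation, and the weight lam is a hypothetical choice, not a value from the talk.

```python
# Lifted phase retrieval: X = x0 x0^T is PSD, rank one, and sparse, and the
# phaseless measurements b_i = |<a_i, x0>|^2 become linear in X.
import cvxpy as cp
import numpy as np

def lifted_sparse_phase_retrieval(a_list, b, n, lam=0.1):
    X = cp.Variable((n, n), PSD=True)
    constraints = [cp.sum(cp.multiply(np.outer(a, a), X)) == bi
                   for a, bi in zip(a_list, b)]
    objective = cp.trace(X) + lam * cp.sum(cp.abs(X))   # low rank + sparsity surrogates
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return X.value
```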
Okay. What are our results? We are going to give a theoretical analysis of general simultaneous structures. And we are going to show that combined convex penalties have a fundamental limitation. We're going to specifically restrict ourselves to the case of sparse and low rank matrices and show that if I write down a convex and a nonconvex problem for recovery, unlike the cases we saw for sparse vectors and low rank matrices, there's a large gap.
So I need to define -- I'll have one definition to make and then one
lemma and one theorem. So the definition is this one. We are going to
look at particular types of norms -- time is very limited. I'll go
fast on this one. But we're basically going to look at regularizers
that are norms. These norms have a certain property.
A norm is called decomposable at a particular point x if it has the following property: there exists a subspace T, which acts like the support, and there exists a vector E, which acts like the sign, such that all subgradients of the norm at the point x can be decomposed into these two pieces. So this is the projection onto T and this is the projection onto the orthogonal complement of the subspace T. Maybe it's best to see it in an example.
If I'm talking about, for example, sparse vectors, like this point in two dimensions: this is the unit ball of the L1 norm, and at this point, which is the point (1, 0), the sign vector is going to be the unit vector (1, 0). And, for example, if I were at this point it would be (-1, 0). It captures the sign. And this subspace T is going to be this axis, and T perp is going to be the orthogonal complement, which is this axis. And g is, for example, a subgradient of the L1 norm at the point x0.
So the norm is not differentiable at that point, therefore I have a whole set of subgradients, which is this cone. g is one of those. And decomposability says g can be written as E plus the projection of g onto T perp. And this projection has dual norm less than one. In this case that means infinity norm less than one, which means the projection lies between minus one and one on this axis.
That's what decomposability means for the L1 norm, but actually it holds for a bunch of other norms, too. For example, the L1,2 norm, other mixed norms, the nuclear norm, and so on.
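Here is a small numpy check (my own, not from the talk) of decomposability for the L1 norm at the point x0 = (1, 0) used in the example above.

```python
# At x0 = (1, 0), the support subspace T is the first axis, the sign vector is
# E = (1, 0), and any subgradient g = (1, t) with |t| <= 1 decomposes as
# g = E + P_{T_perp}(g), with the dual (infinity) norm of the second piece <= 1.
import numpy as np

E = np.array([1.0, 0.0])
g = np.array([1.0, 0.7])              # one subgradient of the L1 norm at (1, 0)

P_T = np.diag([1.0, 0.0])             # projection onto T
P_Tperp = np.eye(2) - P_T             # projection onto T perp

assert np.allclose(P_T @ g, E)                        # part of g in T equals the sign vector
assert np.linalg.norm(P_Tperp @ g, np.inf) <= 1.0     # dual norm condition on the rest
```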
Using that definition, we're going to have a family of such norms, tau of these norms. And I'm going to denote the sign vector corresponding to every norm by E_i, the support by T_i, and the most important object we need to define is the intersection of all the support spaces: T cap is the intersection of all the T_i's.
And the projected signs -- the sign projected onto T cap is denoted by this -- I think the battery may not [inaudible] --
>>: Green button.
>> Maryam Fazel: Oh, thank you. So the projection of E_i onto T cap is denoted by this notation, and these are some angles between the E cap and the E_i. So
what does that mean? In this example, suppose this is x0. It has two structures simultaneously: the first structure has this support space T1, and it has the sign vector E1 that lies in T1. The second one has this support space T2 and sign vector E2. x0 is in the intersection of the two spaces, which is this line, which is T cap.
And then the projected sign vectors are going to be these two, projected onto T cap. The reason we're defining this is that in order to analyze the performance of such methods, we need a way to capture the geometry of each individual norm plus the relative geometry, how they relate to each other.
And so we are going to look at this optimization problem: minimizing a weighted sum of these norms, subject to the linear equality measurements.
And here the lambda_i are regularization parameters. And one result we can show, just based on the optimality conditions of this optimization problem, is that if the number of measurements M is less than this bound, then the program will not recover x0 with very high probability. So recovery will fail with very high probability.
Now, this quantity depends on this function here; it says that among the subgradients, you look at the one with the smallest projection onto T cap. This is not so informative, but let me actually specialize it with one more assumption.
but let me actually specialize it with one more assumption.
If I make the extra assumption that the inner products between all the projected signs are positive, for all pairs i and j -- if you're interested in how we justify this, I can tell you later, but it's not a bad assumption, we can justify it -- then with this assumption our result actually simplifies to the following: If M is less than a constant times the minimum of the dimensions of the T_i's, so among all these T's I take the one that has the smallest dimension, if M is less than that value, then recovery fails with very high probability.
So that's actually the main point of our result, which is that the bottleneck in recovery, the lower bound on how many measurements you need for it to even be able to recover the answer, is going to depend on the minimum dimension of the T_i's, rather than, as one would have hoped, on the dimension of T cap, because that's where x0 belongs and that's what captures the real degrees of freedom of x0.
We can also handle additional cone constraints, which I didn't include in this result. So that is the surprising thing: our bottleneck is going to be the minimum dimension of the T_i's, not the dimension of the intersection, as one would have hoped. This is about the constant that appeared there; we don't care about it now. And if we actually specialize this result to the case of sparse and low rank matrices, we will see the gap clearly.
So the matrix of size N by N, rank R, which is nonzero only on a K by K
sub matrix, has this many true degrees of freedom, true number of
parameters to represent that matrix.
If I solve the nonconvex optimization problem like that -- and here this means the number of non-zero columns and this is the number of non-zero rows, I'm using the mixed norms to encourage the block structure, and this is the rank -- if you solve this problem, we need this many measurements to recover x0.
If I solve the convex relaxation, including the PSD constraint, I'm going to require order of N times R. So the gap between the two: this one depends on the ambient dimension, linear in N, while this one is linear in K and only depends logarithmically on N. A very large gap.
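For concreteness, a hedged cvxpy sketch of the convex relaxation just described for a simultaneously block-sparse and low rank PSD matrix: mixed L1,2 norms on the rows and columns plus the trace, subject to linear measurements (the weights are hypothetical).

```python
# Combined convex penalty for a simultaneously sparse and low rank PSD matrix.
import cvxpy as cp

def combined_penalty_recovery(A_list, y, n, lam1=1.0, lam2=1.0):
    X = cp.Variable((n, n), PSD=True)
    row_l12 = cp.sum(cp.norm(X, 2, axis=1))   # encourages zero rows
    col_l12 = cp.sum(cp.norm(X, 2, axis=0))   # encourages zero columns
    constraints = [cp.trace(A.T @ X) == yi for A, yi in zip(A_list, y)]
    objective = lam1 * (row_l12 + col_l12) + lam2 * cp.trace(X)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return X.value

# The lower bound in the talk says that, with generic Gaussian measurements, any
# such combination needs a number of measurements linear in N, whereas the
# nonconvex program needs only on the order of K (up to log factors).
```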
If you contrast this, for example -- so there are more such results, and actually we can verify this numerically as well with experiments, but I'm out of time; ask me later if you would like to know. But to summarize, we would like to have regularizers that recover simultaneously structured objects. The most common thing to do is to take a combination of known penalties or norms that promote each of these structures and minimize that. However, surprisingly, such a program has bad performance in the sense that it requires many more generic measurements than what you would expect based on the degrees of freedom of the object. Contrast this with how well L1 works compared to cardinality and how well trace works compared to rank: there were no gaps in those recovery results. But simultaneous recovery has a big gap.
And, finally, so this is our result, but we have a lot of things to continue after this. Some of these are, for example: here I assumed Gaussian random measurements, but in the phase retrieval problem the measurements have this form, so we need to extend the results to that more general case. We also would like to prove that when we say recovery fails, it fails badly, in the sense that the solution you find is at least a certain distance away from the true solution.
So it's not failing in the sense of still being very close to the true solution. Okay. I'm completely out of time. So I'm going to not talk
about algorithms and other applications, but if you have questions,
I'll be happy to describe more. Thank you.
[applause]
>> Ofer Dekel:
Time for a quick question or two for Maryam.
>>: Is there a goal to the -- maybe there's some other way [inaudible]
X knot and [inaudible] other than by the combination [inaudible].
>> Maryam Fazel: Yeah, very good question. So that's a question we would like to address. It's not obvious. I mean, you would have to go back and see, for example: Can you define new atoms or unit objects that capture both structures at the same time? Basically you try to describe the intersection space from scratch.
And not think of it as the intersection of this space and that space, because if you do that, we showed all such combinations will fail. So is there another approach? In general, it's probably hard to answer, but in specific cases one could approach them one by one and see if you can construct the correct norm in some sense.
>> Ofer Dekel:
One last question.
>>: Question about special case of sparse matrix. If you have
[inaudible] variable that's subject to [inaudible] co-variable, so use
of special case of structural, could be retrievable special structure
or not.
>> Maryam Fazel: I'm sorry, I didn't --
>>: Let's say I have a predictive model and task. I have categorical variables.
>> Maryam Fazel: Categorical --
>>: Categorical variable to number of categories minus one [inaudible]
and [inaudible] happens only once number of categories.
>> Maryam Fazel: An indicator.
>>: A little bit sparse. Not absolutely sparse. But [inaudible] sparse. Is this a special case where your technique would be applicable?
>> Maryam Fazel: I'm not sure. So I'll talk to you afterwards in more detail. But approximately sparse signals have also been studied. It doesn't need to be exactly sparse. And a lot of these results can be stated for approximately structured models. So, for example, close to low rank or close to sparse.
>> Ofer Dekel: Let's thank the speaker.
[applause]