>> Ofer Dekel: And I'm pleased to welcome Maryam Fazel from UW, who will talk about optimization and sparse recovery.

>> Maryam Fazel: [applause] Thank you. Is the microphone on? Everybody hears, okay. All right. So I'll be talking about recovery of simultaneously structured models using convex optimization. This is joint work with my graduate student Amin Jalali, who is here. There's Amin. And we have collaborators at Caltech, Babak Hassibi and Samet Oymak, a graduate student at Caltech, as well as Yonina Eldar at the Technion. I'll also go ahead and introduce the rest of my group because they're also here: Karthik Mohan, Dennis Meng, there's Dennis, Reza Eghbali, and Brian Hutchinson, who was here earlier.

Okay, so I'm going to be talking about this general setup: we would like to fit a model to some given measurements or observations of that model, and we have what you can think of as prior information about the model, namely that the model is in some sense lower dimensional. You can think of this as the setting Harry was talking about, for example a low dimensional manifold in your data, or it could be other models. I'll be talking about models with low dimensional structure. You can also think of it as low degrees of freedom relative to the ambient space: the models live in a high dimensional ambient space, but they have fewer degrees of freedom than the ambient dimension would suggest. The goal is to recover such models given some kind of information, observations, or measurements of these models. The applications of this very broad idea come up in signal processing, in the sensing and recovery of various kinds of signals, in machine learning, and in system identification and control, the identification of dynamical systems. Typical questions that come up are: what kind of convex penalties, or convex regularizers or regularizing functions, can we construct to encourage our model to have the desired structure, and then how do we quantify the performance of such regularizers?

So what do we mean by these low dimensional structures or structured models? Here are some typical examples that have been studied a lot recently. First, a sparse vector: I have a vector in N dimensions, where N is very large, that has mostly zeros in it, say K nonzero entries. The measurement and recovery of this particular object gave rise to the area of compressed sensing, which is a huge area by now. It's also the idea behind the lasso method by Tibshirani in '96, and it's been used in a lot of application areas, in particular in image denoising. An extension of that idea, another type of structure, is group sparse vectors: vectors that also have a sparse structure, lots of zeros in them, but the zeros come in groups. Group lasso, for example, uses that idea. Another kind of structure, a different kind that is not about zeros or nonzeros anymore, is low rank matrices: a matrix that has small rank, a small number of nonzero singular values, and therefore its range space and [inaudible] space are low dimensional. Low rank matrices come up in many areas as well, for example collaborative filtering, an example of which is the movie recommendation system that Carlos mentioned this morning, Netflix: you want to recommend movies to users. A lot of methods for solving that problem are based on a low rank matrix model of the database matrix.
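[Editor's note: to make the low-degrees-of-freedom idea concrete, here is a minimal numpy sketch of the three structure classes just described (sparse, group sparse, low rank). The dimensions, sparsity level, group sizes, and rank are illustrative choices, not values from the talk.]

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, r = 1000, 10, 3            # ambient dimension, sparsity, rank (all illustrative)

# Sparse vector: lives in R^n but has only k nonzero entries (~k degrees of freedom).
x_sparse = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_sparse[support] = rng.standard_normal(k)

# Group sparse vector: the zeros come in whole groups; only a few groups are active.
groups = np.arange(n).reshape(100, 10)            # 100 groups of size 10
x_group = np.zeros(n)
for g in rng.choice(100, size=2, replace=False):  # 2 active groups
    x_group[groups[g]] = rng.standard_normal(10)

# Low rank matrix: n-by-n entries but rank r, so roughly 2*n*r parameters, not n**2.
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
X_lowrank = U @ V.T
```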
So without going into detail, I'll just quickly mention some of these. Collaborative filtering has been studied from this point of view by several people. Low rank matrices also come up in the control and identification of dynamical systems; there you look for low rank matrices. Another structure that came along somewhat sequentially after this is a model that is sparse plus low rank: a matrix that has a sparse component plus a low rank component. That came up, for example, in robust principal component analysis, where the low rank component corresponds to the principal components and the sparse component represents outliers, a sparse set of outliers that add to or modify your principal component matrix. It also comes up in graphical models and so on.

There's one more structure, and it's actually the focus of this talk. Sparse and low rank can also appear differently: you can have a model that is simultaneously sparse and low rank. For example, I'm thinking of a matrix that has lots of zero rows and columns in it and is also low rank. We will see much more about this and its applications later.

So now the problem is this. Now that we have seen what kind of structures we're talking about, the recovery problem is usually set up as follows. We have an unknown but structured model; let's say we model it as a vector in R^N. Again, N is the ambient dimension and can be a large number. We're given observations of that model: a linear map acting on that N dimensional vector and giving us M measurements. So this maps R^N to R^M, where M is much less than N, and we have an underdetermined set of linear equations on our unknown model. Now the goal is this: given the measurement map, which in the simple case we can represent by an M by N matrix, given the measurements, so there are M of those, and knowing the structure type, we want to find x0, the desired model.

A lot of recent research has focused on the following questions for different structures. One is: how do we find the desired model given these observations? It often requires setting up an optimization problem whose solution can be shown to be the desired model, and, by the way, we want to find it in a computationally efficient way. Another question of a different kind: how many measurements M suffice, or how large should M be, for this to work? Recall M is the number of measurements; it is in some sense a measure of the complexity of your model, how many measurements you need to recover it. Of course it depends on the type of measurement, so this question really doesn't mean anything if I just state it like that. In order to quantify it, and for the purpose of analysis, we are going to assume generic measurements, which means I'm going to take the measurement map G to be an M by N matrix whose entries are drawn from a Gaussian distribution, zero mean, unit variance, standard normal. That makes the measurement matrix representative of all M by N matrices in the sense of forming a dense open subset of matrices; in that sense it's generic, and it's a typical assumption for this kind of analysis.

So what do existing results of this kind look like? You may have seen some of these. The first one, and the most famous, is recovery of sparse vectors. I have x0 that is K sparse, only K nonzeros; it lives in R^N, and I have M measurements of it.
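[Editor's note: a quick numpy sketch of the generic measurement setup just described: a K-sparse x0 in R^N, a measurement map G with i.i.d. standard normal entries, and M linear measurements y = G x0. All sizes are illustrative, not values from the talk.]

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 5                        # ambient dimension and sparsity (illustrative)
m = 60                               # number of measurements, m much smaller than n

x0 = np.zeros(n)                     # unknown structured model: a k-sparse vector
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)

G = rng.standard_normal((m, n))      # generic measurement map: i.i.d. N(0, 1) entries
y = G @ x0                           # m linear measurements, an underdetermined system
```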
And one way I can try to recover that vector is to find the sparsest vector subject to my measurements. So this is the L0 norm, which basically means the cardinality, the number of nonzero entries. If I could find, among all x's that satisfy the measurements, the one that is the sparsest, then I can hope to recover the K sparse one, if I have enough measurements. But the problem is that this is a nonconvex problem and I can't really solve it. Still, as a benchmark, let's ask how many measurements we need, how many M we need, for this to work, for the unique solution to be x0. The answer, which has been studied before, is on the order of K, where K is how many nonzeros are in the vector. In some sense that's the degrees of freedom, or the dimension of the low dimensional manifold, if you want to think of it that way, that this particular structured x lies on. So I only need order K observations for this program to give me the correct x0. In fact it's 2K plus 1, but it's on the order of K.

Now, this is not that useful of a result, but this one is extremely useful. I can't solve the first problem, because the cardinality function is not convex. However, if I relax it to this convex function, the L1 norm of x, the sum of absolute values of the entries, now I can solve the problem because it's convex. And the same question has this answer: how many measurements do I need to recover x0 correctly? The number is on the order of K log(N/K). The important thing is that it depends on K much more than it depends on N; in fact, the dependence on N is very mild, it's logarithmic. So if you look at these two results, it's striking that the convex relaxation needs the same order of measurements, modulo the log factor, to recover the sparse vector. When this result came out, it was a really big deal, and it has impacted the whole field of compressed sensing. These are the main papers that initiated the study. Let me also say what "with high probability" means here. I'm making these statements when G is picked generically, and with high probability means that the probability of exact recovery goes to one exponentially in the number of measurements, so 1 minus e to the minus cM. I didn't write it down, but that's what "with high probability" means throughout this talk.

Okay. So this is a famous result. There is another, parallel, similar type of result for another type of structure, the low rank matrix. For low rank matrices we have this norm, called the trace norm or nuclear norm, the Schatten-1 norm, which is equal to the sum of the singular values of the matrix, and it behaves exactly like the L1 norm did for vectors. So here I have a matrix of rank R, where R is much less than the dimension N. If I were to solve the nonconvex problem of finding the minimum rank matrix subject to the measurements, order of N times R observations are enough. If I instead solve the convex relaxation, which is this, it has been shown, in fact this is from a paper by Candes and Plan '09, that order of NR observations are also enough for this problem to give you the same x0. Again, very striking: the same order, even though we went from a nonconvex problem to a convex problem, order NR measurements are enough. Another thing to note is that N times R is on the order of the degrees of freedom of a rank R matrix, where by degrees of freedom I mean how many parameters you need to describe the matrix. Okay.
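[Editor's note: a minimal sketch of the two convex relaxations just discussed, written with cvxpy, one possible modeling tool and not necessarily what the speaker uses; problem sizes and the random data are illustrative only.]

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)

# L1 relaxation for a k-sparse vector: minimize ||x||_1 subject to G x = y.
n, k, m = 200, 5, 60
x0 = np.zeros(n)
x0[:k] = rng.standard_normal(k)
G = rng.standard_normal((m, n))
y = G @ x0
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [G @ x == y]).solve()
print("sparse recovery error:", np.linalg.norm(x.value - x0))

# Nuclear norm relaxation for a rank-r matrix: minimize ||X||_* s.t. <A_i, X> = y_i.
nn, r, mm = 20, 2, 250
X0 = rng.standard_normal((nn, r)) @ rng.standard_normal((r, nn))
A = rng.standard_normal((mm, nn, nn))
yy = np.array([np.sum(Ai * X0) for Ai in A])
X = cp.Variable((nn, nn))
constraints = [cp.sum(cp.multiply(Ai, X)) == yi for Ai, yi in zip(A, yy)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()
print("low-rank recovery error:", np.linalg.norm(X.value - X0))
```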
So these are two well-known results of this kind. In this talk we want to look at another kind of structure, simultaneous structure. In many applications I know more than one piece of information about my model; I know that the model is simultaneously structured in more than one way. One would hope that if I take into account all of these structures at the same time, I should do better in terms of how many measurements I need for recovery, because I've reduced the degrees of freedom; the object now has several structures at the same time.

So one problem to consider is the following convex relaxation. I want to consider regularizers, some functions that we are going to identify, whose linear combination, when minimized subject to the constraints, will recover x0. We are also going to allow extra information on this object, modeled by a convex cone C. Another way to look at this problem, which may be more familiar to a machine learning audience, is to think of it this way: x is the unknown, the model I want to find, these are the measurements I have, and there's a loss function that penalizes how far I am from matching my measurements, so it's a kind of fitting error or loss. And I have regularizers that try to encourage particular structures, tau of them linearly combined, added to each other, because I know tau things about the structure. Think of it like this. What we're trying to do is basically make a statement about the sample complexity of such an estimator for recovering the correct solution.

An application of this comes up in signal processing, in particular in optics, and it's a classic problem called phase retrieval. In this problem you have a signal x0 that you make measurements of, and the measurements are linear except that after you take the linear measurement of x0 with the vectors a_i, you lose the sign, or, if it's a complex number, you lose the phase of the measurement. So you are given the absolute value of that number. You have M measurements, but they're phaseless measurements, and the problem is that we still want to recover x0. How do we do this? It can be reformulated as follows. This is not a linear constraint anymore, but we can linearize it by defining a new variable: I define a matrix that is a rank one matrix, the outer product of the signal x0 with itself, and then the measurements can be rewritten like this, the inner product of a_i a_i transpose with X equals the measurement squared, by simply squaring both sides of the equation. Now I have equations, or measurements, that are linear in this new object, which is a matrix. This matrix is rank one, it is positive semidefinite, and it satisfies these linear constraints, which are the measurements. This has been done in a recent paper by Candes and co-authors. Now, what we are going to do is note that in a lot of applications we know the signal x0 we're trying to recover is also sparse. So what does that mean about the matrix X? It's rank one and sparse at the same time. It's an application in which we have more than one structure for the same object. One reason this comes up is that in optics, when you're measuring, you're measuring intensity, and that's why you can't measure the phase; measuring phase is very hard. It's a big problem in signal processing [inaudible] different solutions.

Okay. What are our results? We are going to give a theoretical analysis of general simultaneous structures.
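[Editor's note: a hedged cvxpy sketch of the lifted formulation just described. The combined objective, trace of the PSD variable plus an elementwise L1 term weighted by a hypothetical parameter lam, is one natural way to encourage "rank one and sparse" simultaneously; it is an illustration of the idea, not necessarily the exact program from the talk or the Candes et al. paper.]

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 30, 3, 120                   # signal length, sparsity, measurements (illustrative)
x0 = np.zeros(n)
x0[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n))        # measurement vectors a_i (rows of A)
y = (A @ x0) ** 2                      # phaseless measurements |a_i^T x0|^2

# Lifted variable X stands in for x0 x0^T: PSD, ideally rank one and sparse.
X = cp.Variable((n, n), PSD=True)
constraints = [a @ X @ a == yi for a, yi in zip(A, y)]
lam = 0.1                              # hypothetical trade-off between the two penalties
prob = cp.Problem(cp.Minimize(cp.trace(X) + lam * cp.sum(cp.abs(X))), constraints)
prob.solve()

# Estimate x0 (up to a global sign) from the leading eigenvector of the solution.
w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0.0)) * V[:, -1]
```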
And we are going to show that the combined convex penalties have a fundamental limitation. We're going to specifically restrict ourselves to the case of sparse and low rank matrices and show that if I write down a convex and a nonconvex problem for recovery, then, unlike the cases we saw for sparse vectors and low rank matrices, there's a large gap.

So I have one definition to make, and then one lemma and one theorem. The definition is this one. We are going to look at particular types of norms -- time is very limited, so I'll go fast on this one -- basically regularizers that are norms with a certain property. The property is called decomposability: a norm is decomposable at a particular point x0 if the following holds. There exists a subspace T, which acts like the support, and a vector E, which acts like the sign, such that every subgradient of the norm at the point x0 splits into these two pieces: the projection onto T and the projection onto the orthogonal complement of the subspace T.

Maybe it's best to see it in an example. If I'm talking about sparse vectors, take this point in two dimensions; this is the unit ball of the L1 norm, and at this point, which is the point (1, 0), the sign vector is going to be the unit vector (1, 0). If I were at this point instead, it would be (-1, 0); it captures the sign. The subspace T is going to be this axis, and T perp is going to be the orthogonal complement, which is this axis. G is, for example, a subgradient of the L1 norm at the point x0. The norm is not differentiable at that point, so I have a whole set of subgradients, which is this cone, and G is one of those. Decomposability says G can be written as E plus the projection of G onto T perp, and this projection has dual norm at most one. In this case that means infinity norm at most one, meaning the projection lies between minus one and one on this axis. That's what decomposability means for the L1 norm, but it actually holds for a bunch of other norms too, for example the L1-2 norm, other mixed norms, the nuclear norm, and so on.

Using that definition, we're going to have a family of such norms, tau of them. I'm going to denote the sign vector corresponding to each norm by E_i and the support by T_i, and the most important object we need to define is the intersection of all the support spaces: T cap is the intersection of all the T_i. And the projected signs, the projections of the E_i onto T cap, are denoted by this -- I think the battery may not [inaudible] oh, thank you.

>>: Green button.

>> Maryam Fazel: So the projection of E_i onto T cap is denoted by this notation, and these are the angles between the projected signs and the E_i. So what does that mean? In this example, suppose this is x0 and it simultaneously has two structures. The first structure has this support space T1 and a sign vector E1 that lies in T1. The second one has this support space T2 and sign vector E2. x0 is in the intersection of the two spaces, which is this line, which is T cap. The projected sign vectors are these two signs projected onto T cap. The reason we're defining all this is that in order to analyze the performance of such methods, we need a way to capture the geometry of each individual norm plus the relative geometry, how they relate to each other.
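[Editor's note: for reference, here is the decomposability condition from the talk written out in LaTeX; the notation (T, e, the dual norm) is reconstructed from the spoken description and may differ cosmetically from the paper's.]

```latex
% A norm \|\cdot\| is decomposable at x_0 if there exist a subspace T (the "support")
% and a vector e \in T (the "sign") such that every subgradient splits as described:
\[
  \partial \|x_0\| \;\subseteq\;
  \bigl\{\, g \;:\; \mathcal{P}_{T}(g) = e, \;\; \|\mathcal{P}_{T^{\perp}}(g)\|^{*} \le 1 \,\bigr\},
\]
% where \mathcal{P}_T denotes orthogonal projection onto T and \|\cdot\|^* is the dual norm.
% Example: for the \ell_1 norm at x_0 = (1, 0), T is the first coordinate axis,
% e = (1, 0), and every subgradient has the form g = (1, g_2) with |g_2| \le 1.
```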
And so we are going to look at this optimization problem: minimizing a sum of norms subject to the linear equality measurements, where the lambda_i are regularization parameters. One result we can show, just based on the optimality conditions of this optimization problem, is that if the number of measurements M is less than this bound, then the program will not recover x0; recovery will fail with very high probability. Now, this quantity depends on the function here, which says that among the subgradients you look at the one with the smallest projection onto T cap. That is not so informative on its own, so let me specialize it with one more assumption. If I make the extra assumption that the inner products between all the projected signs are positive, for all pairs i and j -- if you're interested in how we justify this, I can tell you later, but it's not a bad assumption and we can justify it -- then our result simplifies to the following: if M is less than a constant times the minimum of the dimensions of the T_i, so among all these T's I take the one with the smallest dimension, then recovery fails with very high probability.

That's actually the main point of our result: the bottleneck in recovery, the lower bound on how many measurements you need to even be able to recover the answer, depends on the minimum dimension of the T_i, rather than, as one would have hoped, on the dimension of T cap, because that's where x0 lives and that is what captures the real degrees of freedom of x0. We can also handle additional cone constraints that I didn't include in this result. So that is the surprising thing: our bottleneck is the minimum dimension of the T_i, not the dimension of the intersection, as one would have hoped. This slide is about the constant that appeared there; we don't need to worry about it now.

If we specialize this result to the case of sparse and low rank matrices, we will see the gap clearly. A matrix of size N by N, of rank R, which is nonzero only on a K by K submatrix, has this many true degrees of freedom, the true number of parameters needed to represent the matrix. If I solve a nonconvex optimization problem like this, where this term is the number of nonzero columns, this is the number of nonzero rows, I'm using the mixed norms to encourage the block structure, and this is the rank, then I need this many measurements to recover x0. If I solve the convex relaxation, including the PSD constraint, I'm going to require order of N times R. So look at the gap between the two: one of them depends on the ambient dimension, linear in N, and the other is linear in K and depends only logarithmically on N. A very large gap. There are more such results, and we can verify this numerically with experiments as well, but I'm out of time; ask me later if you would like to know.

But to summarize: we would like regularizers that recover simultaneously structured objects, and the most common thing to do is to take a combination of known penalties or norms that promote each of these structures and minimize that.
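[Editor's note: to keep the statement in view, here is a hedged LaTeX summary of the combined-penalty program and the negative result just described; constants are suppressed and the notation is reconstructed from the talk.]

```latex
% Combined-penalty recovery program, with regularization weights \lambda_i > 0
% and an optional convex cone constraint x \in C:
\[
  \underset{x \in C}{\operatorname{minimize}} \;\; \sum_{i=1}^{\tau} \lambda_i \, \|x\|_{(i)}
  \quad \text{subject to} \quad \mathcal{G}(x) = \mathcal{G}(x_0).
\]
% Negative result (under the assumption that the projected signs have pairwise
% positive inner products): recovery of x_0 fails with high probability whenever
\[
  m \;\le\; c \cdot \min_{i} \, \dim(T_i),
\]
% even though the true degrees of freedom of x_0 are governed by the potentially
% much smaller \dim(T_\cap) = \dim\bigl(\bigcap_i T_i\bigr).
```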
However, surprisingly, such a program has bad performance, in the sense that it requires many more generic measurements than what you would expect based on the degrees of freedom of the object. Contrast this with how well the L1 norm works compared to cardinality and how well the trace norm works compared to rank: there were no gaps in those recovery results, but simultaneous recovery has a big gap.

And finally, this is our result, but we have a lot of things to continue after this. For example, here I assumed Gaussian random measurements, but in the phase retrieval problem the measurements have this special form, and we need to extend the results to that more general case. We also would like to prove that when we say recovery fails, it fails badly, in the sense that the solution you find is at least a certain distance away from the true solution; you're not failing while being very close to the true solution. Okay, I'm completely out of time, so I'm not going to talk about algorithms and other applications, but if you have questions I'll be happy to describe more. Thank you.

[applause]

>> Ofer Dekel: Time for a quick question or two for Maryam.

>>: Is there a goal to the -- maybe there's some other way [inaudible] x0 and [inaudible] other than by the combination [inaudible].

>> Maryam Fazel: Yeah, very good question. That's a question we would like to address, and it's not obvious. I mean, you would have to go back and see, for example: can you define new atoms or unit objects that capture both structures at the same time? Basically you try to describe the intersection space from scratch, and not think of it as the intersection of this space and that space, because if you do that, we showed all such combinations will fail. So is there another approach? In general it's probably hard to answer, but in specific cases one could approach them one by one and see if you can construct the correct norm in some sense.

>> Ofer Dekel: One last question.

>>: A question about a special case of a sparse matrix. If you have [inaudible] variable that's subject to [inaudible] covariate, so a special case of structure, could that special structure be recoverable or not?

>> Maryam Fazel: I'm sorry, I didn't --

>>: Let's say I have a predictive model and categorical variables.

>> Maryam Fazel: Categorical --

>>: A categorical variable [inaudible] to the number of categories minus one [inaudible] and [inaudible] happens only once per number of categories.

>> Maryam Fazel: So an indicator.

>>: A little bit sparse. Not absolutely sparse. But [inaudible] sparse. Is this a special case where your technique would be applicable?

>> Maryam Fazel: I'm not sure, so I'll talk to you afterwards in more detail. But approximately sparse signals have also been studied; the signal doesn't need to be exactly sparse, and a lot of these results hold for approximately structured models, for example close to low rank or close to sparse.

>> Ofer Dekel: Let's thank the speaker. [applause]