>> John Platt: It's my pleasure to introduce Ben Recht. As you can see from the slide, he is a professor at the University of Wisconsin at Madison studying one of the possible computer sciences. I suspect that is data science. He's nodding, okay. Previously he was a postdoc at Caltech and before that he got his PhD from the Media Lab at MIT, and so here's Ben. >> Ben Recht: Thanks John. It's great to be here today. I did pick this title to try to make it as mathematically intimidating as possible, but I think actually a more appropriate one for today is simpler; we don't have to be as lofty as the original title suggested. Let's just give some basic ideas of how we might make predictions when we are short on information. What I mean by this is I want to talk about a very classical, canonical problem that we're always faced with, which is that we collect a ton of data. We always do this, and we would like to make some kind of inference about the data, and even though we have gigabytes and gigabytes or terabytes of data, we still have a lot of missing data and a lot of missing information. Somehow we would like to make inferences about what is not there. So the idea today is how to incorporate structure that we know, what is the appropriate way, or an appropriate way, to incorporate structure that we know into these problems to make them well posed and such that we can solve them efficiently. So let me start by thanking my collaborators. As you'll see there is a bit of a dichotomy here. I have a lot of folks at Wisconsin who I'm working with on theory, and I have one person who I think a lot of people in the room know who I am working with on algorithms, and I guess we just need one. Really, once you get Chris on a project you don't need more than one algorithms person. He's just kind of a machine, so that's good.
So let me begin with a motivating example that is very dear to my heart, and that is recommender systems, since this is a problem I've been thinking about for quite some time, notably because Amazon won't let me stop thinking about it. Every time I go to that website they try to make me buy stuff based on my click records, and of course where I really got interested in this problem was when Netflix offered a lot of money to improve their recommendation system. Notably, this was a snapshot I took of my page many, many years ago, where they recommended this movie Stalker. It's a Tarkovsky movie. Is it Tarkovsky? I can't remember. It's a Russian movie where basically people walk around an abandoned quarry and throw a rock, and yeah, yeah, and so apparently based on liking these other three movies, this was the logical recommendation, and the question is how did Netflix make that decision. My wife, who is an art and film historian, suggested that if you take these three movies, you put them into a blender and then you skim the plot off the top [laughter], you get Stalker at the end; that makes sense. Of course, the other place where we have these things all over the place is on internet dating sites, which I've become much more of a believer in since my sister found a very nice husband [laughter] using this service. Now I actually do believe that this stuff works. So in all of these cases and all of these web services, we can only capture a small fraction of the information that we need about all of the users. So the question really becomes how are we actually going to make good quality predictions to keep people coming back. And of course, as I mentioned, Netflix tried, and that's how I got interested in the first place. Even though I didn't do very well at getting this million dollars, they did offer a lot of money, and they got a lot of these machine learning folks thinking about how we should actually go and design large scalable recommendation systems.
So these are the seven guys who won. I think everybody knows these guys by now. The brilliant idea here is that they got the entire community thinking and only had to pay seven guys a million dollars for 3 1/2 years of work. I think Netflix wins, [laughter] regardless of what the outcome was. So again, just a reminder about how this worked: they gave out 100 million ratings, each on a scale of 1 to 5, and you had to predict the missing ones with some high accuracy. The number of movies is about 20,000, the number of users about 500,000, and if I just multiply those numbers together I get about 8 billion. So I have a matrix that should have about 8 billion entries, and I get to see 100 million, and now the question is just what do I do. How do I fill in those missing entries? So the abstract version, the way that we kind of look at it, I mean this is a very common way of abstracting this problem, is to say that we want to complete a matrix. Matrix completion is kind of a classic numerical analysis problem that's been around in lots of different forms for a long time, and it's just saying that if I have some partial information, maybe filled in in black here in some matrix, how do I infer the white? So obviously we need some structure here, right, because you could just put anything in there and it would be a valid completion. In the case of the Netflix problem, a reasonable kind of completion would be a low rank one, because we all believe that the first thing to try, if we have a full data matrix and we want to make a good predictor, is to run some PCA. So if we have a lot of missing entries, maybe it would also be a good idea to try to run a PCA even though the entries are missing.
The reason why this is particularly relevant in the case where we have missing data is that if we just do parameter counting, the X matrix has k times n entries, which is that 8 billion I showed, but the factor matrices have r times (k plus n) entries, which is dramatically smaller. So we have a huge reduction from k times n to r times (k plus n), from 8 billion to maybe tens of millions of parameters. If we have 100 million ratings, now the problem almost looks overdetermined, and we just have to figure out how we actually infer a low rank matrix. Of course low rank matrices are just everywhere. They're not just in these data matrices on the internet. Rank is a really useful and powerful way of summarizing what is simple about some model or system. For example, if I want to do some sensor network embedding problem, and I look at the matrix of all of the inner products between points, the Gram matrix, that will have rank equal to two if they are lying on the floor, or three if they are in 3-D space. And in multitask learning, this has actually been a very popular way of saying that classifiers for, let's say, the same digit written by different people should somehow have a low rank structure as well, and this has been very powerful in trying to train lots of SVMs at once. Where I kind of got into this problem to begin with, I mean I always say it's about the Netflix problem, but really it's because I was hanging out at Caltech with all these controls guys, and in controls, rank is everywhere, because it's kind of the way of summarizing the state of a system. Essentially there is a special matrix called a Hankel matrix, and the rank of that matrix tells you how many numbers I need to predict the future given the past. I could throw away all my past measurements and just keep a number of parameters equal to the rank of that matrix.
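[Editor's note: the parameter counting above can be sketched in a few lines. The movie and user counts are the rough Netflix-scale figures from the talk; the rank r = 40 is an illustrative assumption, not a figure from the talk.]

```python
# Parameter counting for a rank-r factorization X = L @ R.T.
# k movies, n users; r is an assumed (illustrative) rank.
k, n, r = 20_000, 500_000, 40

full_entries = k * n          # entries in the full ratings matrix X
factor_params = r * (k + n)   # entries in the factors L (k x r) and R (n x r)

print(full_entries)    # billions of entries
print(factor_params)   # tens of millions of parameters
```

With 100 million observed ratings, the factored problem really does look closer to overdetermined than the raw matrix does.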
So these are all nice ways of summarizing simplicity, and of course, it would be nice to be able to find the lowest rank solution over some kind of set, and so this is our canonical inverse problem. We have A x equals b. I am going to tell you that x has low rank and now we just want to solve that. So this problem, even the simplest version of the rank minimization problem, is actually very hard; it's NP hard, and, well, it depends on who you ask. If you talk to the computer science theory people, that might be a good enough excuse to go home, because not only is it hard, but it's also hard to approximate, so there is not much hope for approximation algorithms. But of course a million dollars is on the table, and if someone tells you the problem is NP hard, what are you going to do? We're just going to try something anyway, going to try some heuristic. Yes? >>: [inaudible] approximation [inaudible]? >> Ben Recht: I think logarithmic is the best you can hope for; logarithmic in the dimension of the matrix is the best that you can hope for, especially once you add noise. There are lots of different ways to get at it, because you're basically solving quadratic equations, so that could be good and that could be bad, but I think that's the best guess. Moreover, you actually don't care about estimating the rank exactly; you actually care about that decision variable. So what is a good heuristic? What is a reasonable heuristic? I'm proposing that this thing I've written here is a reasonable heuristic, which is to minimize a norm called the nuclear norm, subject to the data. I'll tell you a little bit more about what the nuclear norm is in a second. Now I say that that's reasonable because if we squint a little bit, and again I'm going to come back to it, here is the algorithm for solving that problem.
What we do is we factor the matrix X as L times R transpose, and I just pick some initial starting point, and then I pick one of the entries that's been given to me in the Netflix prize and I compute this residual e, which is equal to (L R transpose) at entry (u, v) minus M at (u, v). Then I'm just going to update my factors according to this rule for that one entry and then repeat, so essentially this is running a stochastic gradient algorithm on this minimization problem. Basically the idea is that this entry is purely determined by the product of this row and this column. So when you go and do that, it turns out that you can get a 6 1/2% improvement on the Netflix prize problem, and the guys who won ended up using this as one of the baseline algorithms that they combined into their mega-classifier. Actually this gradient descent algorithm, when it appeared, wasn't actually called minimizing the nuclear norm. It was just called the SVD heuristic, and it was the one that appeared on this fellow Simon Funk's LiveJournal page, I think about three months after the thing started. So it turned out that he actually was minimizing the nuclear norm of the matrix with this data. Now the question is why is that a reasonable thing to do. Yes? >>: [inaudible] letters flying around. >> Ben Recht: Yeah, okay, here. >>: So X is equal to L times R transpose? >> Ben Recht: Uh-huh. Oh, sorry, did I leave that off of here? Yes. X is equal to L times R transpose. The thing you're searching for is this guy. We basically made an approximation here that X equals L times R transpose. Once you do that, then saying phi of X is equal to b is just saying that a measurement of X is equal to b, which is the same as saying we want the entry here to correspond to one of the entries over here, so it's just filling… >>: [inaudible] is like the mask. >> Ben Recht: Yes, it's just saying I am going to give you one of the entries. Correct. >>: [inaudible] u and v. >> Ben Recht: Where is u and v?
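[Editor's note: in code, the stochastic gradient update just described looks roughly like the following sketch. The toy matrix, step size, and regularization weight are illustrative choices, not the speaker's exact settings; M plays the role of the true ratings matrix and L @ R.T is the estimate.]

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, r = 6, 5, 2
# A true low-rank matrix M, of which we observe a random subset of entries.
M = rng.standard_normal((k, r)) @ rng.standard_normal((r, n))
observed = [(u, v) for u in range(k) for v in range(n) if rng.random() < 0.7]

L = 0.1 * rng.standard_normal((k, r))
R = 0.1 * rng.standard_normal((n, r))
step, lam = 0.05, 1e-3   # learning rate and penalty weight (hedged choices)

for epoch in range(500):
    for (u, v) in observed:
        e = L[u] @ R[v] - M[u, v]      # residual at the one observed entry
        gu = e * R[v] + lam * L[u]     # gradient w.r.t. the row factor
        gv = e * L[u] + lam * R[v]     # gradient w.r.t. the column factor
        L[u] -= step * gu
        R[v] -= step * gv

err = max(abs(L[u] @ R[v] - M[u, v]) for (u, v) in observed)
print(err)  # residual on the observed entries after training
```

The small quadratic penalty on the factors is what will later connect this heuristic to the nuclear norm.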
Oh, it's just a component. >>: [inaudible] subscript? >> Ben Recht: A subscript, yeah. >>: It's like the one that is the mask. The [inaudible]. >> Ben Recht: Yeah. Those are my indices; I just pick one of them. >>: So X and M are synonymous? >> Ben Recht: No. M is the true thing. X is the… >>: M is true, oh. >> Ben Recht: Yeah. >>: [inaudible]. >> Ben Recht: Yes. That's right. >>: [inaudible] estimate of… >> Ben Recht: That's right. >>: Okay. >> Ben Recht: That's right. This will be much clearer when I get to how you actually show that these two things are the same, but I'll get to that in a little bit. I just wanted to… >>: [inaudible]? >> Ben Recht: Huh? >>: Where is the nuclear norm? >> Ben Recht: Where is the nuclear norm here? >>: Yeah, because… >> Ben Recht: Hold that thought. Yeah, hold that thought. We are going to come back to that [laughter]. That's good. No, I actually am going to come back to that, so just hold that thought. Let me just do two more quick examples where I want to give a slightly bigger picture and then we will come back to these examples. So the other, oh, right. John asked me to talk maybe a little bit about some of the bio nano something or other projects I might have been working on, and I did want to mention this one, where with some collaborators at Stanford we had been looking at trying to find biomarkers for brain cancer. So just to set this up, brain cancer is obviously a terrible, terrible disease, and it's particularly terrible in terms of cancer because we have no way of detecting it before you are symptomatic. I mean, you could probably see it on imaging, but typically we don't have any good ways of doing prescreening for brain cancer.
There's no sense of lumps or polyps, and so the hope, and actually most of the hope in cancer detection and in cancer treatment, is to say that if we get it early then it is the most treatable. So I've been working with these doctors, and they're actually trying to do this model in rats to see what they can see early, pre-imaging, pre-symptoms. So they have this procedure where they treat a pregnant rat with a carcinogen. The offspring are born completely indistinguishable from wild type offspring, and then invariably, in this green region, they start to develop cancer, and you can't see it on imaging, but they can see it because they kill them and then they slice their brains up like prosciutto and they see these little nests. So there are polyps there. There are these kinds of pre-cancers in the brain. In this yellow region you can actually see them on imaging. In the red region they typically die. The question is can we actually find something without imaging in this green region. The way that they're going to go about it is they extract spinal fluid and extract blood and they run it through a mass spectrometer, and they get stuff out that looks like this, so it's sort of a canonical biomarker mining problem. >>: You do this when they are little and then you wait and see if they eventually die of brain cancer? >>: [inaudible] axis? >> Ben Recht: The axis is mass over charge. >>: Mass over charge? >> Ben Recht: Mass over charge, and then amplitude on the y axis, so kind of our classic mass spectrometry, yeah, yeah. >>: So what's the experiment? Like they do this to all the little rats… >> Ben Recht: So this is one of those interesting things when we interact with biologists. We would like to do time series analysis. The way they do time series analysis is they run multiple trials and then take samples where they kill the subject [laughter] at some time in that slice.
So invariably when they are doing that spinal fluid thing they also kill it and slice the brain up, so they do it at multiple slices, so actually this experiment is 50 rats. It turns out that rats are terribly expensive, so that actually took a long time and cost a lot of money, but in the green region, the yellow region and the red region they would select some rat from the pool, extract the spinal fluid, put it in the MRI and then kill it [laughter]. >>: I see, so they look in the green region to see if they have these nests and then you take that as a positive? >> Ben Recht: That is correct. But what's crazy about their model, and this is actually one of these interesting things about biologists: rats are kind of genetically diverse as compared to mice; that's another thing you learn. Mice are all exactly the same. They are just like clones. Rats are a little bit more genetically diverse. Invariably this treatment that they give the mother causes cancer. I mean if they give it to those rats, those rats will… >>: So they are… >> Ben Recht: It's almost 100 percent positive. If they've been treated, they get cancer. >>: They can figure out how to give cancer, but they don't know how… >> Ben Recht: They're very good at giving cancer. They are very bad at treating it [laughter]. >>: So it's deterministic. It's not like you have to worry about what will happen. They will get it or they won't get it. >> Ben Recht: That's right, and that's why they are so fond of the model. I think this also happens in medical research all the time. If you can find something that is deterministic, you latch onto it, because so many other things are variable, including these mass spec lines. >>: So they just [inaudible]? >> Ben Recht: Yeah. >>: And I don't do [inaudible] trying to find the [inaudible]?
>> Ben Recht: Well, the problem is which piece do we look at; that is kind of the problem that they… >>: [inaudible]. >> Ben Recht: Huh? >>: [inaudible] died… >> Ben Recht: Yes, exactly. But when they are doing this mass spectrometry, the question just becomes which of these peaks am I going to extract, because the other problem is that not only are the rats expensive, but lab techs are also [laughter] somewhat expensive for these guys. But you are right. I think the question is, you could try to go after all of the peaks, but really the problem here is that the number of peaks is way bigger than the number of rats we are ever going to have, so you're going to have hundreds of peaks, and 50 rats took 3 1/2 years, which is crazy. The whole thing, when I say these are days, I'm sorry, I should've said that: the axis units are days, so that's three to six months to grow a rat to death, so these things take a while, and you can't have too many of them, and you have to take care of them. And so what are we going to do? We are again in the same situation that we were in before, where we have far fewer examples than we have parameters, unless, unless, only a couple of those peaks really matter. So what we actually ended up looking at was the standard L1 minimization for pulling out the markers. I think most people at this point of the game have seen that this seems to be a popular way to extract sparse signals when you have some data and, whoops. What happened here? Sorry. And of course there has been all of this hullabaloo about compressed sensing for a similar reason, a very related reason, which is that when we actually want to acquire data, perhaps we can use the fact that the model is sparse to reduce the number of measurements that we have to take. Yeah? >>: So you said you also have a control of healthy rats? >> Ben Recht: Yes.
I'm sorry, so you have two, yeah. The control group takes just as long to raise, unfortunately [laughter], so there's really not a better situation; you do the exact same experiment. >>: You can't reuse them. >>: You can't reuse them? >> Ben Recht: You can't reuse them, and the worst part is everything you do to these rats kills them, so you can't even do longitudinal studies. They're just hoping that everything happens the same. You either kill it at 30 or 60 or 90 or 120. >>: They can't do an MRI or something? >> Ben Recht: No, they do do MRI, but then if you want to do histology, which is this prosciutto-izing of the brain, there is not much you can do. Isn't that what it is? [laughter]. >>: Neat verb. >> Ben Recht: [laughter], sorry. I'm getting off track. Let me just jump through this really quickly. So again, this is the same problem we had before. Even if we just want to solve a linear system and find the sparsest solution, that's hard. That's this cardinality minimization problem, and that's hard. But now we actually have a whole field dedicated to minimizing the L1 norm subject to some constraints, and it now has this name for it, compressed sensing. So we have two examples where we pick a norm and it seems to do really well on several different problems. Of course, and I'm going to skip that last slide because you certainly don't need to know this, machine learning is kind of the same thing, right? Machine learning is the same thing.
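[Editor's note: the L1 heuristic just mentioned really is an off-the-shelf computation. Here is a sketch that writes min ||x||_1 subject to Phi x = y as a linear program; the problem sizes and the Gaussian measurement matrix are illustrative assumptions.]

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, s = 40, 20, 3                       # ambient dim, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
Phi = rng.standard_normal((m, n))
y = Phi @ x_true

# Variables z = [x; t]; minimize sum(t) subject to -t <= x <= t, Phi x = y.
c = np.concatenate([np.zeros(n), np.ones(n)])
A_ub = np.block([[np.eye(n), -np.eye(n)],     # x - t <= 0
                 [-np.eye(n), -np.eye(n)]])   # -x - t <= 0
b_ub = np.zeros(2 * n)
A_eq = np.hstack([Phi, np.zeros((m, n))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
              bounds=[(None, None)] * (2 * n))
x_hat = res.x[:n]
print(np.linalg.norm(x_hat - x_true))  # recovery error, typically tiny here
```

With m comfortably above the sparsity level, the minimum L1 solution typically coincides with the sparse one, which is the picture the talk returns to below.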
In the discriminative version of machine learning we have some set of features and then we have some kind of model that we want to fit, and we're going to fit some fitness function, which could again be a least squares or a linear system, and then we have to pick some set of functions to optimize over. And we are kind of in the same situation that we were in before: the number of samples, or the sample complexity, is somehow dictated by a smoothness parameter of this underlying space of functions that we are trying to operate on, and if the function that we are trying to find in our hypothesis space is smooth, then we are going to be able to find it. So again, there are a lot of numbers here: n would be your number of samples, d would be the dimension of the thing that we are looking for, and s would be some parameter which is called the smoothness parameter. If that smoothness parameter is large enough we will actually need substantially fewer samples than the [inaudible] cardinality would indicate. So this is kind of one of those classic arguments that we get in neural networks, kind of the same situation that we've been in before. In all these cases, and I'm going by that fast so we don't dwell on it too long, because I know Leon's going to bug me, in all these cases we have [laughter] complex systems. We want to generate some predictions, and the only way to make sense of that is to leverage some structure. So here is the problem I'm really going to focus down on, this one simple problem, for the rest of today. I want to find a solution of y equals Phi times x, and this is underdetermined, so m is less than n, and let's just assume, to make ourselves happy, that there is a solution. There's one solution. Once there is one solution, we have an infinite number, and if there is an infinite number, which one do we pick?
So as I said, what we want to do is leverage some notion of baseline structure, and that could be sparsity. That could be rank. That could be smoothness. That could be some kind of symmetry. I'll tell you a few more examples of things that we can leverage, and the question is: is there a way to just do this with a crank? Meaning that you tell me which of these boxes you have, which structure is present, and then I'm just going to give you a reasonable algorithm, some prototype algorithm that maybe we could just solve right off the shelf without having to think too hard, and where I can actually predict for you what your sample complexity should be, and where I can give you bounds on the error. So everybody is okay with at least that set up, rather than these motivating examples? So let me just go through and do this with cartoons; all of these previous results that I went through in a haphazard fashion were derived this way. So the first one is just: where does this L1 norm stuff come from? Here are our one sparse vectors of Euclidean norm one. If we draw in the convex hull we get a norm. It's the unit ball of the L1 norm. And if I think about what it means to minimize the L1 norm subject to some equations, well, basically I have my space Phi x equals y; that's my set of equations. And if I want the minimum L1 norm solution, what I will do is I will take the L1 ball and I will inflate it until it hits. Lo and behold, it hits on one of these corners, which happens to be a sparse solution in this case. And this picture is pretty much directly stolen from the first compressed sensing paper by Candes, Romberg and Tao. This is exactly the motivation for why that should be a reasonable model to use to find a sparse solution on some affine set. Now rank; now I'm going to get back to [inaudible]. Here are my 2 x 2 matrices, plotted in 3-D.
So again, I can only do these little cartoons, so this is about the biggest matrix problem we can look at with a really good picture. So we have x, y, z. The set of all rank one matrices that have unit Euclidean norm are these two circles. If I shade in this convex hull, that turns out to also be the unit ball of a norm. In this case it is the sum of the singular values of the matrix. And again, we might expect, just by the same shape, that if we were going to go and try to minimize this norm, which is called the nuclear norm, subject to some linear constraints, we'll hit on the boundary of this ball somewhere, some dilated copy of the ball, and those boundaries are the places where we have low rank solutions. This was kind of the basis for how we analyzed this in our papers on matrix completion and rank minimization. Again, it's kind of the same picture in that case. Now, I could have jumped ahead, but I'm just going to do it right now, to say how does this relate to that stochastic gradient heuristic for solving the Netflix prize problem? Let's just do a quick run through again of this type of problem. I have the nuclear norm of X, which is equal to the sum of the singular values of this matrix, and then I have my phi of X equals y as my generic linear inverse problem. Now let me parameterize X as L times R transpose. If X has a singular value decomposition U Sigma V star, I'm going to pick a particular factorization, L equals U Sigma to the one half, R equals V Sigma to the one half. That is a low rank parameterization of the true solution, or of the decision variable.
Let me plug that back in, plug in L and R where X was, and it turns out you're left with one half of the squared Frobenius norm of L plus the squared Frobenius norm of R, because, going back to our linear algebra days, X is an orthogonal matrix times a diagonal matrix times an orthogonal matrix, so if I look at the sum of the squares of the entries of L, that's just the sum of the squares of Sigma to the one half, which is just the sum of Sigma, which is the sum of the singular values; same for R. So now it's much simpler, and then if I just take this constraint and add a, sorry, a Lagrangian penalty here, we get your standard regularized version. We have a sum of squares. We have a penalty in L2, and now that's nice and smooth and you can run gradient descent on it. >>: So are you adding [inaudible] both L and R for stability reasons? Because both of them yield the same thing. >> Ben Recht: In the cost function? Yeah, as opposed to just letting R be free? Good question, that's a good question, because it would be symmetric, because they are equal at the optimal value, so maybe you just have to penalize L. I haven't tried it. It would be worth it; you just take one out and see if you can regularize only one of the factors. But we just do it for symmetry, really, to not preference one or the other. So this heuristic here, this thing that was the successful and common sense thing to do, is actually solving this nuclear norm problem. >>: Without realizing it. >> Ben Recht: Oh yeah, and we didn't realize it either for quite some time [laughter]. Hindsight always being 20/20, of course, they were always doing the right thing, a very principled way of solving these things. I guess that's kind of the point of what I want to get to: you do the right thing most of the time without realizing it. Is there a way, if you have something that you don't know how to do, to have this crank to turn? Yes? >>: [inaudible] question.
>> Ben Recht: Yeah, go, go. >>: The multiplier usually has… >> Ben Recht: Lagrangian terms. >>: Well, there is the Lagrangian term and then there's a stabilizer term. You've written down a stabilizer term. >> Ben Recht: Yeah. >>: This is the penalty method; [inaudible] this is really what they are minimizing. It's not, [inaudible]. >> Ben Recht: Yeah, that's fair, that's fair, that's fair. >>: So there's no lambda times, and then [inaudible] squared, and you leave wiggle room with lambda… >> Ben Recht: It actually works better if you put that on there. >>: It works better? >> Ben Recht: If you add the Lagrangian term. I mean, you are right. You could have a Lagrange multiplier. >>: Yes. >> Ben Recht: And that does work better. >>: So this is just the penalty method? >> Ben Recht: They are doing the penalty method. Our code actually uses the Lagrangian. >>: Yeah, okay. >> Ben Recht: That is fair. So let me give another example. I could keep you in these examples all day. What happens if I want to solve an integer programming problem, but now it's a weird integer programming problem? I'm just saying I have an inverse problem I'd like to solve, and I know the solution is all plus or minus ones. So maybe it's a multi-knapsack type problem. In that case, again, the corners here are the integer solutions. I shade in the convex hull. I get a ball, the unit ball of the infinity norm in this case, and again, I have the exact same picture as before. Blow that thing up and it hits the affine subspace, on the corner in this case. Again, you analyze this structure and you get the results of Donoho and Tanner, and Mangasarian and myself. We were analyzing this for the case of the multi-knapsack problem. So in all of these cases what do we have? We have a model, this X thing that we would like to fit.
We decided that we have some reasonable notion of simple models, and we would like to construct X as a short sum of those simple models. The goal is just: how do we actually find the shortest decomposition? In all three of these cases, now, it might not have been clear that what we were doing was minimizing the sum of the absolute values of the coefficients subject to this decomposition holding, because that's all it means to be blowing up the convex hull. You're just trying to find the smallest sum of coefficients that touches the affine space that we are trying to search through. Now this looks like it is just the L1 norm, but the only difference is that this set of atoms need not be a discrete set. In the case of the nuclear norm unit ball, the set of atoms were those circles. That is a manifold, or a union of manifolds. And really these atoms can be anything. Let me just give you a couple more examples where these atoms might not be quite as clearly just a basis. So for example, union of subspace models have been very popular in a variety of contexts lately. In machine learning they've been used mostly for things like filling in image patches, where you have some kind of hierarchical structure that you would like to build upon. Francis Bach's group has been doing a lot of stuff on that. In wavelet models you kind of have these subspaces where you know that basically if one of the leaves is active in a wavelet tree, the whole path up, all of the ancestors, also should be active in some kind of wavelet decomposition. In multipath problems where you are trying to do de-noising, basically you have some known signal and then you have copies of that known signal, and you have just a union of a lot of one-dimensional subspaces. So in this case we just have a bunch of subspaces.
The atoms are the unit balls in each of the subspaces; take their union, and in this case it is not a discrete set, but if you look at the recent work by Francis Bach, the algorithm that they use does end up actually corresponding to that atomic norm I wrote on the previous page, where you just do this blowing up of the convex hull. >>: This is [inaudible]. >> Ben Recht: Not that stuff, not that stuff. As people have noticed, there are 100 algorithms for unions of subspaces. The one that we are talking about, oh, can I write on the board? Okay, good. Maybe everybody will be able to see over here. The one that we are talking about says X is equal to the sum of v_g, where each v_g lives in one of the subspaces, summed over the subspaces, and you take the infimum of the sum of the absolute values, I'm sorry, the L2 norms of these guys, subject to that equation holding, and this is the norm of X. This was proposed by Bach, Obozinski and Bahr, I think, was the paper that did this one. Francis has like eight different ways of doing unions of subspaces. That's the one that corresponds to the one that we studied. I'll tell you later that this is the one that we can actually analyze really well, so we get substantially better bounds than what you can get, well, than what I've seen with all of the other models, for what it's worth. I did flash this one up. This has been more and more important since I moved to Wisconsin. Everybody wants to talk about how their football team is better than everybody else's football team, and so in this case we have some kind of measurements of some permutation matrices and we would like to find maybe good mixtures of rankings. And in this case the atoms would be permutation matrices, and if I look at the convex hull of those guys, it's something called the Birkhoff polytope, which you can optimize over quite efficiently, and actually more efficiently than treating each permutation matrix as a basis element.
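[Editor's note: for disjoint groups the board formula above simplifies, since the infimum over decompositions is attained by restricting x to each group, so the norm is just the sum of per-group L2 norms. A sketch; the vector and the grouping are made up, and overlapping groups would need a genuine minimization over decompositions, not shown here.]

```python
import numpy as np

def group_norm(x, groups):
    # sum_g || x restricted to group g ||_2, valid for disjoint groups
    return sum(np.linalg.norm(x[list(g)]) for g in groups)

x = np.array([3.0, 4.0, 0.0, 5.0, 12.0])
groups = [(0, 1), (2,), (3, 4)]
print(group_norm(x, groups))   # 5 + 0 + 13 = 18
```

Minimizing this quantity subject to data encourages whole groups of coefficients to switch off together, which is exactly the structure the union-of-subspaces model is after.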
And then, I mean, I can keep going. There are lots of other examples, whether they be moment problems or problems involving matrices or problems involving tensors. In all of these cases we can follow our nose and see that we can solve these atomic norm problems, but they are not L1 minimization. In the case of moment problems, which come up in system identification and numerical integration and a lot of other things, you end up with semidefinite programs. In the case of the tensor stuff you're going to get weird alternating least squares things. Everything is hard in tensors. With cut matrices, for people who are familiar with Nati Srebro's work, you end up with something called the max norm, which is an approximation, because the norm that is induced by cut matrices is quite hard. A cut matrix is simply a matrix that is rank one and all plus or minus one, so you can't actually optimize over that efficiently, but you can approximate it very well using something called the max norm. So in all of these cases we actually have to use different technology, but we can do it, and again we have a crank that we follow once we understand that these are the simple models and then we want to minimize this convex hull norm. Indeed we have this funny name for it, atomic norms, because I spent too many years working on nuclear norms and I figured this would be the next one. Then molecular norms will be after that, right [laughter]. And then we go on from there. >>: Then we'll go down to quarks and… >> Ben Recht: Well, that would be the other way; we were at nucleus, so we can go the opposite way too. >>: That would be great. >> Ben Recht: The atomic norm is just the thing that I've been telling you the entire time, and it's written as the norm of X with respect to A, which is the same notion I have on the board here. I have a basic set of atoms, and what I want to do is define this function, the norm of X with respect to A, which is just going to be the smallest t such that if I blow the convex hull up by t, it hits X. 
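For a finite atom set, that "smallest t" definition can be computed directly as a small linear program: minimize the sum of absolute coefficients subject to the decomposition holding. A sketch, assuming scipy is available; the sign split plays the role of including the negated atoms, and with the standard basis as atoms the answer should just be the L1 norm:

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm(atoms, x):
    # min sum_i |c_i|  s.t.  sum_i c_i a_i = x, written as an LP by
    # splitting each coefficient into positive and negative parts
    # c = c_plus - c_minus, both nonnegative.
    A = np.column_stack(atoms)            # each column is one atom
    k = A.shape[1]
    cost = np.ones(2 * k)                 # objective 1'(c_plus + c_minus)
    res = linprog(cost, A_eq=np.hstack([A, -A]), b_eq=x,
                  bounds=[(0, None)] * (2 * k))
    return res.fun

atoms = [np.eye(3)[:, i] for i in range(3)]  # standard basis vectors
x = np.array([1.0, -2.0, 0.5])
print(atomic_norm(atoms, x))  # matches ||x||_1 = 3.5
```

Swapping in a different finite atom list changes the geometry but not the code; an infinite atom set is exactly where this LP stops being available and the specialized machinery discussed in the talk takes over.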
If I blow up the convex hull, it hits X; that's the atomic norm. It's also called the gauge function for people who had to suffer through some convex analysis or functional analysis; that's where you may have seen it. If A has a couple of nice properties, including that it's full dimensional and centrosymmetric, you end up with a norm, which is this L1-norm-looking thing. Again, this A set could be infinite though. So here is my prototype algorithm, at least the thing that I've been promising you for far too long. The prototype algorithm is minimizing this atomic norm subject to the data that I've been given. That seems reasonable, and the question is when does this work, and the second question is how do we solve it? And these are two reasonable things to ask. So we have this convex hull norm. All you have to think of in your head is that we have that same picture as before. We pick our atoms, we take their convex hull. We blow up the convex hull until it hits the affine set, and when does that work? Yes. >>: The constraints you have [inaudible] inequality constraint and so you assume that your data is perfect or how do you… >> Ben Recht: For now. We'll get to, we can, the analysis is slightly different when the data is not perfect, but let's at least start with perfect and then we'll add noise later. The problem is once you have noise then there are lots of ways to add it, so we'll talk about two different ways, but let's just start at least with the simple thing where--because even there, it's still not clear that we can do this, or that it will work. >>: [inaudible]? >>: Yeah, I was going to ask that. What [inaudible] metrics? >> Ben Recht: That if I have an atom, minus that atom is in there. >>: [inaudible]. >> Ben Recht: So the two things that you need are that the convex hull contains a full dimensional, like a little small full dimensional Euclidean ball. 
That's one thing you need to get a norm, because otherwise it will be flat and you will have these points with infinite norm, so it has to be able to span the entire space, and the second thing that you need is that if A is in there then minus A is in there. Most of the time we don't even need that it's a norm. We are really just using the gauge function property. We're just using the blowing up of the convex hull, but it is a norm, and it has this kind of L1-norm-looking form if it's centrosymmetric. >>: [inaudible] combination is nice because it, you get the [inaudible]. >> Ben Recht: Right. So if you are happy to just add in the negatives of your atoms, for every atom, just add in the negative ones, like if you have a permutation matrix, you add in the minus permutation matrix; you could do that, if that makes sense. >>: So your proofs are going to use that it is actually a norm? >> Ben Recht: No. It's only going to use that gauge function property. What's nice about this analysis is the proof is kind of universal and pretty simple. So we need to get a little bit of math, but it's not so bad. We're going to define something called a tangent cone. Imagine this is my ball. This is my convex hull and X is the point I'm looking for, and I'm promised it's out there. What I'm going to do is define this thing called a tangent cone, which is a fancy word that if you read Tyrrell Rockafellar's convex analysis book, that's where you will see it, but really all it is is all of the points, all of the directions, that make the norm smaller, and that is a cone that goes off to infinity, so it's all of the directions that make the norm smaller. So looking at this problem, when is X the unique optimal solution of this minimization problem? Well, basically, X is feasible, so X starts off being feasible, and then if I want to stay feasible I have to move along the null space of phi. 
I have to move in the null space of phi, and basically X is going to be the smallest norm solution if any direction I move in that null space increases the norm. This is all very tautological, and so basically this is the tautological statement: X is the unique minimizer if the intersection of the tangent cone with the null space of phi equals zero. So you were asking about whether I need it to be a norm, and I don't, right? I actually don't even need it to be the atomic norm. This is true for any function. I define all of the directions that make the function smaller, and then X is the unique minimizer if the intersection of the cone with the null space equals zero, assuming f is convex; that would be the only thing that we need. So it's kind of this tautological question, and now it just reduces everything to: when does a subspace intersect some cone only at zero? So in order to characterize that, to characterize it somewhat generically, we use something called the mean width, which I think is a really cool geometric idea that I didn't know about until I started working on this problem. So we all kind of know what the volume is. It turns out that if I take an object in d dimensions and I multiply it by t, then the volume is going to increase by t to the d. The mean width is the same thing except it's the measure that increases by t. If I multiply the object by t, it increases by t. So who knew that existed? The way you define it is you use a little bit of optimization mumbo-jumbo: the support function is just the maximum value of the inner product of d with X, where X is restricted to be in our set. So I start with a direction and I just maximize, and if I look at the negative direction, it's going to go over here, and that thing is the width of the set along the direction d, so if I project along d, the width is this thing plus its value in the negative direction. 
Now let's integrate that thing over the sphere, and that's called the mean width, and it's a kind of volume; it's actually what's called an intrinsic volume. I'm going to look at that cone, and rather than measuring its volume directly, I'm going to measure its mean width, and the mean width is totally going to dictate how many measurements I'm going to need to identify models. That is due to this theorem… Yes, please? >>: Can you give some indication as to what kind of shapes will have large or small mean width? Is it when they are kind of sausage-like they will have large mean width, and if they are isotropic then they will have small mean widths? >> Ben Recht: No, if they are isotropic they have high mean width, because we average the width over all directions, and if they are like a line, they will have a very small mean width. >>: [inaudible] which one? >> Ben Recht: We're going to average them all. >>: Average them all? >> Ben Recht: Yes. >>: [inaudible]? There is a direction where you always reduce the norm. If you go towards the origin. >> Ben Recht: That is correct. So how many directions--there is always one direction where you go towards the origin. That's one direction. How many directions are there? Because what I want to do is basically come up with conditions under which, if I pick let's say a random subspace, because of course we need randomness as a little bit of a crutch here, but I pick some generic subspace, then what is the probability that that direction back to the origin lies in that generic subspace? If you only have one direction the probability is going to be incredibly small, and if you have a very narrow set of directions it's going to be very small too. >>: [inaudible]? >> Ben Recht: Huh? >>: The width along that direction is [inaudible]? >> Ben Recht: Of the cone? >>: The cone… >> Ben Recht: Yes. But will the cone intersect the unit sphere? No. >>: Are you looking at the intersection of the cone and the [inaudible]? >> Ben Recht: That is what I am going to do. 
That is what I'm going to do, yes. >>: One other question? >> Ben Recht: Yes? >>: Is there a way to think about [inaudible] easy versus hard when they do [inaudible]? >> Ben Recht: No. Actually I was going to get to that. It's always hard. It's awful, but we have good ways to bound it. It's really a nasty integral, but we have good ways to bound it, and that is kind of the heart of what we do. Yes? >>: I am trying to just get some intuition about this [inaudible] definition of width. >> Ben Recht: Let me go back. >>: So how is it related to kind of the relationship between the bounded [inaudible] and the bounded in the convex hull? >> Ben Recht: Oh, so if you take the size of the John ellipsoid? Is that what you're saying? >>: The John ellipsoid is always bounded by the square root of the… >> Ben Recht: Yeah. >>: But if you take [inaudible] as opposed to ellipsoids, so now you don't have the nice bound of the John ellipsoid, so it's kind of… >> Ben Recht: Yeah, yeah. >>: So it's kind of… >> Ben Recht: You take the smallest enclosing sphere and then you blow up the biggest inscribed sphere. >>: Yeah. >> Ben Recht: I don't know. Do we know how to characterize that number? >>: You know, it's [inaudible]. >> Ben Recht: It's a number. It's a number. >>: [inaudible] asking you… >> Ben Recht: It's a number. Yeah, yeah, no, I don't know. That's a cool question. >>: So this mean width, in learning theory we just call it the Gaussian complexity. >> Ben Recht: Uh oh, yes, that's right [laughter]. That's right. I was going to get there. >>: Oh, you were going to come… >> Ben Recht: Yeah, yeah. I was going to get there in the next slide, but that is exactly right. So in learning theory we call it the Gaussian complexity; that's right. And why is that? It's a little complicated. 
So here we are doing the integral over the sphere, and just imagine, if I add the length of the Gaussian in front, then instead of the integral over the sphere I can have this width be the average over all Gaussian directions. That is the Gaussian complexity exactly, and it turns out, and I don't think it's completely obvious, that up to a constant there is only one such function, and it's called an intrinsic volume, so if it's essentially isotropic over all directions, it's either the Gaussian complexity or it's the mean width. If you feel like you have a good intuition for Gaussian complexity, that's all this is. I'm not sure I have a good intuition for Gaussian complexity either. It's just that I kind of know how to play with it. I know how to bound it [laughter]. And we know that it's related to all of these facts, like it's related to logs of covering numbers, so it has all of these kinds of natural geometric characterizations. I don't know about the one you talked about though, which is kind of cute, which is the ratio of the smallest enclosing sphere, not ellipsoid, to the largest inscribed sphere. >>: But that's the type of margin type of argument, so if you replace this with Gaussian complexity and then you ask what's the Gaussian complexity of, like, a margin-based linear classifier, that's exactly that, so that will give you the answer. Or maybe [inaudible] the bounds… >>: It's also a measurement of how isotropic or not isotropic your shape is, right? >> Ben Recht: Right. Exactly, and especially for cones, if we think about it, it's never going to be that isotropic, so we're going to take a cone and intersect it with a sphere, so first of all it's going to have this nice origin. This is nice and unique, and then we just have to say how wide, or how much volume, or how much kind of width are we spanning out here? 
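To make the wide-versus-narrow discussion concrete, the Gaussian complexity w(S) = E sup over x in S of the inner product of g with x is easy to Monte Carlo; the dimension, sample count, and the two test sets below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 2000
g = rng.standard_normal((trials, n))

# Isotropic set (the whole unit sphere): sup over ||x|| = 1 of <g, x>
# is just ||g||, so the width is E||g|| ~ sqrt(n).
w_sphere = np.mean(np.linalg.norm(g, axis=1))

# A single direction +/- e1 (a very "narrow" set): the width is
# E|g_1| = sqrt(2/pi), independent of the ambient dimension.
w_line = np.mean(np.abs(g[:, 0]))

print(w_sphere)  # close to sqrt(100) = 10
print(w_line)    # close to sqrt(2/pi), about 0.80
```

The isotropic set has width on the order of the square root of the dimension, while the one-direction set has constant width, which is exactly the contrast being discussed.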
And it turns out, again, as I was going to say, there is a really nice theorem by Gordon; this also follows from Slepian-Gordon, and you can kind of prove it in the same way if you know Slepian-Gordon. If you don't, just trust me on this. This is a nice theorem. If I have a random subspace and I ask what is the probability that the random subspace intersects some convex cone only at the origin, that probability is going to be very high if the co-dimension of the subspace is bigger than n times the mean width squared. Or, if you prefer, it's just the Gaussian complexity squared: n times the mean width squared is equal to the Gaussian complexity squared, if you are more comfortable with Gaussian complexity. And if we want to think about, let's go back to our inverse problem after this bit of a digression: for inverse problems, if phi is, say, just a random Gaussian matrix, and it doesn't have to be necessarily random Gaussian, there are other models, then the number of measurements that you need is going to be n times the squared width of this tangent cone thing intersected with the sphere. Now the question is how do we compute these things. Well, you go talk to Ofer; he's got tools [laughter]. You go ask people who know how to do things with Gaussian complexity. We have a trick; the method we proposed to bound these things in the paper uses a trick from convex duality that actually makes it pretty easy, but only for these problems that come from inverse problems. So if you note that the dual of the support function is the distance function, that's the trick that we use, and it turns out to make some of our computations pretty easy. 
Before I say what the consequence of those computations are, let me just say that the width also governs noisy recovery in some sense. So let's do one simple noise model; rather than doing a stochastic noise model, let's do a worst-case noise model. Let's assume I don't see phi X; I see phi X plus some disturbance, and I am going to bound the disturbance in L2. In that case maybe I would do the second-order cone problem, or maybe I would do a lasso-like problem or a penalty-type method problem, and in this case we actually have another bound, which says: let's say I take the optimal solution of this guy and I want to compare it to the true thing that I am looking for, so I am going to look at the norm of the difference, X minus X hat. Then that is less than two delta over epsilon. Remember delta is the norm of the disturbance and epsilon is this little fudge factor that we put in the denominator, so if you want to make epsilon big, as epsilon goes close to one this is going to blow up, but if we just pick epsilon equal to a half we are within a constant of where we were before, so we get robust recovery after a constant factor more samples than what you need to get exact recovery. We can analyze this with stochastic models as well, but I'm almost out of time, so we're not going to talk about that today. >> John Platt: You can keep going. >> Ben Recht: Well, you guys kick me out when I'm done. >>: So can we look at this--I mean I'm just having a little bit of difficulty processing this [inaudible] stuff because it just doesn't look like the standard way I'm used to. So like for one of these concrete, one of the things that are really a motivating problem, do you get the same… >> Ben Recht: Check this out. This is really cool. And we could go through this. If you guys want me to pull the screen up, I can derive all of them for you too if you want. They are actually very easy. 
This is actually crazy; we do this duality trick, and the duality trick is nothing more than saying that if I want to know what the width of the cone is, the width of the cone is actually equal to, or sorry, is upper bounded by the expected distance to what is called the polar cone. So this mean width is upper bounded by the distance to the polar cone. If I want to compute the expected value of a distance to some set, I can upper bound it by just picking a point in the set, and that's what we do. So we pick a point in the set, we come up with some ansatzes and try to be clever, and then with very simple arguments, for example, for the hypercube we get exactly the rate that is known. And this is a little weird rate. It actually makes you think that recovering corners of hypercubes actually takes a lot of measurements, and right, that makes sense, because just for me to transmit one to you in some efficient way I need n over two numbers. What I would do is I would tell you whether I'm giving you the positives or the negatives, and then I would tell you their locations, so even just to encode that I need about n over two numbers. >>: So the width [inaudible] the width scales with dimensionality and so you [inaudible] width squared you get stuff [inaudible] scales [inaudible] n cubed. >> Ben Recht: That's what's weird. The width scales like the square root of n. >>: The square root of n? >> Ben Recht: Yeah. It's kind of a bizarre thing. The width of the sphere is root n. It's just like the mean norm of a Gaussian, so it's the square root of n. >>: So you would expect a lot of these [inaudible] n squared. >> Ben Recht: No, it's n. >>: But isn't there an n times the width squared… >> Ben Recht: Sorry, sorry, sorry. I'm sorry. The n times the width, sorry, the--let's go back. Now I got all… This thing, right? 
n times the width squared is, basically, the biggest it is going to be is n. If I put the width of the sphere in, I get one times n. Sorry, sorry. I was putting those two things together. And remember--what were you thinking? >>: No, no. Confused. >> Ben Recht: Okay? Okay. I am too now. We are all confused. Does that make sense? >>: I was expecting a superlinear number of samples to come out of this because the width squared… >> Ben Recht: Let me just tell you the punch line. Let's go to the punch line. Sparse vectors: we do the same thing, and again, in a couple of lines; it's not hard. And again, if we want to stick around I can actually show you how to prove it for the sparse vectors. You get 2s log(n/s) plus 5s/4. Now why do I put in these constants? I put in these constants because too many people work in compressed sensing and they yell at me if the constants aren't there. These are the best [inaudible] constants I've seen that I know of for compressed sensing with Gaussian matrices. Even the Donoho and Tanner result, which actually gets this 2, is only in an asymptotic sense. This is not asymptotic, and to compute this you basically just have to remember how to integrate the Gaussian Q-function, which is something I always forget how to do. That's all we have to do is estimate the Q-function. >>: [inaudible] positive there's a structural solution? >> Ben Recht: Yeah, the model, yeah. That's correct. >>: [inaudible] assumptions. >> Ben Recht: Correct. We always assume that there is some structure. Similarly, if we have a union of subspace model, one of these ones that we were just discussing, and we're using this norm, and let's say there are m subspaces or m groups and I'm going to do group lasso, and the largest size of a group is going to be b and we have k active groups. The sample complexity grows like this, and it's not the most elegant expression, but actually if we think about limits it makes sense. 
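That 2s log(n/s) plus 5s/4 figure can be sanity-checked numerically with exactly the distance-to-the-polar-cone trick just described: for the L1 norm at an s-sparse point, the distance to the scaled subdifferential has a closed form, and minimizing its expectation over the scaling t estimates the width squared, which the closed-form expression upper bounds. The sizes and the scan grid below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, trials = 1000, 10, 400
g = rng.standard_normal((trials, n))  # fixed samples so the scan over t is smooth

def expected_dist_sq(t):
    # Squared distance from g to t * subdifferential of ||.||_1 at an
    # s-sparse x with positive support {0,...,s-1}: on the support the
    # subgradient is pinned at sign(x_i) = 1; off it, it is free in [-1, 1].
    on = (g[:, :s] - t) ** 2
    off = np.maximum(np.abs(g[:, s:]) - t, 0.0) ** 2
    return float(np.mean(on.sum(axis=1) + off.sum(axis=1)))

width_sq = min(expected_dist_sq(t) for t in np.linspace(1.0, 5.0, 41))
bound = 2 * s * np.log(n / s) + 1.25 * s
print(width_sq, bound)  # the Monte Carlo estimate sits below the analytic bound
```

The analytic expression comes from plugging in one particular choice of t rather than optimizing, so the Monte Carlo minimum comes out somewhat below it, as expected of an upper bound.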
In the limit where the block size is much smaller than the log of the number of groups, this is just k times 2 log m. This is what we have over here, up to the fact that we don't have the divide by k; but that's okay. And then in the limit where the block size is huge and there aren't that many groups, we just get k times b, and that is the number of parameters. We have that kind of interpolation between those two. Yeah? >>: Another question, so your [inaudible] it seems you make this assumption that [inaudible] matrices from a Gaussian random matrix… >> Ben Recht: Yeah, these are all for Gaussian random matrices. >>: Okay, so I mean in machine learning examples, for example the phi is given, so like the amount [inaudible] instead of compressing, so is this the [inaudible] price? >> Ben Recht: Yes. We don't get these numbers anymore; I mean obviously you don't get these numbers anymore, but we can actually handle that. The reason I was bringing this up again is to say that computing the mean width tells you what happens for Gaussian matrices, or essentially for generic matrices. To go into the details of what happens when you want to do a deterministic design, you have to do something else, but again, we can just use the same machinery and get at least a generic crank for how you compute those things. And in that case the rate is controlled both by how your phi matrix basically scales with respect to all of the models, all of the atoms, and by the Gaussian complexity of, basically, the complexity of the noise in the dual norm. But we can talk more about that at some point. I'm almost, I am out of time, but you guys just tell me when you want me to stop. >>: [inaudible] [laughter]. >> Ben Recht: I'll keep going. Just one more. This one is really cool. So I take low-rank matrices in Gaussian models, and so maybe not the realistic thing as [inaudible] was pointing out, but still we want to stay with the Gaussian model. Low-rank matrices, n1 by n2 matrices. 
We want to recover them using this nuclear norm heuristic thing. The rank is r, and the number of measurements you need is 3 times r times (n1 plus n2 minus r). Again, I'm being very specific about the form, and the reason I'm being very specific about the form is that this is the number of numbers you need to write down the singular value decomposition. This is the number of parameters in the singular value decomposition. You've just got to think about how many parameters there are in the two orthogonal matrices, and then you have the r singular values. >>: So you are saying [inaudible]. >> Ben Recht: Well, up to the three, so it is within a factor of three of, yeah, the best you could do, which is pretty cool. And again, you can go and get crazy with integrals, and we can actually do this just for generic cones, and this is actually why it's good to do the analysis for the Gaussian case: to see whether the scaling is reasonable in general. I think the answer is yes, because if I have a general cone, I have this polar cone C*. The polar cone is basically all the guys who have a negative dot product with the cone. That's all the polar cone is, so if you, yeah, see if you try, [laughter] that's exactly the picture; so if you have a cone, the polar cone kind of tends to look like this, kind of shoots off the back there. Essentially if the cone itself is narrow, the polar cone is going to be wide, and essentially we get something in terms of the surface volume of the polar cone intersecting the sphere. It looks a little wacky, but look at the corollary: let's say we have a polytope, and now instead of having some bizarre atomic set, I really just have vertices; I have some polytope. And I'm going to assume it's what's called vertex transitive, which basically means that it is just a symmetric object. 
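The parameter count being matched here is just the degrees of freedom of the thin SVD, which is worth writing out once; the sizes below are made up:

```python
def svd_param_count(n1, n2, r):
    # Degrees of freedom of a rank-r n1 x n2 matrix: r(n1 + n2 - r),
    # i.e. the parameters of its thin singular value decomposition.
    return r * (n1 + n2 - r)

def gaussian_measurement_bound(n1, n2, r):
    # The bound quoted in the talk: about 3x the parameter count
    # suffices for recovery from Gaussian measurements.
    return 3 * svd_param_count(n1, n2, r)

print(svd_param_count(100, 100, 5))             # 5 * 195 = 975
print(gaussian_measurement_bound(100, 100, 5))  # 2925
```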
If I look at all of the automorphisms of the polytope, all of the rotations that send it back to itself, I can send any vertex to any other vertex. So the permutation matrices are an example. The cut matrices are an example. In this case the number of measurements you need is like nine times the log of the number of vertices. I don't know if that nine is real, but it's nine times the log of the number of vertices. >>: [inaudible] cone [inaudible] sharp edges only if, pretty much only if you're [inaudible]? >> Ben Recht: So it's not quite true, because even those low-rank guys, you can get sharp cones. They are partially sharp and then partially smooth. You get these bizarre combinations of the two. >>: Okay. But if the set of [inaudible] is finite it's going to be like that? The convex [inaudible]. >> Ben Recht: Correct, correct. So what happens here? Keep going. For permutation matrices, n log n—huh? >>: [inaudible] converse is lucky. >> Ben Recht: The converse is lucky, exactly. >>: [inaudible] likely. >> Ben Recht: You think it's likely? >>: If you have an infinite set of [inaudible]… >> Ben Recht: That the cone will actually not be… >>: The cone is likely not to have only sharp edges. >> Ben Recht: Only sharp edges. >>: [inaudible]? >> Ben Recht: I think the converse is true, actually, now that I am thinking about it. Lynn will have to help me. If all of the normal cones are polyhedral cones, then the body is probably a polytope, right? I look at all of the vertices, and I take all of the extreme points, and I'm in. >>: Yes. But it doesn't take [inaudible]. [multiple speakers]. >>: [inaudible] finite. >>: [inaudible] subset within the… >> Ben Recht: Yes. Sure. Sure. >>: [inaudible]. >> Ben Recht: Sure. I see. >>: [inaudible] add vertex [inaudible]… >> Ben Recht: Yeah, once you add vertex transitive, then I think we are done. 
So this basically means that every atom can be mapped to every other atom just by some rigid transformation of the convex hull, so at that point they are just vertices. Sure. Cool. Oh, good. >>: Algorithms. >> Ben Recht: You're always trying to predict what's coming next. All right, that's it, algorithms. Let me… I have a lot of stuff on algorithms, but I'm just going to do a quick two slides of highlights, and then we can talk later, or if you guys want to hang out I can tell you more about what we're doing algorithm-wise. I did want to say that what is also nice about this framework is we have a way of predicting measurements. I didn't tell you how to deal with the stochastic noise in kind of a lasso model like this, but if you want to stick around I can tell you about how we do that too, and this is a more recent development. But what I want to say is: what is the algorithm? I already hinted at what the algorithm is. It's always the same. It's always the same, and it's basically this: you run that iteration. You run this fixed-point iteration, and fine, maybe we have to adapt the learning rate sometimes, and fine, maybe we are going to do this approximately, and fine, maybe we're not going to use all of phi but do stochastic samples, but in some sense it is always this. The easy part is this internal residual, which is just computing how bad my current prediction is compared to the thing I want to predict, and then I have an adjoint operator, which, if it's not least squares, will maybe be some kind of nonlinear thing that happens here, but it will be something along these lines. I have some step size. The only funky thing is this pi operator, which I'm just calling, generically, the shrinkage operator. This is really the only thing you have to change if you change the prior. >>: If you change the norm. >> Ben Recht: If you change the norm, you change your structure. 
And this one is the proximal operator of the norm, exactly. And that is the only thing you have to change. That is the thing that you have to understand how to solve, and this is nice for us as algorithm designers because we can forget about the other part for now, because we've all studied the other part pretty well, and we can now apply this stuff where you guys have looked at lasso problems. Anything that works for lasso problems should work here as long as we can compute the proximal operator. You've just got to be able to compute the proximal operator. Now of course, how do we do that? Either this thing will be easy to compute, in which case we're lucky. So in the case of L1 we know how to do it. In the case of the nuclear norm we know how to do it. In the case of that factored version of the nuclear norm we know how to do it, that weird [inaudible] type thing. Sometimes we don't know how to do it. In that case what we can rely on is relaxations of various kinds. So the first one is just to say: if I take the set that I start with and I put it inside a bigger set, that means that the bigger set's norm will be smaller than the norm I started with. That makes sense. Like the L1 norm is bigger than the L2 norm, for example. And so basically if I start with my atoms and I have some way of embedding the atomic set in a bigger one, then I can at least solve the big one, and hopefully I didn't blow it up by too much and this will work okay. So we have a hierarchy of ways of doing this based on semidefinite relaxations. They have a fancy name, theta bodies, because they are semidefinite relaxations of convex hulls, but with these we can actually go and get tighter and tighter bounds on these atomic norms for very, very general structures. Yes? 
>>: If you had the case where the convex hull [inaudible] so your cone is going to be [inaudible] and if [inaudible] is going to be [inaudible] and your next [inaudible] you [inaudible] in that case. >> Ben Recht: Okay, one more time, one more time, I don't think I got that. >>: Suppose you [inaudible] X on your [inaudible]. And you look at the cone and the cone intercepts a specific model X [inaudible] direction and you can take the [inaudible]? >> Ben Recht: Yes, yes, yes, absolutely. >>: So the shrinkage operator should always be solvable within NP. >> Ben Recht: Only if it's polyhedral. If it's polyhedral, then absolutely. Absolutely it is solvable by an [inaudible]. Absolutely. And moreover, the algorithm you are talking about is the homotopy algorithm. It is the simplex algorithm, the parametric simplex algorithm. It's what people do; this has been reinvented many times for L1. If this is the L1 norm, you do the homotopy algorithm where you just start at a vertex. You find your descent direction. You find that the solution is piecewise linear for a while until the active face shifts, and so now you have to switch faces. That thing has a lot of different names. >>: Lars. >> Ben Recht: The reason why I don't say Lars is that that is actually not the right way of doing it [laughter]. Lars was invented by, I mean, arguably brilliant statisticians at Stanford, but they didn't know linear programming, and actually the first instance of this was pointed out to me by Lieven Vandenberghe; that algorithm for L1 was invented by Wolfe in 1956. >>: Frank-Wolfe? >> Ben Recht: It's not Frank-Wolfe though. It's this parametric simplex algorithm. And what's cool about that algorithm, what's also nice about it, is that if this thing is polyhedral, it allows you to compute this optimization problem for every value of mu. >>: [inaudible]. >> Ben Recht: Then we are out of luck [laughter]. Then we do this [laughter]. 
>>: Going back to the [inaudible] I'm missing something. So you are saying that you can't compute this proximal operator because it's too hard for your A, so you just substitute your A? >> Ben Recht: Yeah. I'm just going to make a different norm. I substitute A with maybe something a little bit bigger. That gives me an outer approximation, and if I am willing to settle for an approximation, let me tell you two things we can do. I don't think I can solve the second one, but I did solve the first one. Thing one: you can approximate the convex hull from the outside, and we have a way of doing that using semidefinite programming. If you apply this to cut matrices, these rank-one sign matrices, and you apply this machinery, you get this thing called the max norm, which is something that Nati Srebro really popularized and that we have been working off of for a while. Now the second thing you can do, which is actually something I have been very fond of lately, and I know I didn't put this slide in but I will just tell you what this is: you can get an approximation from the inside as well. You just take a subset. In fact, you just grid your atoms. Sometimes that doesn't work, but sometimes that works just fine and sometimes it actually ends up giving you an excellent approximation. In that case you have an LP. Yes? >>: The original norm, if it's a [inaudible], means that you'll get the optimum on one of the edges, right? >> Ben Recht: Yes. >>: And this will gain you the sparsity and these nice properties that you are interested in. Once you've relaxed it, usually it's going to be kind of a smooth body, because this is easier to work [inaudible], but then you lose exactly the property that you wanted to achieve to begin with. >> Ben Recht: Yes. >>: [inaudible] [multiple speakers]. >>: It's like solving integer programming where by… >> Ben Recht: Oh yeah.
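The "grid your atoms" inner approximation reduces the atomic norm to a finite problem, and for a finite atom set it is indeed a linear program: minimize the total weight needed to write x as a nonnegative combination of the (signed) atoms. A hedged sketch, assuming a sign-symmetric atom set (the function name and the use of SciPy's LP solver are my choices, not from the talk):

```python
import numpy as np
from scipy.optimize import linprog

def atomic_norm_grid(x, atoms):
    """Atomic norm w.r.t. a finite, sign-symmetric atom set, as an LP.

    atoms: (m, d) array of atoms; -a is implicitly included for each a.
    Solves min sum(c_plus) + sum(c_minus)
           s.t. atoms.T @ (c_plus - c_minus) = x, c_plus, c_minus >= 0.
    """
    m, d = atoms.shape
    A_eq = np.hstack([atoms.T, -atoms.T])          # d x 2m constraint matrix
    res = linprog(c=np.ones(2 * m), A_eq=A_eq, b_eq=x,
                  bounds=[(0, None)] * (2 * m), method="highs")
    return res.fun
```

With the standard basis vectors as atoms this recovers the L1 norm exactly; a coarser grid of a continuous atom set gives an upper bound on the true atomic norm.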
>>: And then you round, which never worked well… >>: In which case you lose the whole thing. >> Ben Recht: We don't round. It doesn't work. Okay, there are two different things, yes, correct. >>: That's what I was very confused by. >>: Because what you're saying is that you can always take the bounding [inaudible]. >>: Yep. And then you're back to L2. You are solving [inaudible] again, and that seems kind of bad. >>: You lose all of the structure. >> Ben Recht: You lose all of the structure. So here's what I can tell you, and this is actually kind of weird. You are right that this seems to erase the structure. This picture sucks. >>: No, it's actually good, because… >> Ben Recht: Well… >>: [inaudible]. >> Ben Recht: But, but… >>: This is a problem, you know, it sucks for you because… [laughter]. >> Ben Recht: Let me tell you what actually happened, because this is fascinating to me. Let's say this [inaudible] problem is the one that I've looked at. You start with your atomic set A, and this shrinkage operator is NP-hard to compute. We do this semidefinite relaxation on A, and you look at the number of measurements required, and it turns out, first of all, the vertices are still exposed. Number two, to get exact recovery the number of measurements blows up by a factor of two, but your computation time goes from exponential to reasonable. That's wild. That is totally crazy. >>: [inaudible]. >> Ben Recht: Huh? >>: [inaudible]. >> Ben Recht: Well, I do another one of these factorization hacks and then I run some gradient descent, and then, if you want, we can solve it very quickly. >>: [inaudible]. >>: I'm still confused because… >>: [inaudible] the way to phrase it is: to begin with you said, my problem is that I have too many free variables. I have too few measurements, so I have to make some structural assumptions. >> Ben Recht: Exactly. >>: Now what you say is, look, if I make this [inaudible] assumption [inaudible]. >> Ben Recht: That's correct.
>>: Maybe in fact I can't compute this norm, right? >> Ben Recht: Right. >>: So maybe it falls down to where I can't compute this norm. >> Ben Recht: So I add more hypotheses… >>: Instead you say, okay, I will bound it. But actually you can take this, roll it back, and say, okay, I'm simply making a different structural assumption. >>: A weaker one. >>: It's a different one, doesn't matter, right? And this is actually what you're doing. Instead of making, say, a sparsity assumption, right, a true sparsity assumption is something that we can work with because it's zeros. The… >> Ben Recht: No, no, I agree with you 100%. I don't think that sparsity versus L1 is exactly what you said. Another way to say it is: we start with a set of atoms and candidate hypotheses, and I now know that I can't solve it, so I add more hypotheses. >>: Right. >> Ben Recht: Okay, fine, I know that they are not there… >>: So one way to put it is, yeah, I compute the norm differently, but actually you can roll it back and make it more explicit by saying, actually, this is my structural assumption. My structural assumption is now different. It's not that… >>: Use A2 instead of A1. >>: Yeah. >> Ben Recht: Yeah. >>: [inaudible]. >> Ben Recht: Yeah. >>: But then I still don't know how to go from A2 to A1, if you really care about A1. >> Ben Recht: But let's say, the reason I don't like it… >>: [inaudible] arbitrary. A1 was, you just wanted sparsity, right, because, yeah, it looks fine. You know that you need to make some structural assumption, so maybe this structural assumption is hard to work with. Here is a different one. >>: Yeah, which is a superset. >> Ben Recht: A superset, yeah. >>: In other words… >>: It just happens to be a superset. >> Ben Recht: Yeah, it could be; a subset would be easier, too, if you get lucky; a subset could be easier. Like for example, I could say it's this one [laughter]; that's a better assumption.
>>: You know… >> Ben Recht: No, really, that's actually really good, because you are right. It's changing, basically, what this, okay, I like that. What this hierarchy is doing is changing the structural assumption. The only reason why I don't want to go completely in on it is that I very rarely understand what the things that come out of this machinery actually are. What structural assumption do I have? I don't know. >>: But this is kind of like when it became popular in machine learning to have regularization terms, which is kind of an arbitrary thing; I mean, we don't understand what it does, it just happens to work. But if you are capable of rolling it back and explicitly saying that this is my assumption, that is very powerful, because now I can look at my problem and say: does this assumption look reasonable for the problem that I am having, or is it just hidden in some unknown bound that is just a guess? >> Ben Recht: No, no. Oftentimes we can characterize how these assumptions change. >>: [inaudible] do you want [inaudible]? >> Ben Recht: Do I want my million dollars? [laughter]. They gave it away, man. I know, I didn't quite get there. I see. >>: I'm sorry, so I'm still a little bit lost. >> Ben Recht: Me too, now. >>: [inaudible] huge rounding [inaudible]… >> Ben Recht: No, no, this is the point. What's your name? >>: Ronnie. >> Ben Recht: Ronnie, so what Ronnie was saying is quite correct. We are just changing the structural assumption. You don't have to round. >>: [inaudible] bodies or anything you just throw away. >>: [inaudible] to say where it's connected, you just, something you say, there is some relation [inaudible], so one is strictly bigger and the other one is strictly smaller, or sometimes we can even give a constant bound on the difference, which is fine, but actually you don't need that.
>> Ben Recht: But the thing that I can guarantee… >>: [inaudible] the problem is that you can't. >> Ben Recht: But Ronnie, the one thing I will say is that the nice thing about the theta bodies is that they are guaranteed to keep the things on the outside of the--so it's not like a bad thing to be. A bad thing would be, I'm trying to think of a good example. You can imagine you have, I don't know if I'll be able to draw this picture. Yeah, okay, I can. Okay, we are on the unit sphere and we take the, this is not quite right. No, that's not quite it. >>: [inaudible] portal [inaudible]. >> Ben Recht: Yeah, the nice thing about these bodies is that you're guaranteed that the points you started with, the vertices you started with, are still extreme points. >>: Yeah, it's nice in cases where you can do that, but actually nothing--for example, in random walks, when you run the walks, sometimes these points are [inaudible] points to work with because they tend to be narrow and stuff like that, so sometimes you take the intersection of this shape [inaudible], which kind of makes it isotropic, and we can say that for one reason or another I want to chop off these edges. Why? Because. But since the choice of the structure is arbitrary, why low rank? Yeah, it could have been anything. When you make a low-rank assumption, it's just because it seems nice and we hope that we can work with it, but there is no underlying truth of nature that says that… >> Ben Recht: Uh oh [laughter], we were having this conversation earlier. >>: But that's the wrong example. [inaudible]. >>: Because [inaudible] is compute one last. Here the [inaudible] is not something [inaudible] you think about. >>: [inaudible] to see if A2, so actually I say I want to work with A1. I can't because of computational constraints, so I work with A2. >>: So you are throwing away A1 entirely [inaudible].
That's the only question. Are we throwing away elementary [inaudible] incomprehensible gulf that happens when you… >> Ben Recht: I have it on the next slide. Sh**. >>: Describe it to me in simple terms--I am just a poor CS person. What's going on? >> Ben Recht: Come on. I'm on to you. This is the problem. I am on to you. Where is my--there we go. Here's how it works. >>: Okay. >> Ben Recht: It's actually a cute trick. So let's actually go into what the details are. We have to assume something a little bit wacky; it's not that wacky. We assume that A is algebraic. All of the examples I gave actually are. And what is an algebraic variety? It's a fancy way of saying it's the zero set of some set of polynomials. >>: [inaudible] polynomials? >> Ben Recht: Yes. So the f's live in something called an ideal, but all you have to think is that A is the zero set of some polynomials. For example, if I want to look at the set of all sparsity-one unit-norm vectors, they are the vanishing points of the polynomials x_i times x_j for every pair of distinct coordinates i and j, together with the constraint that the sum of the squares of the coordinates equals one. Those are the zeros of that system; that's the variety that gives rise to these sparsity-one guys. And you can do the same thing for all of these models. That's really not the important part, but that's the thing that allows us to do computation. The important part is this: the dual norm. Look, if we can compute the primal norm then we can compute the dual norm, and this is kind of important. The shrinkage operator for the primal norm is essentially just projection onto a ball of the dual norm; those two things are completely the same, so you just have to be able to do one or the other. So let's look at the dual norm. The dual norm has a cool form. For the primal norm you had to blow up the convex hull. For the dual we just take an atom.
We take the inner product with each atom and we take the max, and it's too bad that Over had to leave. Oh, look, he's back [laughter]. It looks like a Gaussian complexity: if I put Gaussian noise in here, this dual norm is just the Gaussian complexity of the set A, so this is a thing that we know. So now we want to say: when is this thing less than tau? That's true if and only if the inner product of v with each atom a equals tau minus q(a), where q is nonnegative everywhere on the set A. That makes sense; sure, that is the less-than condition. What is this function q? Well, I'm going to decompose it. I'm going to pick a form for q, which basically amounts to picking a form for this linear function. I have a linear function and I'm going to parameterize it in two parts. The first part is h, which has to be nonnegative everywhere, for all x, so it has to be a nonnegative function. For the second part I am going to take g to be something in this set of polynomials, or actually the linear span of that set of polynomials and products of arbitrary things times that set of polynomials. I can do a lot of wacky things; I just want g to be a polynomial that annihilates all of the atoms. So g has to be zero on all of the atoms and h has to be nonnegative everywhere; there is a parameterization. It turns out that for algebraic things this is a tight parameterization; this is actually if-and-only-if for algebraic varieties. In general, this is a nice way to do an approximation, and it tells you how to make more approximations: I just have to come up with a good parameterization for linear functions in terms of nonnegative functions and things that vanish on the set of atoms. So it's really designed to be catered to the set of atoms that you started with.
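For a finite, sign-symmetric atom set, the dual norm just described is literally the largest correlation with any atom, and the prox/dual-projection equivalence from the previous slide is the Moreau decomposition: the prox of the norm equals the identity minus Euclidean projection onto the dual-norm ball. A minimal sketch, using L1 (whose dual ball is the l-infinity ball) as the concrete case; the function names are mine:

```python
import numpy as np

def dual_atomic_norm(v, atoms):
    """Dual atomic norm over a sign-symmetric finite atom set:
    ||v||_A* = max over atoms a of |<v, a>|."""
    return np.max(np.abs(atoms @ v))

def prox_l1_via_dual(x, t):
    """Moreau decomposition: prox of t*||.||_1 equals x minus the
    Euclidean projection of x onto the dual ball {z : ||z||_inf <= t}."""
    return x - np.clip(x, -t, t)
```

With the standard basis as atoms, `dual_atomic_norm` reduces to the l-infinity norm, and `prox_l1_via_dual` reproduces soft thresholding, illustrating that computing either the prox or the dual-ball projection suffices.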
So the relaxation, again, the theta bodies, comes from saying that instead of requiring h to be a nonnegative function, we require h to be a sum of squares of polynomials. It turns out that once h is a sum of squares of polynomials and g is in the ideal, the whole thing, the whole problem of computing the dual norm, is a semidefinite program; or rather, this approximation is a semidefinite program. That's the mumbo-jumbo, but it is a kind of concrete mumbo-jumbo that we can solve sometimes, and I know this guy [laughter], we have tricks for doing these SDPs, and that's basically how we do that, and it always gives us a lower bound. And yes, there is a nice survey of these things; actually Rekha Thomas is just across the bridge, so you can go tell her to come over here and we can talk about theta bodies at some point, because she would do a better job than me. So that's one way to make these approximations. >>: So you make a nest of SDPs that hopefully get you closer and closer and closer. >> Ben Recht: Yeah, and it works. I mean, for the problems where we have been able to do it--it's not always doable, but it's doable for that cut norm problem, for these cut matrix problems, and it works well. It turns out that it ends up working better than the nuclear norm on a lot of different problems that we've tried, so we actually have some reason to think that if you are looking for sums of clusters or things with bounded dynamic range, the max norm is a better thing to use. And all you have to do to change the code, that factorization heuristic, is this: instead of doing a shrinkage step, you take a step along the gradient just the same way as we did before in the low-rank case, and then you guarantee that the norm is less than some bound. You just squish the norm, and that's it, a very minor adjustment. >>: I have a single question. So this reminds me that sometimes [inaudible] proximal range [inaudible] with respect to A is hard to compute.
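The "squish the norm" adjustment above can be sketched as projected gradient descent. For concreteness this toy version uses the Euclidean norm ball, where radial rescaling is exactly the Euclidean projection; for the max norm the talk's recipe would rescale the factors instead, so treat this as an illustration of the pattern, not the actual code from the talk:

```python
import numpy as np

def projected_gradient(A, b, R, step=None, iters=500):
    """Projected gradient for min ||Ax - b||^2 subject to ||x||_2 <= R.

    For the Euclidean ball, 'squishing' (radial rescaling) is exactly
    the projection; for other norm balls rescaling is only a heuristic.
    """
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # safe step from spectral norm
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = x - step * A.T @ (A @ x - b)          # gradient step
        n = np.linalg.norm(x)
        if n > R:
            x *= R / n                            # squish back onto the ball
    return x
```

The only line that changes relative to an unconstrained solver is the squish, which matches the "very minor adjustment" claim.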
It could be the case that the dual norm of A is very simple. So… >> Ben Recht: Well, if the dual norm is easy then the proximal operator is easy. I have that one here. Wait, wait. Hold on a second. That should be right above this. Oh, it's not there. Man, sometimes you want these slides to be there and then they just disappear. Anyway, the dual norm and the proximal--so we look at the solution to one and the solution to the other, and you know what, rather than try to find it, I am just going to open another presentation where I know where it is. Come on. IPM, there, shoooo, yeah, there we go. Okay? If you have the proximal operator… >>: Yeah, I don't mean [inaudible], that's why I've--possibly we can avoid the proximal [inaudible] and use greedy methods to solve it. Basically I compute the gradient, so I take the minus gradient and I take the atom a that maximizes the inner product with the [inaudible]. >> Ben Recht: Yeah, right. See, there are two problems with, I mean, greedy algorithms are actually a great way to go, and actually this atomic stuff is… >>: [inaudible] become more of a problem [inaudible] proximal [inaudible] sometimes A is very hard. >> Ben Recht: There are two things that are troublesome about the greedy methods. One is that you can't get the sparsity bounds with them. It works for L1, but oftentimes, even for the nuclear norm, the greedy methods give you full decompositions unless you are very careful; you have to do some funky stuff. The second thing is that the greedy methods are often no easier than computing the proximal operator; they amount to doing the same thing. If you look at the dual norm, you have to find the atom maximally correlated with some vector. That's the same thing you have to do in the greedy step, so if you can do that, you can run a greedy algorithm and you can also solve the dual norm projection. Either way, you have the same complexity.
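The greedy scheme the questioner describes, picking the atom most correlated with the negative gradient, is the Frank-Wolfe (conditional gradient) method. A sketch over the atomic-norm ball of a finite, sign-symmetric atom set, minimizing a simple least-squares objective (names and step-size schedule are standard choices of mine, not from the talk):

```python
import numpy as np

def frank_wolfe(atoms, b, R, iters=200):
    """Frank-Wolfe for min 0.5*||x - b||^2 over {x : ||x||_A <= R}.

    Each step solves the linear minimization oracle over the ball:
    pick the atom (up to sign) most correlated with the negative
    gradient, then move to a convex combination with the current iterate.
    """
    x = np.zeros_like(b)
    for k in range(iters):
        grad = x - b
        scores = atoms @ grad                       # correlation with gradient
        i = np.argmax(np.abs(scores))               # best atom (the dual-norm step)
        s = -np.sign(scores[i]) * R * atoms[i]      # vertex minimizing <grad, s>
        gamma = 2.0 / (k + 2.0)                     # standard FW step size
        x = (1 - gamma) * x + gamma * s
    return x
```

Note that the argmax over atoms is exactly the dual-norm computation, which is the point made above: the greedy oracle and the dual-norm projection have the same complexity.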
Anyway, so this is… Anyway, we are out of time guys. Well thank you, thanks everybody for paying attention. [applause].