Document 17857478

>> John Platt: It's my pleasure to introduce Ben Recht. As you can see from the slide, he is a
professor at the University of Wisconsin at Madison studying one of the possible computer
sciences. I suspect that is data science. He's nodding, okay. Previously he was a postdoc at
Caltech and before that he got his PhD from the Media Lab at MIT and so here's Ben.
>> Ben Recht: Thanks John. It's great to be here today. I did pick this title to try to make it as
mathematically intimidating as possible, but I think a more appropriate one for today is less
lofty than the original title suggested.
I just want to give some basic ideas of how we might make predictions when we are
short on information. What I mean by this is a very classical, canonical problem that we're
always faced with: we collect a ton of data, we would like to make some kind of inference
about the data, and even though we have gigabytes and gigabytes or terabytes of data, we still
have a lot of missing data and a lot of missing information. Somehow we would like to make
inferences about what is not there. So the idea today is how to incorporate structure that we
know, what is the appropriate way, or an appropriate way, to incorporate that structure into
these problems to make them well posed and such that we can solve them efficiently. So let me start
with thanking my collaborators. As you'll see there is a bit of a dichotomy here. I have a lot of
folks at Wisconsin who I'm working with on theory, and I have one person, who I think a lot
of people in the room know, who I am working with on algorithms, and I guess we just need one.
Really, once you get Chris on a project you don't need more than one algorithms person.
He's just kind of a machine, so that's good. So let me begin with a motivating example that is
very dear to my heart, and that is recommender systems, since it's a problem I've been thinking
about for quite some time, notably because Amazon won't let me stop thinking about it. Every
time I go to that website they try to make me buy stuff based on my click records and of
course when I really got interested in this problem was when Netflix offered a lot of money to
improve their recommendation system, notably this was a snapshot I took of my page many,
many years ago, where they recommended this movie Stalker. It's a Tarkovsky movie. Is it
Tarkovsky? I can't remember. It's a Russian movie where basically people walk around an
abandoned quarry and throw a rock and yeah, yeah, and so apparently based on liking these
other three movies, this was the logical recommendation and the question is how did Netflix
make that decision. My wife who is an art and film historian suggested that if you take these
movies, these three movies, you put them into a blender and then you skim the plot off the top
[laughter], and you get Stalker at the end; that makes sense. Of course, the other place where
we have these things all over the place is on internet dating sites, which I've become much
more of a believer in since my sister found a very nice husband [laughter] using this service.
Now I actually do believe that this stuff works. So in all of these cases and all of these web
services, we can only capture a small fraction of the information that we need about all of the
users. So the question really becomes how are we actually going to make good quality
predictions to keep people coming back. And of course as I mentioned Netflix tried and that's
how I got interested in the first place. Even though I didn't do very well at winning this million
dollars, they did offer a lot of money and they got a lot of these machine learning folks thinking
about how should we actually go and design large scalable recommendation systems. So these
are the seven guys who won. I think everybody knows these guys by now. The brilliant idea
here is that they got the entire community thinking and only had to pay seven guys a
million dollars for 3 1/2 years of work. I think Netflix wins [laughter], regardless of what the
outcome was. So again, just a reminder about how this worked: they gave out 100 million
movie ratings, given on a scale of 1 to 5, and you had to predict held-out ratings with high
accuracy. The number of movies was about 20,000 and the number of users about 500,000, and
if I just multiply those numbers together I get about 8 billion. So I have a matrix that should have
about 8 billion entries and I get to see 100 million, and now the question is just what do I
do. How do I fill in those missing entries? So the abstract version in the way that we kind of
look at the, I mean this is a very common way of abstracting this problem, is to say that we
want to complete a matrix. Matrix completion is kind of a classic numerical analysis problem
that's been around in lots of different forms for a long time and it's just saying that if I have
some partial information, maybe filled in in black here in some matrix, how do I infer the white?
So obviously we need some structure here, right, because you could just put anything in there
and it'll be a valid completion. In the case of the Netflix problem, a reasonable kind of
completion would be a low rank one, because we all believe that the first thing to try, if we
have a full data matrix and we want to make a good predictor, is to run some PCA. So if we
have a lot of missing entries, maybe it would also be a good idea to try to run a PCA even
though the entries are missing. The reason why this is particularly
relevant in the case we have missing data is if we just do parameter counting, the X matrix has k
times n entries which is that 8 billion I showed, but the factor matrix has r times k plus n
entries, which is dramatically smaller. So we have a huge reduction from k times n to r times k
plus n, from 8 billion to maybe tens of millions of parameters. If we have 100 million ratings,
now the problem almost looks overdetermined and we just have to figure out how we
actually infer a low rank matrix. Of course low rank matrices are just everywhere. They're
not just in these data matrices on the internet. Rank is a really useful and powerful way of kind
of summarizing what is simple about some model or system. For example, if I want to do some
sensor network embedding problem and I look at the matrix of all of the inner products
between points, the Gram matrix, that will have rank equal to two if the points are lying on the
floor, or three if they are in 3-D space. And in multitask
learning where--this has actually been a very popular way of saying that classifiers for let's say
the same digit written by different people should somehow have a low rank structure as well,
and this actually has been very powerful in trying to train lots of SVMs at once. Where I kind of
got into this problem to begin with, I mean I always say it's about the Netflix problem, but really
it's because I was hanging out at Caltech with all these controls guys and in controls, rank is
everywhere, because it's kind of the way of summarizing the state of a system, so essentially
this special matrix that's called a Hankel matrix; the rank of that matrix tells you how many
numbers I need to predict the future given the past. I could throw away all my past
measurements, just keep a number of parameters equal to the rank of that matrix. So these
are all nice ways of summarizing simplicity and of course, it would be nice to be able to find the
lowest rank solution of some kind of set, and so this is kind of our canonical inverse problem.
We have Ax equals b, I am going to tell you that x has low rank, and now we just want to solve
that. So even this simplest version of the rank minimization problem is actually very hard; it's
NP hard, and, well, it depends on who you ask. If you talk to the computer science theory
people, that might be a good enough excuse to go home, because not only is it hard, but it's
also hard to approximate; there is not much hope of doing approximation algorithms. But of
course a million dollars is on the table, and when someone tells you the problem is NP hard,
what are you going to do? Well, we're just going to try something anyway, going to try some
heuristic. Yes?
>>: [inaudible] approximation [inaudible]?
>> Ben Recht: I think logarithmic in the dimension of the matrix is the best you can hope for,
especially once you add noise. There are lots of different ways to get at it, because you're
basically solving quadratic equations, so that could be good and that could be bad, but I think
that's the best guess. Moreover, you actually don't care
about estimating the rank exactly; you actually care about that decision variable. So what is a
good heuristic? What is a reasonable heuristic? So I'm proposing that this thing is a reasonable
heuristic which is what I've written here, which is to minimize the norm called the nuclear
norm. I'll tell you a little bit more about what the nuclear norm is in a second, subject to the
data. Now I say that that's reasonable because, if we squint a little bit, and again I'm going to
come back to it, here is the algorithm for solving that problem. We factor the matrix X as L
times R transpose and pick some initial starting point; then I pick one of the entries that's been
given to me in the Netflix prize and compute this residual e, which is equal to the corresponding
entry of L times R transpose minus M_uv. Then I'm just going to update my factors according to
this rule for that one entry and then repeat, so essentially this is running a stochastic gradient
algorithm on this norm minimization problem. Basically the idea is that this entry is purely
determined by the product
improvement on the Netflix prize problem and the guys who won ended up using this is one of
the baseline algorithms that they combined into their mega-classifier. Actually this gradient
descent algorithm, when it appeared wasn't actually called minimizing the nuclear norm. It was
just called the SVD heuristic, and it was the one that appeared on this fellow Simon Funk’s
LiveJournal page I think about three months after the thing started. So it turned out that he
actually was minimizing the new kernel of the matrix with this data. Now the question is why is
that a reasonable thing to do. Yes?
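The update rule just described can be sketched in a few lines. This is a toy version on synthetic data, not the Netflix code: the matrix sizes, step size, number of epochs, and the regularization weight `lam` are all made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, r = 100, 80, 5

# Synthetic ground-truth low rank matrix and a set of observed entries.
M = rng.standard_normal((n_users, r)) @ rng.standard_normal((r, n_movies))
obs = {(u, v): M[u, v]
       for u, v in zip(rng.integers(0, n_users, 2000),
                       rng.integers(0, n_movies, 2000))}

L = 0.1 * rng.standard_normal((n_users, r))   # left factor, X ~ L R^T
R = 0.1 * rng.standard_normal((n_movies, r))  # right factor

def rmse():
    return np.sqrt(np.mean([(L[u] @ R[v] - m) ** 2 for (u, v), m in obs.items()]))

err_before = rmse()
step, lam = 0.02, 0.01
for epoch in range(50):
    for (u, v), m in obs.items():
        e = L[u] @ R[v] - m                   # residual at the sampled entry
        # update both factor rows simultaneously from the old values
        L[u], R[v] = (L[u] - step * (e * R[v] + lam * L[u]),
                      R[v] - step * (e * L[u] + lam * R[v]))
err_after = rmse()
```

Each step touches only one row of L and one row of R, which is exactly why the heuristic scales to 100 million ratings.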
>>: [inaudible] letters flying around.
>> Ben Recht: Yeah, okay, here.
>>: So X is equal to L times R transpose?
>> Ben Recht: Uh-huh. Oh, sorry did I leave that off of here? Yes. X is equal to L times R
transpose. The thing you're searching for is this guy. We basically made a substitution here
that takes X to L times R transpose. Once you do that, then saying phi of X is equal to b, a
measurement of X equal to b, is the same as saying we want the entry here to correspond to
one of the entries over here, so it's just filling…
>>: [inaudible] is like the mask.
>> Ben Recht: Yes, it's just saying I am going to give you one of the entries. Correct.
>>: [inaudible] U and B.
>> Ben Recht: Where is U and B? Oh, it's just a component.
>>: [inaudible] subscript?
>> Ben Recht: A subscript, yeah.
>>: It's like the one that is the mask. The [inaudible].
>> Ben Recht: Yeah. Those are my indices; I just pick one of them.
>>: So X and M are synonymous?
>> Ben Recht: No. M is a true thing. X is the…
>>: M is true, oh.
>> Ben Recht: Yeah.
>>: [inaudible].
>> Ben Recht: Yes. That's right.
>>: [inaudible] estimate of…
>> Ben Recht: That's right.
>>: Okay.
>> Ben Recht: That's right. This will be much clearer when I get to how you actually show that
these two things are the same, but I'll get to that in a little bit. I just wanted to…
>>: [inaudible]?
>> Ben Recht: Huh?
>>: Where is the nuclear norm?
>> Ben Recht: Where is the nuclear norm here?
>>: Yeah, because…
>> Ben Recht: Hold that thought. Yeah, hold that thought. We are going to come back to that
[laughter]. That's good. No. I actually have--I'm going to come back to that, so just hold that
thought. Let me just do two more quick examples where I want to give a slightly bigger
picture, and then we will come back to these examples. So the other--oh,
right. John asked for me to talk maybe a little bit about kind of some of the bio nano something
or other projects I might've been working with, and I did want to mention this one, where with
some collaborators at Stanford we had been looking at trying to find biomarkers for brain
cancer. So just to set this up, brain cancer is obviously a terrible, terrible
disease and it's particularly terrible in terms of cancer because we have no way of detecting it
before you are symptomatic. I mean, you could probably see it on imaging, but typically we
don't have any good ways of understanding how to do prescreening for brain cancer. There's
no sense of like lumps or polyps and so the hope and actually most of the hope in cancer
detection is to say, and in cancer treatment is to say, if we get it early then it is the most
treatable. So I've been working with these doctors and they're actually trying to do this model
in rats to see what they can see in rats early, pre-imaging, pre-symptoms. So they have this
procedure where they treat a pregnant rat with a carcinogen. The offspring are born
completely indistinguishable from wild type offspring and then invariably in this green region,
they start to develop cancer and you can't see it on imaging, but they can see it because they
kill them and then they slice their brains up like prosciutto and they see these little nests. So
there are polyps there. There are these kinds of pre-cancers in the brain. In this yellow region
you can actually see them on imaging. On the red region they typically die. The question is can
we actually find something without imaging in this green region. The way that they're going to
go about it is they extract spinal fluid and extract blood and they run it through a mass
spectrometer and they get stuff out that looks like this so sort of a canonical bio marker mining
problem.
>>: You do this when they are little and then you wait and see if they eventually die of brain
cancer?
>>: [inaudible] access?
>> Ben Recht: The axis is mass over charge.
>>: Mass over charge?
>> Ben Recht: Mass over charge, and then amplitude on the y axis, so kind of our classic mass
spectrometry, yeah, yeah.
>>: So what's the experiment? Like they do this to all the little rats…
>> Ben Recht: So this is one of those interesting things when we interact with biologists. We
would like to do time series analysis. The way they do time series analysis is they run multiple
trials and then take samples where they kill the subject [laughter] at some time in that slice. So
invariably when they are doing that spinal fluid thing they also kill it and slice the brain up, so
they do it at multiple slices, so actually this experiment is 50 rats. It turns out that rats are
terribly expensive, so that took a long time and cost a lot of money, but in
the green region, the yellow region and the red region they would select some rat from the
pool, extract the spinal fluid, put them in the MRI and then kill them [laughter].
>>: I see, so if they could look in the green region to see if they had these nests and then you
take that as a positive?
>> Ben Recht: That is correct. But what's crazy about their model and this is actually one of
these interesting things about biologists. Rats are kind of genetically diverse as compared to
mice; that's another thing you learn. Mice are all exactly the same; they are just like clones.
Rats are a little bit more genetically diverse. Invariably this treatment
that they give the mother causes cancer. I mean if they give those rats, those rats will…
>>: So they are…
>> Ben Recht: It's almost one hundred percent. If they've been treated, they get
cancer.
>>: They can figure out how to give cancer, but they don't know how…
>> Ben Recht: They're very good at giving cancer. They are very bad at treating it [laughter].
>>: So it's deterministic. It's not like you have to worry about what will happen. They will get it
or they won't get it.
>> Ben Recht: That's right, and that's why, that's why they are so fond of the model. I think this
also happens in medical research all the time. If you could find something that is all
deterministic, you kind of latch onto it because so many other things are variable, including
these mass spec lines.
>>: So they just too massive [inaudible] quit?
>> Ben Recht: Yeah.
>>: And I don't do [inaudible] trying to find the [inaudible]?
>> Ben Recht: Well, the problem is which peak do we look at; that is kind of the problem that
they…
>>: [inaudible].
>> Ben Recht: Huh?
>>: [inaudible] died…
>> Ben Recht: Yes, exactly. But when they are doing this mass spectrometry, the question just
becomes which of these am I actually going to extract, because the other problem is that not
only are the rats expensive, but lab
techs are also [laughter] somewhat expensive for these guys, but you are right. I think that the
question is you could try to go after all of the peaks, but really the problem here is that the
number of peaks is way bigger than the number of rats we are ever going to have, so you're
going to have hundreds of peaks, and 50 rats took 3 1/2 years, which is crazy. When I say these
numbers, these are days; I'm sorry, I should've said that. These are days, so that's three to six
months to grow a rat to death, so these things take a while, and you can't have too many of
them and you have to take care of them. And so what are we going to
do? We are again in the same situation that we were in before, where we have far
fewer examples than we have parameters, unless only a couple of those peaks
really matter. So what we actually ended up looking at was kind of the standard L1
minimization for actually pulling out the markers. So I think most people at this point of the
game have seen that that seems to be a popular way to extract sparse signals when you have
some data and, whoops. What happened here? Sorry. And of course there has been all of this
hullabaloo about compressed sensing for a very related reason: when we actually want to
acquire data, perhaps we can use the fact that the model is sparse to reduce the number of
measurements that we actually have to take. Yeah?
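The L1 heuristic mentioned here can be sketched as a small linear program. This toy example uses the standard split x = p − q with p, q ≥ 0 to turn basis pursuit (minimize the L1 norm subject to Phi x = y) into an LP; all sizes are illustrative, not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n, k = 40, 100, 5                 # measurements, dimension, sparsity
Phi = rng.standard_normal((m, n))

# Ground-truth sparse vector and its measurements.
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.standard_normal(k)
y = Phi @ x_true

# LP: minimize sum(p) + sum(q) subject to Phi (p - q) = y, p, q >= 0.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([Phi, -Phi]), b_eq=y,
              bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]        # recovered sparse vector
```

With only 40 Gaussian measurements of a 5-sparse vector in 100 dimensions, this LP typically recovers x exactly, which is the phenomenon compressed sensing formalizes.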
>>: So you said you also have a control of healthy rats?
>> Ben Recht: Yes, I'm sorry, you have two groups, yeah. The control group takes just as long to
raise, unfortunately [laughter], so there's really not a better option; you do the exact same
experiment.
>>: You can't reuse them.
>>: You can't reuse them?
>> Ben Recht: You can't reuse them, and the worst part is that everything you do to these rats
does kill them, so you can't even do longitudinal studies. They're just hoping that everything
happens the same. You either kill it at 30 or 60 or 90 or 120 days.
>>: They can't do an MRI or something?
>> Ben Recht: No, they do do MRI, but then if you want to do histology, which is this
prosciutto-izing of the brain, there is not much you can do. Isn't that what it is? [laughter].
>>: Neat verb.
>> Ben Recht: [laughter], sorry. I'm getting off track. Let me just jump through this really
quickly. So again this is the same problem we had before. Even if we just want to solve a linear
system and find the sparsest solution, that's hard; that's this cardinality minimization problem.
But now we actually have a whole field dedicated to minimizing the L1 norm subject to some
constraints, and it now has a name: compressed sensing. So we have two examples where we
pick a norm and it seems to do really
well on several different problems. Of course, and I'm going to skip that last slide because you
certainly don't need to know this, machine learning is kind of the same thing, right? Machine
learning is the same thing. In the discriminative version of machine learning we have some set
of features and some kind of model that we want to fit, and we're going to optimize some
fitness function, which could again be a least squares or a linear system. Then we have to pick
some set of functions to optimize over, and we are in the same situation that we were in
before: the number of samples, or the sample complexity, is somehow dictated by a
smoothness parameter of the underlying space of functions that we are operating on, and if
the function that we are trying to find in our hypothesis space is smooth, then we are going to
be able to find it. So again, there are a lot of numbers here: n would be your number of
samples, d would be the dimension of the thing that we are looking for, and s would be some
parameter which is called the smoothness parameter. If that smoothness parameter is large
enough we will actually need substantially fewer samples than the [inaudible] cardinality would
indicate. So
this is kind of one of those classic arguments that we get in neural networks, kind of the same
situation that we've been in before. I'm going by that fast so we don't dwell on it too long,
because I know Leon's going to bug me. In all these cases we have [laughter] complex systems;
they generate data, we want to generate some predictions, and the only way to make some
sense of that is to leverage some structure. So here is the problem I'm really going
to focus down on for the rest of today: I want to find a solution of y equals Phi times x. This is
underdetermined, so m is less than n, and let's just assume, to make ourselves happy, that
there is a solution. Once there is one solution, we have an infinite number, and if there is an
infinite number, which one do we pick?
So as I said, what we want to do is leverage some notion, some structure, some kind of baseline
structure and that could either be sparsity. That could be rank. That could be the smoothness.
That could be some kind of symmetry. I'll tell you a few more examples of things that we can
leverage, and the question is: is there a way to just do this with a crank? Meaning that you tell
me which of these boxes I have, which structure is present and then I'm just going to give you a
reasonable algorithm, some prototype algorithm that maybe we could just solve right off the
shelf without having to think too hard, and where I can actually predict for you what your
sample complexity should be and where I can give you bounds on the error. So everybody is
okay with at least that set up, rather than these motivating examples? So let me just go
through and do this with cartoons, and show how all of these previous results that I went
through in a haphazard fashion are derived. So the first one is just where does this L1 norm stuff come
from? So here are our one sparse vectors of Euclidean norm one. If we draw in the convex hull
we get a norm. It's the unit ball of the L1 norm. And if I think about what does it mean to
minimize the L1 norm subject to some equations, well, basically I have my space phi X equals Y,
that's my set of equations. And if I want the minimum L1 norm solution what I will do, is I will
take the L1 ball and inflate it until it hits. Lo and behold, it hits on one of these corners,
which happens to be a sparse solution in this case. And this picture is pretty much directly
stolen from the first compressed sensing paper by Candes, Romberg and Tao. This is exactly
the motivation for why that should be a reasonable model to use to find a sparse solution from
some affine set. Now rank; now I'm going to get back to [inaudible]. Here are my 2 x 2
symmetric matrices, plotted in 3-D. Again, I can only do these little cartoons, so this is about
the biggest matrix problem we can look at with a really good picture; the entries are x, y; y, z.
The set of all rank one matrices that have unit Euclidean norm are these two circles. If I shade
in this convex hull, that turns out to also be the unit ball of a norm. In this case it is the sum of
the singular values of the matrix. And again, we might expect, just by the
same picture, that if we were going to go and try to minimize this norm, which is called the
nuclear norm, subject to some linear constraints, we'll hit the boundary of this ball somewhere,
just some dilated copy of the ball, and those boundary points are the places where we have
low rank solutions. This was kind of the basis for how we analyzed this in our papers on matrix
completion and rank minimization. Again, it's kind of the same picture in that case. Now, I
could've jumped ahead, but I'm just going to do it right now, to say how does this relate
to that stochastic gradient heuristic for solving the Netflix prize problem? Let's just do a quick
run through again of this type of problem. I have the nuclear norm of X is equal to the sum of
the singular values of this matrix, and then I'm going to have just my phi X equals Y as my
generic linear inverse problem. Now let me parameterize X as L times R transpose. If X has
singular value decomposition U Sigma V star, I'm going to pick a particular factorization: L is U
Sigma to the one half and R is V Sigma to the one half. That is a low rank parameterization of
the true solution, or of the decision variable. Let me plug that back in, plugging in L and R
transpose where X was, and it turns out you're left with the squared Euclidean norm of L plus
the squared Euclidean norm of R, because, going back to our linear algebra days, this thing is an
orthogonal matrix times a diagonal matrix times an orthogonal matrix. So if I look at the sum of
the squares of the entries of L, that's just the sum of the squares of Sigma to the one half,
which is just the sum of Sigma, the sum of the singular values, and the same for R. So now it's
much simpler, and if I just take this constraint and add a, sorry, a Lagrangian penalty here, we
get kind of your standard regularized version. We have a sum of squares, we have a penalty in
L2, and now that's nice and smooth and you can run gradient descent on it.
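This identity is easy to check numerically. A small sketch, assuming the balanced factorization L = U Sigma^(1/2), R = V Sigma^(1/2); the test matrix is arbitrary, and note that for this factorization the sum of the two squared norms comes out to twice the nuclear norm:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 8))  # rank-3 matrix

# Balanced factorization from the SVD: L = U Sigma^{1/2}, R = V Sigma^{1/2}.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
L = U * np.sqrt(s)          # same as U @ diag(sqrt(s))
R = Vt.T * np.sqrt(s)       # same as V @ diag(sqrt(s))

nuclear_norm = s.sum()                          # sum of singular values
factored_penalty = np.sum(L**2) + np.sum(R**2)  # ||L||_F^2 + ||R||_F^2
```

So the regularizer (||L||_F^2 + ||R||_F^2) / 2 agrees with the nuclear norm of X at this balanced point, which is the heart of why the SGD heuristic is implicitly a nuclear norm method.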
>>: So are you adding [inaudible] both L and R for stability reasons? Because both of them
yield the same thing.
>> Ben Recht: In the cost function? Yeah, as opposed to just letting R be free? Good question,
that's a good question, because it would be symmetric, because they are equal at the optimal
value, so maybe you just have to penalize L. I haven't tried it. It would be worth taking one out
and seeing if you can regularize only one of the factors. But we just do it for
symmetry, really, to not prefer one or the other. So this heuristic, the successful and
common-sense thing to do, was actually solving this nuclear norm problem.
>>: Without realizing it.
>> Ben Recht: Oh yeah, and we didn't realize it either for quite some time [laughter]. Hindsight
always being 20/20, of course, they were always doing the right thing, a very principled way of
solving these things. I guess that's the point of what I want to get to: you do the right thing
most of the time without realizing it; is there a way, if you have something that you don't know
how to do, to have this crank to turn? Yes?
>>: [inaudible] question.
>> Ben Recht: Yeah, go, go.
>>: Multiplier usually has…
>> Ben Recht: Lagrangian terms.
>>: Well, there is the Lagrangian term and then there's a stabilizer term. You've written down
a stabilizer term.
>> Ben Recht: Yeah.
>>: This is the penalty method; [inaudible] this is really what they are minimizing. It's not,
[inaudible].
>> Ben Recht: Yeah, that's fair, that's fair, that's fair.
>>: So there's no lambda times [inaudible] squared, and then you wiggle lambda around with
a…
>> Ben Recht: It actually works better if you put that on there.
>>: It works better?
>> Ben Recht: If you add the Lagrangian term. I mean, you are right. You could have a
Lagrangian multiplier.
>>: Yes.
>> Ben Recht: And that does work better.
>>: So this is just the penalty method?
>> Ben Recht: They are doing the penalty method. We actually, our code actually uses the
Lagrangian.
>>: Yeah, okay.
>> Ben Recht: That is fair. So let me give another example. I could keep you in these examples
all day. What happens if I want to solve an integer programming problem, but now it's a weird
integer programming problem? I'm just saying I have an inverse problem I'd like to solve, and I
know the solution is all plus or minus ones, so maybe it's a multi-knapsack type problem. In
that case, the corners here are the integer solutions. I shade in the convex hull; I get a ball, the
unit ball of the infinity norm in this case, and again, I have the exact same picture as before.
Blow that thing up and it hits the affine subspace, on a corner in this case. Again, you analyze
this structure and you get the results of Donoho and Tanner, and of Mangasarian and myself;
we were analyzing this for the case of the multi-knapsack problem. So in all of these cases
what do we have? We have a
model, this X thing that we would like to fit. We decided that we have some reasonable notion
of simple models, and we would like to construct X as a short sum of those simple models. The
goal is just: how do we actually find the shortest decomposition? In all three of these cases,
what we were doing, though it might not have been clear, was minimizing the sum of the
absolute values of the coefficients subject to this decomposition holding, because that's all it
means to be blowing up the convex hull. You're just trying to find the smallest sum of
coefficients that touches the affine space that we are trying to search through. Now, this looks
like it is just the L1 norm, but the only difference is that this set of atoms need not be a discrete
set. In the case of the nuclear norm ball, the set of atoms were those circles; that is a manifold,
or a union of manifolds. And really these atoms can be anything. Let me just give you
a couple of more examples where these atoms might not be quite as clearly just a basis. So for
example, union of subspace models have been very popular in a variety of contexts lately. In
machine learning they've been used mostly for things like filling in image patches, where you
have some kind of hierarchical structure that you would like to build upon; Francis Bach's group
has been doing a lot of stuff on that. In wavelet theory you kind of have these subspaces where
you know that basically if one of the leaves is active in a wavelet tree, the whole path up, all of
the ancestors, should also be active in the wavelet decomposition. And in multipath problems
where you are trying to do de-noising, you have some known signal and then copies of that
known signal, so you have a union of a bunch of one-dimensional subspaces. So in this case we
just have a bunch of subspaces. The atoms are the
unit balls in each of the subspaces, take their union, and in this case it is not a discrete set, but
if you look at the recent work by Francis Bach, the algorithm that they use does end up actually
corresponding to that atomic norm I wrote on the previous page where you just do this blowing
up of the convex whole.
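[Editor's note: to make the "smallest sum of coefficients" definition concrete, here is a tiny pure-Python sketch, not from the talk, that brute-forces the atomic norm in the plane for a finite, centrosymmetric atom set. In two dimensions any optimal decomposition uses at most two atoms, so checking single atoms and pairs suffices; the function name is made up for illustration.]

```python
from itertools import combinations
import math

def atomic_norm_2d(x, atoms):
    """Brute-force atomic norm in R^2: the smallest sum of nonnegative
    coefficients c_i with sum_i c_i * a_i == x, over the given atoms.
    In 2D it suffices to check single atoms and pairs of atoms."""
    best = math.inf
    # Single atom: x = c * a with c >= 0.
    for (a1, a2) in atoms:
        for c in ([x[0] / a1] if a1 else []) + ([x[1] / a2] if a2 else []):
            if c >= 0 and abs(c * a1 - x[0]) < 1e-12 and abs(c * a2 - x[1]) < 1e-12:
                best = min(best, c)
    # Pairs of atoms: solve the 2x2 linear system by Cramer's rule.
    for (a, b) in combinations(atoms, 2):
        det = a[0] * b[1] - a[1] * b[0]
        if abs(det) < 1e-12:
            continue  # parallel atoms, no unique decomposition
        c1 = (x[0] * b[1] - x[1] * b[0]) / det
        c2 = (a[0] * x[1] - a[1] * x[0]) / det
        if c1 >= -1e-12 and c2 >= -1e-12:
            best = min(best, c1 + c2)
    return best

# With atoms +/- e_i the atomic norm is exactly the L1 norm:
l1_atoms = [(1, 0), (-1, 0), (0, 1), (0, -1)]
print(atomic_norm_2d((3, -4), l1_atoms))  # 7.0
```

Swapping in the atoms (±1, ±1) makes the convex hull the square [-1, 1]², whose gauge is the L-infinity norm, so the same routine then returns 4.0 for the point (3, -4).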
>>: This is [inaudible].
>> Ben Recht: Not that stuff, not that stuff. As people have noticed, there are 100 algorithms
for union of subspaces. The one that we are talking about--oh, can I write on the board? Okay,
good--maybe everybody will be able to see over here. The one that we are talking
about minimizes, you say X is equal to the sum of VG where G is living in one of the subspaces,
summed over the subspaces, and you take the infimum of the sum of the absolute values, I'm sorry,
the L2 norms of these guys, subject to that equation holding, and this is the norm of X. This was
proposed by Jacob, Obozinski and Vert, I think, was the paper that did this one. Francis has like
eight different ways of doing union of subspaces. That's the one that corresponds to the one
that we studied. I'll tell you later this is the one that we can actually analyze really well, so we
get substantially better bounds than what you can get, well, than what I've seen with
all of the other models, for what it's worth. I did flash this one up. This has been more and
more important since I moved to Wisconsin. Everybody wants to talk about how their football
team is better than everybody else's football team, and so in this case we have some kind of
measurements, some permutation matrices and would like to find maybe good mixtures of
rankings. And in this case the model is where the atoms would be permutation matrices, and if I
look at the convex hull of those guys, it's something called the Birkhoff polytope, which you can
optimize over quite efficiently, and actually more efficiently than treating each permutation
matrix as a basis element. And then, I mean I can keep going. There are lots of other examples
whether they be moment problems or problems involving matrices or problems involving
tensors. In all these cases we can follow our nose and see that we can solve these atomic norm
problems, but they are not L1 minimization. In the case of moment problems, which come up in
system identification and numerical integration and a lot of other things, you end up with
semidefinite programs. In the case of the tensor stuff you're going to get weird alternating least
squares things. Everything is hard in tensors. In cut matrices we end up having, for people who
are familiar with Nati Srebro’s work, you end up with something called the max norm, which is
an approximation to the norm that is induced by cut matrices, which is quite hard. The cut matrix is
simply a matrix that is rank one and all plus or minus one, so you can't actually optimize over
that efficiently, but you can approximate it very well using something called the max norm. So in
all of these cases we actually have to use different technology, but we can do it, and again we
have a crank that we follow once we understand that these are the simple models and then we
want to minimize this convex hull norm. Indeed we have this funny name for it called atomic
norms because I spent too many years working on nuclear norms and I figured this would be
the next one. Then molecular norms will be after that, right [laughter]. And then
we go on from there.
>>: Then we'll go down to quarks and…
>> Ben Recht: Well, that would be the other way, so we were at nucleus so we can go the
opposite way too.
>>: That would be great.
>> Ben Recht: The atomic norm is just the thing that I've been telling you the entire time, and
it's written X sub A, which is the same notion I have on the board here. I have a basic set of atoms,
and what I want to do is define this function, the norm of X with respect to A, which is just going to be the smallest t
such that if I blow it up, it hits X. If I blow up the convex hull it hits X; that's the atomic norm.
It's also called the gauge function, for people who had to suffer through some convex analysis or
functional analysis; that's where you may have seen it. If A has a couple of nice properties,
including that it's full dimensional and centrally symmetric, you end up with a norm, which is this L1
norm looking thing. Again, this A set could be infinite though. So here is my prototype
algorithm, at least the thing that I've been promising you for far too long. The prototype
algorithm is minimizing this atomic norm subject to the data that I've been given. That seems
reasonable, and the question is when does this work, and the second question is how do we
solve it? And these are two reasonable things to ask. So we have this convex hull norm. It's
just, and all you have to think of in your head is that we have that same picture as before. We
pick our atoms, we take their convex hull. We blow up the convex hull until it hits the affine
set, and when does that work? Yes.
>>: The constraints you have [inaudible] inequality constraint and so you assume that your
data is perfect or how do you…
>> Ben Recht: For now. We'll get to, we can, the analysis is slightly different when the data is
not perfect, but let's at least start with perfect and then we'll add noise later. The problem is
once you have noise then there are lots of ways to add it, so we'll talk about two different
ways, but let's just start at least with the simple thing where--because even there, it's still not
clear that we can do this, or that it will work.
>>: [inaudible]?
>>: Yeah, I was going to ask that. What [inaudible] metrics?
>> Ben Recht: That if I have an atom, minus that atom is in there.
>>: [inaudible].
>> Ben Recht: So the two things that you need: one is that the convex hull contains a full
dimensional, like a little small full dimensional Euclidean ball. That's one thing you need to get
a norm, because otherwise it will be flat and you will have these points with infinite norm, so it
has to be able to span the entire space; and the second thing that you need is that if A is in
there then minus A is in there. Most of the time we don't even need that it is a norm. We are
really just using the gauge function property. We're just using the blowing up of the convex
hull, but it is a norm, and it has this kind of L1 norm looking thing, if it's centrosymmetric.
>>: [inaudible] combination is nice because it, you get the [inaudible].
>> Ben Recht: Right. So if you are happy to just add in the negatives of your atoms, for every
atom, just add in the negative ones, like if you have a permutation matrix, you add in the minus
permutation matrix; you could do that, if that makes sense.
>>: So your proofs are going to use that it is actually a norm?
>> Ben Recht: No. It's only going to use that gauge function property. The proof is only, and I
will show you here again. The proof, what's nice about this analysis is the proof is kind of
universal and pretty simple. So we need to get a little bit of math, but it's not so bad. We're
going to find something called a tangent cone. Imagine this is my ball. This is my convex hull
and X is the point I'm looking for and I'm promised it's out there. What I'm going to do is define
this thing called a tangent cone, which is a fancy word, and if you read Tyrrell Rockafellar's
convex analysis book, that's where you will see it, but really all it is is all of the points, all of the
directions that make the norm smaller and that is a cone that goes off to infinity, so it's all of
the directions that make the norm smaller. From looking at this problem when is X the unique
optimal solution of this minimization problem? Well, it is basically, if I am at X, X is feasible so X
starts off being feasible and then if I want to stay feasible I have to move along the null space of
phi. I have to move in the null space of phi, and basically X is going to be the smallest norm
solution if any direction I move in that null space increases the norm. This is all very
tautological and so basically this is the tautological statement. X is the unique minimizer if the
intersection of the cone with the null space of phi equals zero. So you asking were asking about
whether I need to be in norm, so I don't, right? I actually don't even need to be the atomic
norm. This is true for any function. I define all of the directions that make the function smaller
and then X is the unique minimizer if the intersection with the cone of the null space equals
zero, assuming f would have to be come backs; that would be the only thing that we need. So
it's kind of this tautological question and now it just reduces everything to when does a
subspace intersect some cone at zero? So in order to characterize that, to characterize it
somewhat generically, we use something called the mean width, which I think
is a really cool geometric idea that I didn't know about until I started working on this problem.
So we all kind of know what the volume is. It turns out that if I take an object and it is in d
dimensions, and I multiply it by t then the volume is going to increase by t to the d. The mean
width is the same thing except it's the measure that increases by t. If I multiply it by t, it
increases by t. So who knew that existed? The way you define it is you use a little bit of
optimization mumbo-jumbo: the support function is just the maximum value of d · X where X is
restricted to be in our set. So I start with a direction and I just maximize it, and if I look at the
negative direction, it's going to go over here, and that thing is the width of the set along the
direction d; if I project along d, the width is the difference, this thing plus
its negative. Now let's integrate that thing over the sphere, and that's called the mean width,
and it's, up to scaling, what's called an intrinsic volume. I'm going to look at that
cone, and rather than measuring its volume directly, I'm going to measure its mean width, and
the mean width is totally going to dictate how many measurements I'm going to need to
identify models. That is due to this theorem… Yes, please?
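[Editor's note: as a rough numerical companion, my own sketch with made-up helper names: the support function of a finite atom set is just a max of inner products, and averaging it over Gaussian directions gives the Gaussian-complexity version of the width that comes up a few exchanges below, which matches the mean width up to a factor of roughly the square root of the dimension.]

```python
import math
import random

def support(d, atoms):
    # Support function of conv(atoms): max over atoms a of <a, d>.
    return max(sum(ai * di for ai, di in zip(a, d)) for a in atoms)

def gaussian_complexity(atoms, n, trials=500, seed=0):
    # Monte Carlo estimate of E[ max_a <g, a> ] for a standard Gaussian g;
    # up to a factor of about sqrt(n), this is the mean width.
    rng = random.Random(seed)
    return sum(
        support([rng.gauss(0, 1) for _ in range(n)], atoms)
        for _ in range(trials)
    ) / trials

n = 30
# Atoms of the L1 ball: plus/minus the standard basis vectors.
atoms = [tuple(s * (i == j) for j in range(n)) for i in range(n) for s in (1, -1)]
w = gaussian_complexity(atoms, n)
# For the L1 ball this should be near sqrt(2 log n), the E max_i |g_i| rate.
print(round(w, 2), round(math.sqrt(2 * math.log(n)), 2))
```

The estimate sits a little below the sqrt(2 log n) rate for small n, which is the expected behavior of the maximum of n absolute Gaussians.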
>>: Can you give some indication as to what kind of shapes will have large or small mean
width? Is it when they are kind of sausage like they will have large mean width and if they are
like isotropic then they will have small mean widths?
>> Ben Recht: No, if they are isotropic they have high mean width, because we average the
widths; if they are like a line, they will have a very small mean width.
>>: [inaudible] which one?
>> Ben Recht: We're going to average them all.
>>: Average them all?
>> Ben Recht: Yes.
>>: [inaudible]? There is a direction where you always reach the norm. If you go towards the
origin.
>> Ben Recht: That is correct. So how many directions--there is always one direction that you
go towards the origin. That's one direction. How many directions are there? Because what I
want to do is basically come up with conditions under which, if I pick, let's say, a random
subspace, because of course we need randomness as a little bit of a crutch here, but if I pick
some generic subspace, then what is the probability that that direction back to the origin lies in
that generic subspace? If you only have one direction the probability is going to be incredibly
small, and if you have a very narrow set of directions it's going to be very small too.
>>: [inaudible]?
>> Ben Recht: Huh?
>>: The width along that direction is [inaudible]?
>> Ben Recht: Of the cone?
>>: The cone…
>> Ben Recht: Yes. But will the cone intersect unit sphere? No.
>>: Are you looking at the intersection of the cone and the [inaudible]?
>> Ben Recht: That is what I am going to do. That is what I'm going to do, yes.
>>: One other question?
>> Ben Recht: Yes?
>>: Is there a way to think about [inaudible] easy versus hard when they do [inaudible]?
>> Ben Recht: No. Actually I was going to get to that. It's always hard. It's awful, but we have
good ways to bound it. It's really a nasty integral, but we have good ways to bound it and that
is kind of the heart of what we do. Yes?
>>: I am trying to just get some intuition about this [inaudible] definition of width.
>> Ben Recht: Let me go back.
>>: So how is it related to kind of the relationship between the bounded [inaudible] and the
bounded in the convex hull?
>> Ben Recht: Oh, so if you take the size of the John ellipsoid? Is that what you're saying?
>>: The John ellipsoid it's always bounded by the square root of the…
>> Ben Recht: Yeah.
>>: But if you take [inaudible] as opposed to ellipsoids, so now the, you don't have the nice
bound of the John ellipsoid, so it's kind of…
>> Ben Recht: Yeah, yeah.
>>: So it's kind of…
>> Ben Recht: You take the smallest enclosing sphere and then you compare it to the biggest
inscribed sphere.
>>: Yeah.
>> Ben Recht: I don't know. That sounds like, do we know how to characterize that, that
number?
>>: You know, it's [inaudible].
>> Ben Recht: It's a number. It's a number.
>>: [inaudible] asking you…
>> Ben Recht: It's a number. Yeah, yeah, yeah, yeah no, I don't know. That's a cool question.
>>: So this mean width in learning theory we just call it the Gaussian complexity.
>> Ben Recht: Uh oh, yes, that's right [laughter]. That's right. I was going to get there.
>>: Oh, you were going to come…
>> Ben Recht: Yeah, yeah. I was going to get there in the next slide, but that is exactly right. So
in learning theory we call it the Gaussian complexity; that's right. And why is that? It's a little
complicated. So here we are doing the integral over the sphere, and just imagine you put the
expected length of a Gaussian in front; now I have the integral over the sphere times the length of
the Gaussian, and this width becomes the average over all Gaussian directions. That is exactly
the Gaussian complexity, and it turns out, this is not, I don't think it's completely
obvious, but it turns out that there is only one function like this, and it's called an intrinsic volume,
up to a constant, so if it's essentially isotropic over all directions, it's either the Gaussian
complexity or it's the mean width. If you feel like you have a good intuition for Gaussian
complexity, that's all this is. I'm not sure I have a good intuition for Gaussian complexity either.
It's just that I kind of know how to play with it. I know how to bound it [laughter]. And we
know that it's related to all of these facts like it's related to logs of covering numbers and so we
kind of--it has all of these natural geometric characterizations. I don't know about the
one you talked about, though, which is kind of cute: the smallest enclosing sphere, not
ellipsoid, ratio to the largest inscribed sphere.
>>: But that's the margin type of argument, so if you replace this with Gaussian
complexity and then you ask what's the Gaussian complexity of, like, a margin based linear
classifier, that's exactly that, so that will give you the answer. Or maybe [inaudible] the
bounds…
>>: It's also the measurement of how isotropic or not isotropic is your shape right?
>> Ben Recht: Right. Exactly, and especially for cones, if we think about it, I mean if you think
about it, it's never going to be that isotropic. We're going to take a cone and intersect it with
a sphere, so first of all it's going to have this nice origin, this nice unique point, and then we just
have to say how wide, or how much volume, or how much kind of width are we spanning out
here? And it turns out, again, as I was going to say, there is a really nice theorem by Gordon,
which is based, or, this follows also from, if people know Slepian-Gordon, you can kind of prove
this in the same way if you know Slepian-Gordon. If you don't, just trust me on this. This is a
nice theorem. If I assume that I have a random subspace, and I ask what is the
probability that the random subspace intersects some convex cone, then the
probability that it intersects only at the origin will be very high if the
co-dimension of the subspace is bigger than n times the mean width squared. Or, if you
prefer, it's just the Gaussian complexity squared: n times the mean width squared is equal
to the Gaussian complexity squared, if you are more comfortable with Gaussian complexity.
And if we want to think about, let's go back to our inverse problem after this bit of a digression.
For inverse problems, if phi is, say, just a random Gaussian matrix, and it doesn't have to be
necessarily random Gaussian, there are other models, then the number of measurements that
you need is going to be n times the squared width of this tangent cone thing intersected with the
sphere. Now the question is how do we compute these things. Well, you go talk to Over; he's
got tools [laughter]. You go ask people who know how to do things with Gaussian complexity.
We have a trick; the method we proposed to bound these things in the paper uses a trick from
convex duality that actually makes it pretty easy, but only for these problems that come from
inverse problems. If you note that the dual of the support function is the distance function, that is
the trick that we use, and it turns out to make some of our computations pretty easy. Before I
say what the consequence of those computations are let me just say that the width also
governs noisy recovery in some sense, so let's do one simple noise model rather than doing like
a stochastic noise model, let's do a worst-case noise model. Let's assume I see not only, I don't
see phi X; I see phi X plus some disturbance and I am going to bound the disturbance in L2. In
that case maybe I would do the second order cone problem or maybe I would do a lasso like
looking problem or a penalty type method problem, and in this case we actually have another
bound which says that let's say I take the optimal solution of this guy and then I want to
compare it to the true thing that I am looking for, so I am going to look at the difference
between X minus the X hat. Then that is less than two Delta over epsilon. Remember Delta is
the norm of the disturbance and epsilon is this little fudge factor that we put in the
denominator, so as epsilon goes close to one this is going to blow up, but if we just pick
epsilon equal to a half we are within a constant of where we were before, so we get robust
recovery after a constant factor more samples than what you need to get exact recovery.
We can analyze this with stochastic models as well, but I'm
almost out of time, so we're not going to talk about that today.
>> John Platt: You can keep going.
>> Ben Recht: Well, you guys kick me out when I'm done.
>>: So can we look at this--I mean I'm just having a little bit of difficulty processing this
[inaudible] stuff because it just doesn't look like the standard way I'm used to. So like for one
of these concrete, one of the things that are really a motivating problem, do you get the same…
>> Ben Recht: Check this out. This is really cool. And we could go through this. If you guys
want me to pull the screen up, I can derive all of them for you if you want. They are actually
very easy. This is actually crazy; we do this duality trick, and the duality trick is nothing more
than saying if I want to know what the width of the cone is, the width of the cone is actually
equal to, or sorry, is upper bounded by the expected distance to what is called the polar cone. So this
mean width is upper bounded by the distance to the polar cone. If I want to
compute the expected value of a distance to some set, I could just pick a point in the set, and
that's what we do. So we pick a point in the set and we come up with some ansatzes,
and we try to be clever, and then with very simple arguments, for example, for the
hypercube we get exactly the rate that is known. And this is a little weird rate. It actually
makes you realize that recovering a corner of the hypercube actually takes a lot of
measurements, and right, that makes sense, because just for me to transmit it to you in
some efficient way I need n over two numbers. What I would do is tell you whether
I'm going to list the positives or the negatives, and then I would tell you
their locations, so even just to encode that I need n over two.
>>: So the width [inaudible] the width of the scales is dimensionality and so you [inaudible]
width squared you get stuff [inaudible]'s scales [inaudible] n cubed.
>> Ben Recht: That's what's weird. The width squared scales like the square root of n.
>>: The square root of n?
>> Ben Recht: Yeah. It's kind of a bizarre, it's kind of a bizarre thing. The width of the sphere is
root n. It's just like the mean norm of a Gaussian, so it's the square root of n.
>>: So you would expect a lot of these [inaudible] n squared.
>> Ben Recht: No it's n.
>>: But isn't there an n times the width squared…
>> Ben Recht: Sorry, sorry, sorry. I'm sorry. The n times the width, sorry, the--let's go
back. This thing, right? N times the width squared, basically the biggest that is going to be is n.
The width of the sphere, if I put the width of the sphere in, I get
one times n. Sorry, sorry. I was putting those two things together. And remember--what were
you thinking?
>>: No, no. Confused.
>> Ben Recht: Okay? Okay. I am too now. We are all confused. Does that make sense?
>>: I was expecting super linear sample of what comes out of this because the width squared…
>> Ben Recht: Let me just tell you the punch line. Let's go to the punch line. Sparse vectors,
we do the same thing, and again, in a couple of lines; it's not hard. And again, if we want to
stick around I can actually show you how to prove it for the sparse vectors. You've got 2s log n
over s plus 5s over 4. Now why do I put in these constants? I put in these constants because
too many people work in compressed sensing and they yell at me if the constants aren't there.
These are the best [inaudible] constants that I know of for compressed sensing with
Gaussian matrices. Even the Donoho and Tanner result, which actually gets this 2, is only in an
asymptotic sense. This is not asymptotic, and to compute this you basically just have to
remember how to integrate the Gaussian Q function, which is something I always forget how to
do. That's all we have to do is estimate the Q function.
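[Editor's note: the quoted sparse-recovery count is easy to tabulate; this is a trivial sketch of the formula exactly as stated in the talk, 2s log(n/s) + 5s/4, with a hypothetical function name.]

```python
import math

def sparse_gaussian_samples(s, n):
    # Gaussian measurements quoted for exact recovery of an s-sparse
    # vector in R^n via L1 minimization: 2 s log(n/s) + 5 s / 4.
    return 2 * s * math.log(n / s) + 5 * s / 4

for s in (5, 20, 50):
    print(s, round(sparse_gaussian_samples(s, 1000)))
```

For s = 5 and n = 1000 this comes to roughly 59 measurements, far below the ambient dimension.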
>>: [inaudible] positive there's a structural solution?
>> Ben Recht: Yeah, the model, yeah. That's correct.
>>: [inaudible] assumptions.
>> Ben Recht: Correct. We always assume that there is some structure. Similarly, if we have a
union of subspace model, one of these ones that we were just discussing, and we're
using this norm, and let's say there are m subspaces or m groups and I'm going to do group
lasso, and the largest size of a group is going to be b, and then we have k active groups. The
sample complexity grows like this, and it's not the most elegant expression, but actually if we
think about limits it makes sense. In the limit where the block size is much smaller than the log
of the number of groups, this is just k times 2 log m. This is what we have over here, up to that
we don't have the divide by k; but that's okay. And then in the limit where the block size is huge
and there aren't that many groups, we just get k times b, and that is the number of parameters. We
have this kind of interpolation between those two. Yeah?
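[Editor's note: for disjoint groups the infimal decomposition in the board norm is attained by the natural split, so it reduces to the familiar group lasso penalty, the sum of the per-group L2 norms. A two-line sketch with my own toy data:]

```python
import math

def group_norm(x, groups):
    # Group lasso penalty for *disjoint* groups: with non-overlapping
    # groups the infimal decomposition is the natural one, so the norm
    # is just the sum of the L2 norms of the group blocks.
    return sum(math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups)

x = [3.0, 4.0, 0.0, 0.0, 5.0]
print(group_norm(x, [[0, 1], [2, 3], [4]]))  # 5 + 0 + 5 = 10.0
```

With overlapping groups (the latent group lasso setting) the infimum over decompositions genuinely matters, and a simple per-group sum like this is no longer the right formula.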
>>: Another question, so your [inaudible], it seems you make this assumption that [inaudible]
matrices from a Gaussian random matrix…
>> Ben Recht: Yeah these are all for Gaussian random matrices.
>>: Okay, so I mean in machine learning examples, for example the 5 is given, so like the
amount [inaudible] instead of compressing, so is this the [inaudible] price?
>> Ben Recht: Yes. We don't get these numbers anymore; I mean obviously you don't get
these numbers anymore, but we can actually handle that. The reason I was bringing this up
again is to say that if you can compute the mean width, it tells you what happens for Gaussian matrices,
or essentially for generic matrices. To go into details as to what happens when you want to do
the deterministic design, you have to do something else, but again, we can just use the same
machinery and get at least a generic crank for how you compute those things. And in that case
the rate is controlled both by how your phi matrix basically scales with respect to all of the
models, all of the atoms, and by the Gaussian complexity of, basically, the
complexity of the noise and the dual norm. But we can talk more about that at some point. I'm
almost, I am out of time, but you guys just tell me when you want me to stop.
>>: [inaudible] [laughter].
>> Ben Recht: I'll keep going. Just one more. This one is really cool. So I take low-rank
matrices, and again Gaussian models, so maybe not the realistic thing as [inaudible] was
pointing out, but still, let's say Gaussian model. Low-rank matrices, n1 by n2 matrices.
We want to recover them using this nuclear norm heuristic thing. The rank is r and the number of
parameters you need, the number of measurements you need, is 3 times r times n1 plus n2 minus
r. Again, I'm being very specific about the form, and the reason I'm being very specific about the
form is that this is the number of numbers you need to write down the singular value
decomposition. This is the number of parameters in the singular value decomposition. You've
just got to think about how many parameters there are in the two orthogonal matrices and
then you have the r singular values.
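[Editor's note: the parameter count is worth sanity-checking. An n1-by-r orthonormal factor has n1·r − r(r+1)/2 free parameters, likewise for the n2-by-r factor, plus r singular values, which sums to r(n1 + n2 − r); the quoted measurement bound is three times that. A trivial sketch with hypothetical names:]

```python
def svd_dof(n1, n2, r):
    # Free parameters in the thin SVD of a rank-r n1 x n2 matrix:
    # (n1*r - r*(r+1)//2) + (n2*r - r*(r+1)//2) + r == r*(n1 + n2 - r)
    return r * (n1 + n2 - r)

def nuclear_norm_samples(n1, n2, r):
    # Gaussian measurement count quoted in the talk: 3 r (n1 + n2 - r).
    return 3 * svd_dof(n1, n2, r)

print(svd_dof(100, 100, 5), nuclear_norm_samples(100, 100, 5))  # 975 2925
```

So a rank-5 matrix with 10,000 entries is pinned down by under 3,000 generic measurements, within a factor of three of its intrinsic dimension.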
>>: So you are saying [inaudible].
>> Ben Recht: Well, up to the three. Sorry, up to the three, so it is within a factor of three,
yeah, of the best you could do, which is pretty cool. And again, you can go and get crazy with
integrals, and we can actually do this just for generic cones, and you get a nice bound, and this is
actually why it's good to just do the analysis for the Gaussian case, to see: is the scaling
reasonable in general? I think the answer is yes, because if I have a
general cone and I have this polar cone C*. The polar cone is basically all the guys who have a
negative dot product with the cone. That's all the polar cone is, so if you, yeah, see if you try,
[laughter] that's exactly the picture; so if you have a cone, the polar cone kind of tends to look
like this, kind of shoots off the back there. Essentially if the cone itself is narrow, the polar cone
is going to be wide and essentially we get something in terms of the surface volume of the polar
cone intersecting the sphere. It looks a little wacky but if you look at the corollary and let's say
we have a polytope and I want to use the, now instead of having some bizarre atomic set, I
really just have vertices; I have some polytope. And I'm going to assume it's what's called vertex
transitive, which basically means that it is just a symmetric object. If I look at all of the automorphisms
of the polytope, all of the rotations that send it back to itself, I can send any vertex
to any other vertex. So the permutation matrices are an example. The cut matrices are an
example. In this case the number of measurements you need is like nine times the log of the
number of vertices. I don't know if that nine is real, but it's nine times the log of the number of
vertices.
>>: [inaudible] cone [inaudible] sharp edges only if, pretty much only if you're [inaudible]?
>> Ben Recht: So it's not quite true, because even those low-rank guys have sharp, you can
get sharp cones. They are partially sharp and then partially smooth. You get these bizarre
combinations of the two.
>>: Okay. But if the set of [inaudible] is finite it's going to be like that? The convex [inaudible].
>> Ben Recht: Correct, correct. So what happens here? Keep going. For permutation matrices,
n log n—huh?
>>: [inaudible] converse is lucky.
>> Ben Recht: The converse is lucky, exactly.
>>: [inaudible] likely.
>> Ben Recht: You think it's likely?
>>: If you have an infinite set of [inaudible]…
>> Ben Recht: That the cone will actually not be…
>>: The cone is likely not to have only sharp edges.
>> Ben Recht: Only sharp edges.
>>: [inaudible]?
>> Ben Recht: I think the converse is true, actually, now that I am thinking about it. Lynn will
have to help me. If all of the cones, if all of the normal cones are polyhedral cones, then the
body is probably a polytope, right? I look at all of the vertices and I take all of the extreme points, and I'm in.
>>: Yes. But it doesn't take [inaudible]. [multiple speakers].
>>: [inaudible] finite.
>>: [inaudible] subset within the…
>> Ben Recht: Yes. Sure. Sure.
>>: [inaudible].
>> Ben Recht: Sure. I see.
>>: [inaudible] add vertex [inaudible]…
>> Ben Recht: Yeah, once you add vertex transitive, then I think we are done. So this basically
means that every atom can be mapped to every other atom just by some rigid
transformation of the convex hull, so at that point they are just vertices. Sure. Cool. Oh, good.
>>: Algorithms.
>> Ben Recht: I was always trying to predict what's coming next. All right, that's it, algorithms.
Let me… I have a lot of stuff on algorithms, but I'm just going to do just a quick two slides of
highlights and then we can skip the, we can talk later or if you guys want to hang out I can tell
you more about what we're doing algorithm wise. I did want to say at least what is also nice
about this framework is we have a way of predicting measurements. I didn't tell you how to
deal with the stochastic noise in kind of a lasso model like this. But if you want to stick around I
can tell you about how we do that too, and this is a more recent development, but what I want
to say is what is the algorithm. I already hinted at with the algorithm is. It's always the same.
It's always the same and it's basically this you run that iteration. You run this six-point iteration,
and fine, maybe we have to change A to sometimes the learning rate and fine, maybe we are
going to do this approximately and fine, maybe we're not going to use all of phi, but we are
going to do stochastic samples, but in some sense it is always this. The easy part is this internal
residual which is just saying how bad, I compute this thing saying how bad is my current
prediction to the thing I want to predict, and then I have an adjoint operator which maybe if it's
not least squares it will be some kind of nonlinear thing that will happen here, but it will be
something along these lines. I have some steff [phonetic] size. The only funky thing is this pi
operator which I'm just calling generically the shrinkage operator. This is really the only thing
you have to change if you change the prior.
>>: If you change the norm.
>> Ben Recht: If you change the norm, you change your structure. And this one is the proximal
operator of the norm, exactly. And that is the only thing you have to change. That is the thing
that you have to understand how to solve, and this is the thing, this is nice for us as algorithm
designers because we can forget about the other part for now, because we've all studied the
other part pretty well, and we can now apply this stuff where you guys have looked at lasso
problems. Well, anything that works for lasso problems should work here as long as we can
compute the proximal operator. You've just got to be able to compute the proximal operator.
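[Editor's note: as a concrete instance of this recipe, here is a minimal pure-Python sketch, mine; the sizes, seed, and lambda are made up, of the fixed-point iteration for the L1 case, i.e. ISTA for a small lasso problem: a gradient step on the residual through the adjoint, followed by the soft-thresholding shrinkage operator.]

```python
import random

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matvec_t(A, r):
    # Adjoint (transpose) applied to a residual vector.
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]

def shrink(v, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1.
    return [max(abs(vi) - t, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]

def lasso_obj(A, y, x, lam):
    r = [a - b for a, b in zip(matvec(A, x), y)]
    return 0.5 * sum(v * v for v in r) + lam * sum(abs(v) for v in x)

def ista(A, y, lam, iters=20000):
    # Fixed-point iteration: x <- shrink(x - step * A^T (A x - y), step * lam),
    # with step = 1 / ||A||_F^2, a conservative but always-valid step size.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        grad = matvec_t(A, [a - b for a, b in zip(matvec(A, x), y)])
        x = shrink([xi - step * gi for xi, gi in zip(x, grad)], step * lam)
    return x

rng = random.Random(0)
m, n = 6, 10
A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]
x_true = [0.0] * n
x_true[2], x_true[7] = 1.5, -1.0        # a planted 2-sparse vector
y = matvec(A, x_true)                   # noiseless measurements
x_hat = ista(A, y, lam=1e-3)
print([round(v, 2) for v in x_hat])
```

With enough Gaussian measurements relative to the sparsity this typically lands very close to the planted sparse vector; in any case it converges to the lasso minimizer, and only the `shrink` step changes if you swap in another atomic norm.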
Now of course, how do we do that? Either this thing will be easy to compute, in which case we are
lucky. So in the case of L1 we know how to do it. In the case of the nuclear norm we know how to
do it. In the case of that factored version of the nuclear norm we know how to do it, that weird
[inaudible] type thing. Sometimes we don't know how to do it. In that case what we can rely
on is relaxations of various kinds. So the first one is just to say, if I take the set that I start with
and I put it in a bigger set, that means that the bigger norm will be smaller than the norm I
started with. That makes sense. Like the L1 ball: the L1 norm is bigger than the L2 norm, for
example. And so basically if we have some way of embedding the atomic set in a bigger set,
then I can at least solve the big one, and then hopefully I didn't blow it up by too much and this
will work okay. So we have a hierarchy of
ways of doing this based on semidefinite relaxations. They have a fancy name, theta
bodies, because they are semidefinite relaxations of convex hulls, but for these basically we can
actually go and get tighter and tighter bounds on these atomic norms for very, very general
structures. Yes?
>>: If you had the case where the convex hull [inaudible] so your cone is going to be [inaudible]
and if [inaudible] is going to be [inaudible] and your next [inaudible] you [inaudible] in that
case.
>> Ben Recht: Okay, one more time, one more time I don't think I got that.
>>: Suppose you [inaudible] X on your [inaudible]. And you look at the cone and the cone
intercepts a specific model X [inaudible] direction and you can take the [inaudible]?
>> Ben Recht: Yes, yes, yes, absolutely.
>>: So that the shrinkage operator should always be solvable within NP.
>> Ben Recht: Only if it's polyhedral. If it's polyhedral, then absolutely. Absolutely it is
solvable by an [inaudible]. Absolutely. And moreover the algorithm you are talking about is the
homotopy algorithm. It is the simplex algorithm, the parametric simplex algorithm. It's what
people do; this [inaudible] has been invented many times for L1. If this is the L1 norm, you just
start at a vertex, you find your descent direction, you follow a piecewise-linear path for a while
until the active face shifts, and then you switch faces. That thing is called, it has a lot of
different names.
>>: LARS.
>> Ben Recht: The reason why I don't say LARS is that that is actually not the right way of
doing it [laughter]. LARS was invented by, I mean, arguably brilliant statisticians at Stanford,
but they didn't know linear programming, and actually the first instance of this was pointed
out to me by Lieven Vandenberghe: that algorithm for L1 was invented by Wolfe in 1956.
>>: Frank-Wolfe?
>> Ben Recht: It's not Frank-Wolfe though. It's this parametric simplex algorithm. And what's
cool about that algorithm, what's also nice about it, is that if this thing is polyhedral it
essentially allows you to compute this optimization problem for every value of mu.
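As an aside on the "every value of mu" point: for an orthogonal design the whole solution path really is piecewise linear in mu, which you can see directly from soft thresholding. A toy scalar sketch (my own example, not from the talk):

```python
def soft_threshold(v, t):
    # The scalar lasso solution: argmin_x 0.5*(x - v)^2 + t*|x|.
    return max(abs(v) - t, 0.0) * (1 if v > 0 else -1)

# For A = identity, each coordinate of the lasso solution is soft_threshold(b_i, mu):
# a piecewise-linear path in mu that slides toward zero and hits it at mu = |b_i|.
path = [(mu / 10.0, soft_threshold(2.0, mu / 10.0)) for mu in range(0, 31, 5)]
```

The path has exactly one breakpoint (at mu = 2.0 here); a parametric simplex method traces the analogous breakpoints for general polyhedral atomic norms.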
>>: [inaudible].
>> Ben Recht: Then we are out of luck [laughter]. Then we do this [laughter].
>>: Going back to the [inaudible] I'm missing something. So you are saying that you can't
compute this, you can't compute [inaudible] operator because it's too hard for your A, so you
just substitute your A?
>> Ben Recht: Yeah. I'm just going to make a different norm. Yeah. I just make a different
norm. I substitute A with maybe something a little bit bigger. That gives me an outer
approximation and it turns out if I only want to go with an outer approximation, let me tell you
two things we can do. I don't think I can solve the second one but I did solve the first one.
The first thing you can do is just approximate the convex hull from the outside, and basically
we have a way of doing that using semi-definite programming. If you apply this to cut
matrices, these rank-one sign matrices, and you apply this machinery, you get this thing called
the max norm, which is something that Nati Srebro really popularized and that we have been
working off of for a while. Now the second thing that you can do, which is
actually something I have been very fond of lately, and I know I didn't put this slide in but I will
just tell you what this is. You can get an approximation from the inside as well. You just take a
subset. In fact, you just grid your atoms. Now sometimes that doesn't work, but sometimes
that works just fine and sometimes that actually ends up giving you an excellent approximation.
In that case you have an LP. Yes?
>>: The regional norm if it's a [inaudible] means that you'll get the optimum on one of the
edges, right?
>> Ben Recht: Yes.
>>: And this will gain you the sparsity and these nice properties that you are interested in.
Once you've found it usually it's going to be kind of a smooth body because this is easier to
work [inaudible] but then you lose exactly the property that you wanted to achieve to begin
with.
>> Ben Recht: Yes.
>>: [inaudible] [multiple speakers].
>>: It's like solving integer programming where by…
>> Ben Recht: Oh yeah.
>>: And then you round, which never worked well…
>>: In terms you lose the whole thing.
>> Ben Recht: We don't, it doesn't work. Okay there are two different, yes, correct.
>>: That's what I was very confused by.
>>: Because what you're saying is that you can always take the bounding [inaudible].
>>: Yep. And then you're back to L2. You are solving [inaudible] again that seems kind of bad.
>>: You lose all of the structure.
>> Ben Recht: You lose all of the structure. So here's what I can tell you and this is actually kind
of weird. So you are right that this seems to be erasing the structure. This picture sucks.
>>: No, it's actually good, because…
>> Ben Recht: Well…
>>: [inaudible].
>> Ben Recht: But, but…
>>: This is a problem, you know, it sucks for you because… [laughter].
>> Ben Recht: Let me tell you what actually happened, because this is fascinating to me. Let's
say this [inaudible] problem is the one that I've looked at. You start with your atomic set A,
and this [inaudible] operator is NP-hard to compute. We do this semi-definite relaxation on A,
and you ask what happens to the number of measurements required. It turns out, first of all,
the vertices are still exposed. Number two, to get exact recovery the number of measurements
blows up by a factor of two. But your computation time goes from exponential to reasonable.
That's wild. That is totally crazy.
>>: [inaudible].
>> Ben Recht: Huh?
>>: [inaudible].
>> Ben Recht: Well, we do, I do another one of these factorization hacks and then I run some
gradient descent and then if you want, we can solve very quickly.
>>: [inaudible].
>>: I'm still confused because…
>>: [inaudible] the way to phrase it is to begin with you said my problem is that I have too
many free variables. I have too few measurements so I have to make some structural
assumptions.
>> Ben Recht: Exactly.
>>: Now what you say is look if I make this [inaudible] assumption [inaudible].
>> Ben Recht: That's correct.
>>: Maybe in fact I can't compute this norm, right?
>> Ben Recht: Right.
>>: So maybe it falls down to where I can't compute this norm.
>> Ben Recht: So I add more hypotheses…
>>: Instead you say okay, I will bound it. But actually you can take this, roll it back and say
okay, I'm simply making a different structural assumption.
>>: A weaker one.
>>: It's a different one, doesn't matter, right? Then, and this is actually what you're doing.
Instead of making say a sparsity assumption, right, a true sparsity assumption is something that
we can work with because as zero. The…
>> Ben Recht: No, no, I agree with you 100%. I don't think that sparsity versus L1 is exactly
what you said, but basically another way to say it is: we start with a set of atoms as candidate
hypotheses, I now know that I can't solve it, so I add more hypotheses.
>>: Right.
>> Ben Recht: Okay, fine, I know that they are not there…
>>: So you can say one way to put it is yeah, I compute the norm differently, but actually you
can roll it back and make it more explicit by saying actually this is my structural assumption. My
structural assumption is now different. It's not that…
>>: Use A2 instead of A1.
>>: Yeah.
>> Ben Recht: Yeah.
>>: [inaudible].
>> Ben Recht: Yeah.
>>: But then I still don't know how to go A2 to A1, if you really care about A1.
>> Ben Recht: But let's say, the reason I don't like it…
>>: [inaudible] arbitrary. A1 was you just wanted sparsity, right, because yeah, it looks fine.
You know that you need to make some structural assumption so maybe this structural
assumption is hard to work with. Here is a different one.
>>: Yeah, which is a superset.
>> Ben Recht: A superset yeah.
>>: In other words…
>>: It still happens to be superset.
>> Ben Recht: Yeah, it could be; a subset would be easier, too, if you get lucky, a subset could
be easier. Like for example, I could say like it's this one [laughter] that's a better assumption.
>>: You know….
>> Ben Recht: No, really that's actually really good because you are right. It's changing
basically what this, okay, I like that. What this hierarchy is doing is changing the structural
assumption. The only reason why I don't want to go completely in on it is because I never
understand what, I mean I very rarely understand what the things that come out of this
machinery actually are. What structural assumption do I have? I don't know.
>>: But this is kind of, you know, when machine learning became popular to have
regularization terms which is kind of an arbitrary thing to, I mean we don't understand what it
does; it just happens to be working, but if you are capable of rolling it back and explicitly saying
that this is my assumption, this is very powerful, because now I look at my problem and I can
say: does this assumption look reasonable for the problem that I am having, or is it just hidden
in some unknown bound that is just a guess?
>> Ben Recht: No, no. So the thing we can characterize how much the, I mean oftentimes we
can characterize how these change.
>>: [inaudible] do you want [inaudible]?
>> Ben Recht: Do I want my million dollars? [laughter]. They gave it away, man. I know, I
didn't quite get there. I see.
>>: I'm sorry, so I'm still a little bit lost.
>> Ben Recht: Me too, now.
>>: [inaudible] huge rounding [inaudible]…
>> Ben Recht: No, no this is the point. What's your name?
>>: Ronnie.
>> Ben Recht: Ronnie, so like what Ronnie was saying is quite correct. We are just changing
the structural assumption. You don't have to round.
>>: [inaudible] bodies or anything you just throw away.
>>: [inaudible] to say where it's connected, you just something you say there is some relation
[inaudible] so one is strictly bigger than the other one is strictly smaller, or sometimes that we
can even give a constant bound on the difference which is fine but actually you don't need that.
>> Ben Recht: But the thing that I can guarantee…
>>: [inaudible] the problem is that you can't.
>> Ben Recht: But Ronnie the one thing that I will say is that the thing that is nice about the
theta bodies is that they are guaranteed to keep the things on the outside of the--so it's not like
a bad thing to be. A bad thing would be, I'm trying to think of a good example. You can
imagine you have, I don't know if I'll be able to draw this picture. Yeah, okay, I can. Okay, we
are on the unit sphere and we take the, this is not quite nice. No, that's not quite it.
>>: [inaudible] portal [inaudible].
>> Ben Recht: Yeah, it's just that the nice thing about these things is that you're guaranteed
that the points you started with, the atoms, are still extreme points.
>>: Yeah but, yeah, it's nice in cases where you can do that, but actually nothing--for example,
in the random walks when you run the walks sometimes these points are [inaudible] points to
work with because exact, they tend to be narrow and stuff like that, so sometimes they take
into the section of this shape [inaudible] which kind of makes it isotropic and stuff and we can
say that for one reason or another I want to chop off these edges. Why? Because. But since
the choice of the structure is arbitrary, why low rank. Yeah, it could have been anything. When
you make a low rank assumption is just because it seems nice and we hope that we can work
with it, but there is no underlying truth of nature that says that…
>> Ben Recht: Uh oh [laughter], we were having this conversation earlier.
>>: But that's the wrong example. [inaudible].
>>: Because [inaudible] is compute one last. Here the [inaudible] is not something [inaudible]
you think about.
>>: [inaudible] to see if A2, so actually I say I want to work with A1. I can't because of
computational constraints, so I work with A2.
>>: So you are throwing away A1 entirely [inaudible]. That's the only question. Are we throwing
away elementary [inaudible] incomprehensible gulf that happens when you…
>> Ben Recht: I have it on the next slide. Sh**.
>>: Described it to me in simple--I am just a poor CS person. What's going on?
>> Ben Recht: Come on, I'm on to you. This is the problem. I am on to you. Where is my--there
we go. Here's how it works.
>>: Okay.
>> Ben Recht: It's actually a cute trick. So let's actually go into what the details are. We have
to assume something a little bit wacky. It's not that wacky. We assume that A is algebraic. All
of the examples I gave actually are. And what is an algebraic variety? It's a fancy way of saying
it's the zero set of some set of polynomials.
>>: [inaudible] polynomials?
>> Ben Recht: Yes. So the f's live in something called an ideal, but all I have to think of is that
it's the zero set of some set of polynomials. For example, if I want to look at the set of all
sparsity-one unit-norm vectors, they are the vanishing points of polynomials: you construct the
pairwise products x1 x2, x1 x3, and so on up through xd, okay? The sparsity-one vectors are
the zeros of those, together with the zeros of the sum of the squares equals one. That's the
variety that gives rise to these sparsity-one guys. And you could do the same thing for all of these
models. That's really not the important part. But that's the thing that allows us to do
computation. The important part is this: if we can compute the primal norm, then we can
compute the dual norm, and this is kind of important. The shrinkage operator for the primal
norm is actually just projection onto the unit ball of the dual norm. Those two things are
completely the same. So you just have to be
able to do one or the other. So let's look at the dual norm. The dual norm has a cool form. In
the first case you had to blow up the convex hull; in the second case we just take each atom,
take the inner product, and take the max over atoms. And it's too bad that Ofer [phonetic] had
to leave. Oh, look, he's back [laughter]. It looks like a Gaussian complexity: if I put Gaussian
noise in here, this dual norm is just the Gaussian complexity of the set A, so this is a thing
that we know. So now we want to say: when is this thing less than tau? That's true if and only
if the inner product of v with each atom a equals tau minus q(a), where q is nonnegative
everywhere on the set A. I mean, that makes sense; sure, that's just saying it's less than.
What is this function q? Well, I'm going to
decompose it. I'm going to pick a form for q, and picking a form for q is basically picking a
form for this linear function. I have a linear function and I'm going to parameterize it in two
parts. The first part is h, which has to be positive everywhere, for all x. For the second part I
take g to be something in this set of polynomials, or actually the span of that set of
polynomials and products of arbitrary things with that set of polynomials; I can do a lot of
wacky things. I just want g to be a polynomial that annihilates all of the atoms. So g has to be
zero on all of the atoms and h has to be positive everywhere; there
is a parameterization. It turns out that for algebraic things this is a tight parameterization;
this is actually an if and only if for algebraic varieties. In general, this is a nice way to do an
approximation, though, and it tells you how to make more approximations. I just
kind of have to come up with a good parameterization for linear functions in terms of positive
functions and things that vanish on the set of atoms. And so it's really designed to be very
catered to the set of atoms that you started with. So the relaxation, again the theta bodies,
comes from this: instead of saying that h has to be a positive function, we say that h has to be
a sum of squares of polynomials. It turns out that once h is a sum of squares and g is in the
ideal, the whole problem of computing the dual norm, or rather this approximation of it, is a
semi-definite program. That's the mumbo-jumbo, but it is kind of concrete mumbo-jumbo that
we can solve sometimes, and I know this guy [laughter], we have tricks for doing these SDPs,
and that's basically how we do it, and it always gives us a lower bound. And yes, there is a
nice survey of these things; actually Rekha [phonetic] is just across the bridge, so you can go
tell her to come over here and we can talk about theta bodies at some point because she
would do a better job than me. So that's kind of
one way to make these approximations.
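When the atomic set is finite, or has been gridded as in the inner approximation mentioned earlier, the dual norm is literally a maximum of inner products over the atoms; for the sparsity-one atoms above it reduces to the L-infinity norm. A brute-force sketch (function names are mine):

```python
def dual_atomic_norm(v, atoms):
    # Dual norm: the largest inner product of v with any atom in the set.
    return max(sum(vi * ai for vi, ai in zip(v, a)) for a in atoms)

def sparsity_one_atoms(d):
    # Atoms for 1-sparse unit vectors: plus and minus each standard basis vector.
    atoms = []
    for i in range(d):
        e = [0.0] * d
        e[i] = 1.0
        atoms.append(e)
        atoms.append([-x for x in e])
    return atoms
```

The SDP machinery in the talk is what replaces this brute-force maximum when the atomic set is infinite or exponentially large.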
>>: So you make a nested family of SDPs that hopefully get you closer and closer and closer.
>> Ben Recht: Yeah, and it works for the problems where we have been able to do it. I mean,
it's not always doable, but it's doable for that cut norm problem, for these cut matrix
problems, and it works well. It turns out that it ends up working better than the nuclear norm
on a lot of different problems that we've tried, so we actually have some reason to think that
if you are looking for sums of clusters, or things with bounded dynamic range, the max norm is
a better thing to use. And all you have to do to change the code, that factorization heuristic, is
this: instead of doing a shrinkage step, you take a step along the gradient just the same way
as we did before in the low-rank case, and then you guarantee that the norms of the factors
are less than some bound. You just squish the norms, and that's it, a very minor adjustment.
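One way to read the "squish the norm" step, under my reading of the max-norm adjustment (the max norm bounds the row norms of the factors), is a row-wise clipping applied after each gradient step. A sketch with my own naming:

```python
import math

def clip_row_norms(U, bound):
    # Rescale any factor row whose Euclidean norm exceeds `bound`
    # back onto the ball of radius `bound`; leave the other rows untouched.
    out = []
    for row in U:
        n = math.sqrt(sum(x * x for x in row))
        out.append([x * bound / n for x in row] if n > bound else list(row))
    return out
```

In the factored gradient heuristic this would be called on each factor after its gradient step, replacing the shrinkage step used in the nuclear-norm version.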
>>: I have a simple question. This reminds me that sometimes the proximal [inaudible] with
respect to A is hard to compute, but it could be the case that the dual norm on A is very
simple. So…
>> Ben Recht: Well, if the dual norm is easy then the proximal operator is easy. I have that one
here. Wait, wait. Hold on a second. That should be right above this. Oh, it's not there. Man,
sometimes you want these slides to be there and then they just disappear. Anyway, the dual
norm and the proximal--so we look at the solution to one and the solution to the other and you
know what, rather than try to; I am just going to open another presentation where I know
where it is. Come on. IPM, there, shoooo, yeah, there we go. Okay? If you have the proximal
operator…
>>: Yeah, I don't mean [inaudible], that's why I--possibly we can avoid the proximal
[inaudible] and use greedy methods to solve it. Basically I compute the gradient, I take the
minus gradient, and I take the atom in A that maximizes the inner product with the
[inaudible].
>> Ben Recht: Yeah, right. See, there are two problems with, I mean greedy algorithms are
actually a great way to go and actually this atomic stuff is…
>>: [inaudible] become more of a problem [inaudible] proximal [inaudible] sometimes A is very
hard.
>> Ben Recht: There are two things that are troublesome about the greedy methods. One is
that you can't get the sparsity bounds with them. It works for L1, but oftentimes, even for the
nuclear norm, the greedy methods give you full decompositions unless you are very careful.
You have to do some funky stuff. The second thing is that the greedy methods are often no
easier than computing the proximal operator, and they amount to doing the same thing. If you
look at the dual norm, you have to find the atom that is maximally correlated with some
vector. That's the same thing you have to do in the greedy step: if you can do that, you can
run a greedy algorithm, and you can also solve the dual norm projection. Either way, you have
the same complexity. Anyway, we are out of time, guys. Well, thank you, thanks everybody for
paying attention. [applause].