>> Anand Louis: So I'll talk about higher-order Cheeger's inequalities and graph multi-partitioning problems. So firstly, what is graph partitioning? You are given a graph and you want to partition it into some pieces, and generally a measure of how good the partition is is some function of the edges that go between the pieces and the sizes of the pieces. One of the most well studied problems is the sparsest cut problem, where you are given a graph and for any set S we define its expansion as the number of edges that leave the set divided by the size of the set. We denote this by phi of S, and the expansion of the graph is the minimum value of phi of S over all sets S of size at most n over 2. Right, clear? So this is a fundamental NP-hard problem, and it has been of interest to the Markov chain people because if you were to do something like a random walk, then the mixing time of the random walk depends on the expansion of the graph. Also, many divide and conquer algorithms go: first find the sparsest cut, then recurse on each piece, and so on. So computing the sparsest cut is a very important NP-hard problem. There is this thing called Cheeger's inequality, which is a very fundamental way of estimating the expansion of the graph. Cheeger's inequality says
this. You look at the Laplacian matrix, which for a d-regular graph is the identity matrix minus the adjacency matrix divided by d. This matrix is symmetric and diagonally dominant. It's easy to see that the smallest eigenvalue of this matrix is zero: you just take the all-ones vector, and that is an eigenvector with eigenvalue zero. And a very simple exercise is to show that if your graph has, say, k connected components, then the first k eigenvalues are all zero. Okay. So what Cheeger's inequality says is that if you look at lambda 2, the second smallest eigenvalue, then the expansion of the graph is at least on the order of lambda 2 and at most 2 times square root of lambda 2, and the proof of Cheeger's inequality also gives you an algorithm to find a set which satisfies this upper bound. You sort the entries of the second eigenvector in decreasing order, say x1 to xn, x1 being the largest and xn the smallest, and you look at the cuts defined by this ordering. So Si is the set consisting of the first i vertices in this ordering. Then the proof essentially says that the minimum of phi of Si over i is at most 2 square root lambda 2.
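As a concrete illustration, here is a minimal Python sketch of this sweep for a d-regular graph given as a dense adjacency matrix; the function name and the normalization of phi by d are choices made for this sketch, not something fixed by the talk.

import numpy as np

def cheeger_sweep(A, d=None):
    # Sweep cut from the second eigenvector of the normalized Laplacian.
    # A: adjacency matrix of an (originally d-regular) unweighted graph.
    # Passing d explicitly is useful later, when edges have been removed and
    # conceptually replaced by self loops.  Returns (S, phi(S)) for the best
    # prefix set of the sweep, with phi(S) = (edges leaving S) / (d * |S|).
    n = A.shape[0]
    deg = A.sum(axis=1)
    if d is None:
        d = int(deg.max())
    # (D - A)/d equals I - A/d for a d-regular graph; for a piece of the graph
    # it is exactly the Laplacian once self loops are added back at endpoints.
    L = (np.diag(deg) - A) / d
    _, V = np.linalg.eigh(L)              # eigenvalues in increasing order
    x = V[:, 1]                           # second eigenvector
    order = np.argsort(-x)                # entries sorted in decreasing order
    best_phi, best_S = np.inf, None
    for i in range(1, n):                 # prefix sets S_i of the ordering
        if i > n // 2:
            break
        S = order[:i]
        mask = np.zeros(n, dtype=bool)
        mask[S] = True
        phi = A[mask][:, ~mask].sum() / (d * i)   # edges leaving S, normalized
        if phi < best_phi:
            best_phi, best_S = phi, S
    return best_S, best_phi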
So it gives you a simple algorithm to find a set that satisfies this upper bound. Okay? So in this talk I'll talk about two problems. You are given a parameter k, and the first problem is to partition the graph into k pieces S1 through Sk so as to minimize the max over i of phi of Si. And the second, seemingly easier, problem is to find a k-partition that minimizes the total fraction of edges that you cut. Okay? It's easy to see that whatever upper bound you prove for the first problem will also be an upper bound for the second problem. So the second problem is an easier problem, so let's do that first.
>>: [inaudible] fraction [inaudible] cut?
>> Anand Louis: Yeah.
>>: But is it easier?
>> Anand Louis: I mean, whatever upper bound you prove for the first problem is an upper bound for the second problem, right? For the first problem you want to find a k-partition S1 to Sk so as to minimize the max over i of phi of Si.
>>: Really I mean in one case you partition [inaudible] say you maximize [inaudible] cut about
the single number [inaudible] care about k number. Not in this particular case but there are
many questions the next [inaudible] where you want to [inaudible] always around that same
[inaudible] would like to know how many edges would remain [inaudible] much harder.
>>: I know upper bound is not in terms of [inaudible]…
>> Anand Louis: Right, right. I'm just talking about an upper bound, something in terms of a function of k or lambda k or something.
>>: The objective you mean or the [inaudible] when you say upper bound is it a solution you're
talking about or the number you are talking about?
>> Anand Louis: The number. So the total number of -- so suppose you prove that this term is
at most say some alpha, then I'm saying that the upper bound here is also at most alpha.
>>: [inaudible] [multiple speakers]
>> Anand Louis: Uh-huh. I'm just talking about absolute numbers, not approximate [inaudible].
>>: [inaudible]
>> Anand Louis: So for this problem there's a very simple recursive algorithm, right? Cheeger's inequality gives a way to find one cut, so you find that cut, remove those edges, and add self loops in their place: for each edge that you remove, you add a self loop at each of its two endpoints. And then you repeat: now you have two pieces, look at which piece has the smaller second eigenvalue, and do this again until you get k pieces. I'll show you that this very simple algorithm cuts at most a square root lambda k times log k fraction of the edges. So this is an absolute upper bound, not an approximation factor or anything.
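A minimal Python sketch of this recursion, reusing the cheeger_sweep helper from the earlier sketch; the names are illustrative, and treating the removed edges as self loops is modeled simply by keeping the original degree d in the normalization of every piece.

import numpy as np

def recursive_partition(A, k):
    # Repeatedly split the piece whose (self-loop-padded) Laplacian has the
    # smallest second eigenvalue, using a Cheeger sweep, until there are k pieces.
    # A: adjacency matrix of a d-regular graph; returns a list of index arrays.
    n = A.shape[0]
    d = int(A[0].sum())

    def lambda2(p):
        sub = A[np.ix_(p, p)]
        L = (np.diag(sub.sum(axis=1)) - sub) / d   # Laplacian of the piece, self loops included
        return np.linalg.eigvalsh(L)[1]

    pieces = [np.arange(n)]
    while len(pieces) < k:                         # assumes k <= n
        # pick the piece with the smallest second eigenvalue
        j = min((j for j, p in enumerate(pieces) if len(p) > 1),
                key=lambda j: lambda2(pieces[j]))
        p = pieces.pop(j)
        S, _ = cheeger_sweep(A[np.ix_(p, p)], d)   # sweep cut inside that piece
        mask = np.zeros(len(p), dtype=bool)
        mask[S] = True
        pieces += [p[mask], p[~mask]]
    return pieces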
Okay? So to prove this, the first interesting observation is that if you remove any edges from the graph, the eigenvalues can only decrease. So I'll prove this for you; it's a very simple thing. First note the following: let C be the set of edges that you are removing. Then the adjacency matrix of G minus C, minus the adjacency matrix of G, is a diagonally dominant matrix.
>>: [inaudible] second eigenvalue this is Laplacian?
>> Anand Louis: Yes, of course. So do you see why the adjacency matrix of G minus C, minus the adjacency matrix of G, is a diagonally dominant matrix? In which entries do these two matrices differ? Only in the entries corresponding to the edges in C, right? Everything else is the same. So in this difference you get a -1 for every edge that you removed, but for every edge that you removed you also added self loops at its endpoints, so you get a +1 in the diagonal entries corresponding to them. Right? So this is a diagonally dominant matrix, and all diagonally dominant matrices are also…
>>: Sorry, silly question, but a diagonally dominant matrix is one where the values on the diagonal dominate the sums of the other entries?
>> Anand Louis: Yeah, so for every…
>>: [inaudible] you have degrees on the diagonal?
>> Anand Louis: Yeah. By diagonally dominant I mean that each diagonal entry is at least the sum of the absolute values of the off-diagonal entries in its row, and likewise for its column.
>>: [inaudible]
>> Anand Louis: Yeah.
>>: So when you normalize [inaudible] after you [inaudible] no longer…
>> Anand Louis: No, I'm adding self loops in their place so it's just still d regular. So when I am
counting the degree I am counting the self loops also.
>>: [inaudible]
>> Anand Louis: So it's d-regular, so it's diagonally dominant and therefore positive semidefinite. Okay. So let k be an arbitrary number. The kth eigenvalue of G minus C is given by: you minimize over subspaces of dimension k, and within that subspace you maximize x transpose times the Laplacian of G minus C times x, divided by x transpose x. Right? So let me just rearrange the summation and write it this way. The Laplacians of G and of G minus C differ in exactly this term involving the removed edges, and that is a PSD matrix, right? So that term is always non-negative, so let me just throw it away; I get a less-than-or-equal-to, and what is the remaining term? That's exactly lambda k of G, right? So I just proved to you that if you remove any set of edges, each eigenvalue can only decrease; it cannot increase.
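Written out, the chain on this slide is the following, for a d-regular graph in which every removed edge is replaced by self loops at its endpoints, so that L_H = I - A_H/d throughout:

\[
\lambda_k(G - C)
  \;=\; \min_{\dim T = k}\;\max_{x \in T \setminus \{0\}}
        \frac{x^{\top} L_{G-C}\, x}{x^{\top} x},
\qquad
L_{G-C} \;=\; L_G \;-\; \tfrac{1}{d}\bigl(A_{G-C} - A_G\bigr),
\]
and since \(A_{G-C} - A_G\) is diagonally dominant, hence PSD,
\[
\frac{x^{\top} L_{G-C}\, x}{x^{\top} x}
  \;=\; \frac{x^{\top} L_G\, x}{x^{\top} x}
        \;-\; \frac{1}{d}\,\frac{x^{\top}\bigl(A_{G-C} - A_G\bigr) x}{x^{\top} x}
  \;\le\; \frac{x^{\top} L_G\, x}{x^{\top} x},
\]
so taking the min-max on both sides gives \(\lambda_k(G - C) \le \lambda_k(G)\).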
So what does this mean? When I'm making the ith cut in my recursive algorithm, I can bound it by lambda i of the original graph, which is at most lambda k of the original graph. So each cut I can upper bound by square root lambda k times the size of the smaller side, and the total fraction of edges that I'm cutting is at most 2 times square root lambda k times the summation over i of the size of Si. I should have said this before: the notation I'm going to use in this talk is that Si is the smaller side of the ith cut and Si complement is the larger side. So the total fraction of edges that I'm cutting is square root lambda k times the summation over i of the cardinality of Si, and I just need to show that this summation over i of the cardinality of Si is at most the size of the graph times log k. If I show that, then I'm done.
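Schematically, and suppressing degree and normalization factors, the chain being set up here is:

\[
\text{fraction of edges cut}
  \;\lesssim\; \frac{1}{n}\sum_i \phi(S_i)\,|S_i|
  \;\le\; \frac{2\sqrt{\lambda_k(G)}}{n}\sum_i |S_i|
  \;\lesssim\; \sqrt{\lambda_k}\,\log k,
\]
where the middle inequality uses \(\phi(S_i) \le 2\sqrt{\lambda_2(\text{current piece})} \le 2\sqrt{\lambda_k(G)}\) from the previous slide, and the last step is the counting bound \(\sum_i |S_i| \lesssim n \log k\) that the tree argument below establishes.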
So this is a simple counting argument. To show this I'll construct a tree on k nodes, of height log k, which has the property that if you look at any level, the sum of the weights on that level is at most order E. So I'm going to construct a tree on k nodes; on each node I'm going to put one of the Si's, the tree will have height log k, and the sum of the weights on each level will be on the order of the size of the graph. If I can exhibit such a tree then I'm done, because it shows that the summation of the Si's is at most E times log k. So as a first attempt, let me use the cuts to construct a tree as follows: I'll make Si a child of Sj if Si is obtained by cutting the smaller side of that previous cut; otherwise I'll make it a sibling of Sj. Let me illustrate this with a picture. In this graph I'll put V as the root of the tree, and the first cut I make is a trivial cut, so let me just make it a child of V. Now the second cut that I make is obtained by cutting S1 complement, so by the second rule I'll make it a sibling of S1. The third cut is obtained by cutting S2, so by the first rule I'll make it a child of S2. The fourth cut is obtained by cutting S1, so again by the first rule I'll make it a child of S1. At this point you can probably see where I'm going with this, right? If you look at any level, the sets that I'm putting there are pairwise disjoint, so the sum of the weights at each level is at most order E by construction. But as you can probably imagine, the height of this tree could be huge; it could be as large as k if each time you were cutting the smaller side of the graph. But note that the tree I constructed has the property that every node has at most half the weight of its parent. So I can use this to shrink the height of the tree.
So suppose my tree looks like this, with a long path in it. I know that each of these nodes has at most half the weight of its parent, because it was obtained by cutting the parent and I'm only putting the smaller side of each cut in the tree. So suppose I were to shift all of these nodes up one level; by how much does each level increase? The summation of all these blue weights is at most the largest blue weight over there, so at any level the sum is at most going to double. It's still going to be at most 2 times the size of the graph. And now the tree also has the property that every non-leaf node has degree at least 2, so its height is also log k. So we are done: the summation of all the Si's is at most order E times log k, and as I showed you before, the total fraction of edges we cut is then on the order of square root lambda k times log k. Any questions?
>>: [inaudible]
>> Anand Louis: Yeah, it's not that.
>>: What is the lower bound [inaudible] because doesn't it need a lower bound?
>> Anand Louis: Yeah, so the only lower bound you can prove is lambda 2, unfortunately. So
this is not tight at all.
>>: But you know if phi sum is zero then lambda k is also zero, right?
>> Anand Louis: Yeah.
>>: So can't you put a function, some function of lambda k over there?
>> Anand Louis: Well you can put something like lambda k over k if you wanted, but that would
be very small. So you cannot put a lower bound but a tight example is that -- there is no tight
example, but there exists a family of graphs for which this value is square root lambda k log k.
The log k is under the square root.
>>: Okay so…
>> Anand Louis: So this is not tight.
>>: So in a sense, apart from the square root lambda k log k, everything else is tight. You can't control that part.
>> Anand Louis: Yes. You cannot improve that part. So yeah, another thing is that the individual sets in the partition you obtain here could themselves have very bad expansion; in general you can't say anything about their expansion. So that's what I'm going to talk about next. The first statement is: in a graph G, if you pick any k non-empty disjoint subsets, then at least one of them will have expansion at least on the order of lambda k. Okay? This is similar to the easier part of Cheeger's inequality, where you show that the expansion of the graph is at least lambda 2. And probably the more interesting part is that there exists a partition into (1 minus epsilon) k sets, S1, S2 up to S sub (1 minus epsilon) k, such that each set has expansion at most on the order of square root of lambda k log k. This is similar to the harder part of Cheeger's inequality, where you show that the expansion of some set is at most square root lambda 2. And yeah, in this case these two bounds are tight. The lower bound is tight if you take the Boolean hypercube, and the upper bound is also tight for -- oh, I should also say that this theorem was independently proven by Gharan, Lee and Trevisan. Okay. So the upper bound is tight for what is known as the noisy hypercube. Basically, you take the vertices of a k-dimensional Boolean hypercube, put a complete graph on them, and make the weight of an edge (x, y) epsilon raised to the Hamming distance between x and y.
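To make the construction concrete, here is a small Python sketch that builds this weighted graph; it just transcribes the definition above (the function name is illustrative), and of course does not verify the spectral claims that follow.

import itertools
import numpy as np

def noisy_hypercube(k, eps):
    # Weighted complete graph on the vertices of the k-dimensional Boolean
    # hypercube, with w(x, y) = eps ** (Hamming distance between x and y).
    verts = list(itertools.product([0, 1], repeat=k))
    n = len(verts)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            ham = sum(a != b for a, b in zip(verts[i], verts[j]))
            W[i, j] = W[j, i] = eps ** ham
    return verts, W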
At this point it's easy to show that the first k eigenvalues are at most on the order of epsilon, and with a little bit of work you can also show that any set of measure at most 1 over k will have expansion at least on the order of square root of epsilon log k. This follows from Fourier analytic tools, most notably what is known as the reverse Bonami-Beckner inequality. So if you take a k-partition, then at least one of the sets is going to have measure at most 1 over k; therefore this is really the best bound you can prove for a k-partition. Okay?
So first let's do the lower bound; the lower bound is the easy part. Suppose you let S1 to Sk be some k disjoint subsets, and recall again that lambda k is obtained by minimizing over rank-k subspaces and, within that subspace, maximizing x transpose L x divided by x transpose x. So what is the subspace T that you're going to plug in here? Well, there is only one natural thing you can do, right? You take T to be the span of the characteristic vectors of S1 through Sk; this has rank k, and at this point you can essentially show that the vector which maximizes this Rayleigh quotient has to look like one of the characteristic vectors of the sets, or something very similar to that. So that is roughly the main idea of the proof. Okay?
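In symbols, the step being invoked is the variational characterization; the constant in the last bound depends on the exact normalization of phi, so this is only a sketch:

\[
\lambda_k
  \;=\; \min_{\dim T = k}\;\max_{x \in T \setminus \{0\}} \frac{x^{\top} L x}{x^{\top} x}
  \;\le\; \max_{x \in \operatorname{span}(\mathbf{1}_{S_1}, \dots, \mathbf{1}_{S_k}) \setminus \{0\}}
          \frac{x^{\top} L x}{x^{\top} x},
\]
and for \(x = \sum_i c_i \mathbf{1}_{S_i}\) with disjoint \(S_i\), only edges leaving the sets \(S_i\) contribute to the numerator, which lets one bound the right-hand side by \(O\!\bigl(\max_i \phi(S_i)\bigr)\); this is the claimed lower bound.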
Okay. So for the k-partition, I told you that you can obtain a (1 minus epsilon) k partition such that the expansion of each set is like poly(1 over epsilon) times square root of lambda k log k. You really cannot do away with the poly(1 over epsilon) here; you really need that term, because in general, if you wanted an exact k-partition, then there exists a family of graphs for which the max over i of phi of Si can be much, much larger than square root lambda k, as large as something like k squared divided by square root n. The family of graphs essentially looks like this: you start with k cliques and a central vertex, and then you add edges between the central vertex and every other vertex independently with probability p. I get an upper bound on lambda k from the inequality I showed you on the previous slide: each clique has small expansion, therefore lambda k has to be small.
>>: [inaudible] did you say?
>> Anand Louis: Huh?
>>: What are the…
>> Anand Louis: The Si's are cliques, complete graphs.
>>: Then what's the point of adding edges [inaudible]
>> Anand Louis: No. I'm adding edges between this central vertex and every other vertex with
probability p.
>>: Yes. Why don't you have to serve a number of edges? What is the difference?
>> Anand Louis: Okay. So you can add -- well I wanted to make it…
>>: [inaudible] random edges to [inaudible]
>> Anand Louis: No. I wanted to make it an unweighted graph. If you are fine with weighted
graphs then you can just add, put weights of edge p over here.
>>: Okay.
>> Anand Louis: No. I just wanted to make it an unweighted graph so that was the point
behind this.
>>: The point is that [inaudible] identical, right, so [inaudible] random you can just add the
[inaudible] edges [inaudible] first.
>> Anand Louis: First?
>>: [inaudible]
>> Anand Louis: No. If you add edges to one vertex over here…
>>: No, no. To the [inaudible].
>> Anand Louis: Over here?
>>: Yeah, because all of them are the same, so you can [inaudible] whatever.
>> Anand Louis: Yeah. Well, so if you add edges -- so I really want this to be uniform, because if you do something like that, add edges to some particular vertex over here, then -- if you do it like this it's easy to argue what this value is going to be; it's easy to prove a lower bound on it. I mean, it should work if you do it that way as well. I'm just saying, you know, regular things look nicer.
>>: Like are you thinking of them as weights?
>> Anand Louis: Yes. Think of them as weights. Think of this, all these edges having weight p.
>>: No. No. The [inaudible] that's a different [inaudible] [multiple speakers]
>>: That's a different graph, right? Than having a…
>>: It's not a graph. You're giving weights to the edges. That's…
>>: So that's a single graph, but [inaudible] graphs but all of those graphs are kind of more or
less isomorphic.
>> Anand Louis: Okay. So this is assuming that these edges have weight p, if you're fine with weighted graphs.
>>: [inaudible]
>> Anand Louis: Okay. So in that case we can show the following. In the k-partition that you produce, you have to put the center vertex in some piece, right? And whichever piece you put it in, you are going to pay a huge expansion, because the center vertex has a lot of edges incident on it. So the piece containing the central vertex has phi roughly k times the expansion of the rest of the pieces, and this fact is enough to get a huge gap here by appropriately choosing the value of p. So the point I wanted to make is that you really need this poly(1 over epsilon) here; you cannot replace it with some small constant. As you get closer and closer to an exact k-partition, that factor has to blow up. Okay?
Okay. So let's see how to prove this. Cheeger's analysis essentially shows that you don't need to start with the second eigenvector: you can start with any vector x which has small support, and you can obtain a set S which lies in the support of x whose expansion is at most this quantity over here. Okay? I like to think of this quantity as the average distortion of the l1 embedding given by this vector. The embedding is the canonical one, where you map vertex i to the value xi. Then the term |xi minus xj| is like the distortion of the edge (i, j), and, under a suitable normalization, this quantity is like the average distortion of the embedding. So Cheeger's analysis essentially says that you can start with any vector and find a set S which lies in the support of that vector and whose expansion is at most the average distortion. So if I can find on the order of k such vectors, each having disjoint support and each having small average distortion, then I'm done, right? Because I apply this lemma to each of these vectors and each of them gives me a set. So that's going to be my main goal for the rest of this talk: to come up with order k vectors which have disjoint support and small average distortion. Okay? Now, I don't know how to start with such vectors, but the top k eigenvectors give me an embedding into R to the k, the canonical embedding where you map vertex i to the vector consisting of the ith coordinates of the k eigenvectors. Okay? This embedding satisfies some very simple properties. The first one says that the summation of the squared lengths of these vectors is equal to k; this just follows from the fact that you have k eigenvectors, each of length one. And probably the most interesting property is this: the total inner product squared is also equal to k. This again follows from the fact that the k eigenvectors are orthogonal to each other, but the reason why I want you to notice it is that it is a very strong property; it is not satisfied by arbitrary vectors in R to the k. If you were to pick arbitrary vectors in R to the k and normalize them in this way, this sum could be as large as k squared. So this quantity being equal to k already indicates that there are some clusters among these vectors, which we will crucially exploit later on. It is also easy to see that the average distortion of this embedding is at most lambda k; this follows from the fact that each coordinate has distortion at most lambda k, because these are the first k eigenvalues.
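Written out for a d-regular graph, with v_1, ..., v_n the rows of the n-by-k matrix V whose columns are the first k (orthonormal) eigenvectors, these three properties read:

\[
\sum_{i} \|v_i\|^2 \;=\; k,
\qquad
\sum_{i,j} \langle v_i, v_j\rangle^2 \;=\; k,
\qquad
\frac{1}{d}\sum_{(i,j)\in E} \|v_i - v_j\|^2 \;=\; \sum_{t=1}^{k} \lambda_t \;\le\; k\,\lambda_k .
\]
The first is \(\|V\|_F^2 = k\), the second is \(\|V V^{\top}\|_F^2 = k\) because \(V V^{\top}\) is a rank-k projection, and the third just sums the Rayleigh quotients of the k eigenvectors.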
All right? Any questions? Okay. So the rounding algorithm is very simple, probably deceptively simple. You pick k random Gaussian vectors g1 through gk. Let x1 be the projection of all the vi's on g1, let x2 be the projection of all the vi's on g2, and so on, and let xk be the projection of all the vi's on gk. Now we want to make these vectors have disjoint support, right? So what's the most obvious thing you would do? Go to each row and zero out all but the largest value. Okay? So let me look at the first row: say the first value is the largest, so I keep that and zero out everything else in the first row. Then I go to the second row, keep whatever is the largest value there, and zero out everything else, and I do that for each row. So now I have k vectors, each with disjoint support, and I need to show that all of them, or at least some constant fraction of them, have low distortion. So is the algorithm clear to everyone, or any questions?
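A minimal Python sketch of this rounding step, with illustrative names; whether one compares raw or absolute projections when picking the row maximum is a detail glossed over here.

import numpy as np

def round_embedding(V, rng=None):
    # V: n x k matrix whose i-th row is the spectral embedding v_i.
    # Returns an n x k matrix X whose columns x_1, ..., x_k have disjoint support:
    # column t holds the projections <v_i, g_t> onto a random Gaussian g_t,
    # but in each row only the largest entry is kept and the rest are zeroed out.
    rng = np.random.default_rng() if rng is None else rng
    n, k = V.shape
    G = rng.standard_normal((k, k))            # columns are g_1, ..., g_k
    P = V @ G                                  # P[i, t] = <v_i, g_t>
    X = np.zeros_like(P)
    winners = P.argmax(axis=1)                 # index of the largest entry in each row
    X[np.arange(n), winners] = P[np.arange(n), winners]
    return X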
Okay. So why does this algorithm work? An intuitive reason is this: if you think of the blue vectors as the vi's and the brown vectors as the Gaussian vectors that you picked, then essentially what the algorithm is doing is that the support of x1 is going to look roughly like this cluster of vi's around g1, the support of x2 is going to look like the cluster around g2, and so on.
>>: [inaudible] the square distances.
>> Anand Louis: Yeah. So I'm just saying that the algorithm I described is going to find these clusters, roughly, or at least some large fraction of them. And that is roughly what we want, because if the clusters are reasonably tight then they will give us good sets, and so on. And that's what I'll prove.
Okay. So roughly the analysis is going to go like this. Recall this term, the average distortion; I want to show that it is small for a constant fraction of the indices. So I'll show you that the expected value of the denominator for each of these vectors is like log k, and the expected value of the numerator is like square root of lambda k log cubed k. Okay? So the ratio is what we want. Is this sufficient for us? Probably not, right? So I'll also show you that, for a constant fraction of the indices, the expected value of the ratio is bounded by some constant times the ratio of the expected values, which is like square root of lambda k log k. Okay? So let's do the denominator first; that's the easy part. Again, keep this picture in mind: this is the first entry, I did not zero it out, and I zeroed out everything else in the first row. So what is the probability that I did not zero this entry out? Well, all of the gi's are independently chosen, so any of them could have been the largest; the probability that this particular one is the largest is 1 over k, right? Okay. So this entry survives with probability 1 over k, and given that it is the largest in its row, what does its expected value look like? This is a simple exercise: if you pick k standard normal random variables and look at the expected value of the square of the maximum, it is at most on the order of log k. Okay? Keeping this in mind, I get that the expected value of the square of the ith entry in the first column, conditioned on it being nonzero, is norm of vi squared times log k. So its overall expected value is that term divided by k: the expected squared value of the ith entry is norm vi squared times log k, divided by k. To compute the expected value of the denominator I just sum this over all entries, and recall that the vi's are normalized in such a way that the summation over i of norm vi squared is exactly k. So the denominator is, in expectation, on the order of log k.
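In symbols, and suppressing degree factors, the computation just described is:

\[
\mathbb{E}\bigl[\|x_1\|^2\bigr]
  \;=\; \sum_i \Pr[\text{entry } i \text{ survives}]\;
         \mathbb{E}\bigl[\langle v_i, g_1\rangle^2 \,\big|\, \text{it is the row maximum}\bigr]
  \;\approx\; \sum_i \frac{1}{k}\,\|v_i\|^2\, O(\log k)
  \;=\; O(\log k),
\]
using \(\sum_i \|v_i\|^2 = k\).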
Okay? Any questions? So that was the easy part; the numerator requires a bit of work. The numerator consists of terms which look like |Xi minus Xj|, and as before I can calculate the expected value of Xi given that Xi is nonzero, and the expected value of Xj given that Xj is nonzero. So Xi is like norm vi squared times this random variable, the square of the projection of vi tilde on g1, if that projection is the largest in its row; otherwise it is zero, and the same for Xj. Again, throughout all of this keep this picture in mind: these are the projections, and this is the entry that I did not zero out. Okay? So I need to calculate; there are four cases here. Case one, both of them are zero, in which case I don't care. Case two, both of them are nonzero; I'd ask you to believe me that the probability of that happening is small, so it is not something you should worry about. The most interesting case is when one of them is zero and the other is nonzero, so we need to bound the probability of that event. This is what it looks like: there is this unit sphere, vi tilde and vj tilde are vectors on this unit sphere, and I want to calculate the probability that Xj is nonzero and Xi is zero, right? So suppose g1 were to look something like this. When would Xj be nonzero? Roughly, Xj is nonzero when vj tilde aligns very well with g1, right? Keep this picture in mind; this was my rounding algorithm. When would the projection of vj on g1 be the largest in its row? When it's the largest among all of these projections, which means vj is aligning very well with g1, or at least better than it aligns with all the other gi's, right? So if Xj is going to be nonzero, then most probably vj tilde lies in this shaded cap around g1, because if it does not lie in this shaded region, then with high probability it is aligning much better with some other gi. Okay? Now, the probability that Xj is nonzero and Xi is zero should depend on the distance between these two vectors, right? If the two vectors are very close to each other, then the two random variables behave almost the same; if the two vectors are orthogonal, then they behave independently. So using this you can show that the probability of a cut being made, in the sense that one of them is zero and the other is nonzero, is upper bounded by 1 over k times the distance between these two vectors times square root of log k. The square root of log k comes in because it is like the size of the cap on the unit sphere. Okay? Oh, and this was formally proven by Charikar, Makarychev and Makarychev. Any questions?
Okay. So moving on, now I'm ready to calculate the expected value of |Xi minus Xj|. When Xi is zero and Xj is nonzero, the expected value of Xj looks like norm vj squared times log k, and the probability of that event is the bound from the previous slide; when the opposite event happens, Xi nonzero and Xj zero, the expected value looks the same with i and j exchanged, and again the probability is bounded the same way. So, well, I ask you to bear with me on this slide; this is the only slide that has this much math on it. By doing some simple rearrangement I can upper bound the whole thing by 1 over k times the sum over edges of the length of vi times the distance between vi and vj, times square root of log k times log k, that is, log to the three halves k; I'm collecting the square root log k from the probability and the log k from the expectation. Okay? At this point, whenever you see a sum that looks like this, it is asking you to plug in the Cauchy-Schwarz inequality, right? You sum this over all edges and apply Cauchy-Schwarz. So what do you get? The 1 over k stays as is, and you get the square root of the sum over all edges of norm of vi minus vj squared, times the square root of the summation over i of norm vi squared. And what were these two values? If you remember the picture I showed you before, the summation over the edges of norm vi minus vj squared is lambda k times k, and the summation over i of norm vi squared is equal to k. So let me plug that in: this is lambda k times k, and this is k, so the square root k factors cancel against the 1 over k and you are left with square root of lambda k log cubed k. Okay? Any questions?
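Spelled out with the normalization used on the slides (degree factors suppressed), the Cauchy-Schwarz step is:

\[
\frac{\log^{3/2} k}{k}\sum_{(i,j)\in E} \|v_i\|\,\|v_i - v_j\|
  \;\le\;
  \frac{\log^{3/2} k}{k}
  \Bigl(\sum_{(i,j)\in E} \|v_i - v_j\|^2\Bigr)^{1/2}
  \Bigl(\sum_{i} \|v_i\|^2\Bigr)^{1/2}
  \;\le\;
  \frac{\log^{3/2} k}{k}\,\sqrt{\lambda_k\,k}\,\sqrt{k}
  \;=\;
  \sqrt{\lambda_k}\,\log^{3/2} k
  \;=\;
  \sqrt{\lambda_k\,\log^3 k}.
\]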
Okay. So let's do a quick overview: we bounded the probability that one of Xi, Xj is zero and the other is nonzero, which was this, and we know how to calculate their expected values separately. I didn't show you what to do when both of them are nonzero, but that is a very simple case and it happens with very small probability, so you can ignore it. Once you know how to do this, you just sum over all edges, apply Cauchy-Schwarz, and you are done. Okay? So far I showed you that the denominator is, in expectation, log k and the numerator is square root of lambda k log cubed k. Now I'll show you that for a constant fraction of the indices, the expected value of the ratio is bounded by some constant times the ratio of the expected values, which is square root of lambda k log k.
Okay? So again, in this part we will crucially use the property that our embedding has, that the total inner product squared is small, which already indicates that there is high correlation, that the vectors fall into clusters. So as a thought experiment, consider the following case: any two of the vi's are either equal or orthogonal to each other. In this case I claim that the rounding algorithm is like throwing k balls into k bins. What are the balls and bins here? The k bins are the k Gaussian vectors that you have, and the k balls are the k clusters of the vi's. And why is this like throwing balls into bins? If you recall what our rounding algorithm did, you took the projections of each vi onto the gi's and assigned it to whichever one was the largest, right? So the Gaussian vectors are like the bins, and I say that a ball has gone into a particular bin if the projection onto that gi is the largest. So the rounding algorithm is like throwing k balls into k bins, and what happens when you throw k balls into k bins? You would expect that something like a 1 minus 1 over e fraction of the bins are nonempty, right?
>>: [inaudible] you don't have independence, right? You have independence.
>> Anand Louis: No, that's why I said to consider, for this thought experiment, the case where any two vi's are either equal or orthogonal to each other. In that case, when one vertex goes to a bin, the whole cluster goes with it, and since the clusters are orthogonal they behave independently. This is just intuition. And in this case you get that a constant fraction of the denominators are large: whenever a ball goes into a bin, that bin is large. So this is the intuition behind trying to show that the denominator is large, but unfortunately we could not make this argument work directly; maybe it can be made to work, I don't know. Instead we have to go the boring way of bounding the variance of the denominator.
Okay? So what does the variance look like? The variance involves the summation, over pairs i and j, of the expected value of Xi times Xj. And what does each such term look like? We know the expected value of Xi when it's nonzero: it's something like norm vi squared times log k. We know the expected value of Xj when it's nonzero: it's something like norm vj squared times log k. What we need to bound is the probability that both of them are nonzero, right? Again, recall that Xi is this random variable if the projection on g1 is the largest in its row, and similarly for Xj. So I'll show you that the probability that both of them are nonzero can be upper bounded by the inner product of vi tilde and vj tilde, squared, divided by k, plus 1 over k squared. Okay? Again, keep this picture in mind and consider two boundary cases. First, suppose vi tilde and vj tilde are orthogonal to each other; what is the value of this probability? If vi tilde and vj tilde are orthogonal, then Xi and Xj are essentially independent random variables, so the probability that both of them are nonzero is 1 over k squared. The other boundary case is when vi tilde is equal to vj tilde; in that case the probability that both of them are nonzero is just 1 over k, which is what we have over here. So essentially what this statement is saying is that the general case is a convex combination of the two boundary cases: the one where they are completely independent and the one where they are completely correlated. Proving this requires a bit of work and I'm not going to prove it for you; you just have to believe me.
Roughly, the high-level idea is that you can think of each Xi as a function from Gaussian space to the real line, because each Xi is a function of vi and of g1 through gk. You can expand this function in the Hermite basis; the Hermite polynomials form a complete orthogonal basis for Gaussian space, and this plus some Fourier analytic tools on top of it essentially gives you this statement. But yeah, proving that requires some amount of work. Anyway, let's assume this and finish the computation of the variance. So I have norm vi squared times norm vj squared times log squared k times this probability term over here. Rearranging the summation, I can write this as log squared k times the following: the summation over i and j of the inner product of vi and vj, squared, divided by k, plus the summation over i of norm vi squared, all squared, divided by k squared. And what are these two values? Going back to this picture, the summation over i and j of the inner products squared is equal to k, and the summation over i of norm vi squared is equal to k. So if I plug these in, the first term is equal to one and the second term is also equal to one, so the variance is at most 2 log squared k. Any questions at this point? I guess the most crucial part here is that this is where we use the fact that the total inner product squared is small; it indicates that your vectors are strongly correlated, and this is where we use that correlation. If you were to pick some arbitrary n vectors in k dimensions, they need not satisfy this bound, and that is what we are crucially using here. So at this point we are basically done, right? I
showed you that the expected value of the numerator is square root of lambda k log cubed k, so by Markov's inequality something like 99 percent of the numerators are smaller than a hundred times this. And I can bound the denominator using Chebyshev's inequality: the expected value of the denominator is log k and the variance is log squared k, so the probability that the denominator is at least log k over 2 is at least a constant, something like one fourth. So a constant fraction of the denominators are large. Therefore, taking a union bound over these two events, there are at least something like k over 8 indices for which the numerator is small and the denominator is large, and that's it. So we are done. Any questions?
>>: [inaudible] take care of all the [inaudible] constant [inaudible]
>> Anand Louis: Right, well, I've actually proved a weaker theorem for you. I told you that you can get one minus epsilon times k sets; to do that you need to do this much more carefully, and right now this proof only shows that you get some constant times k sets. It requires pretty much the same ideas, but a much more careful analysis. Okay? You can also ask the computational question: given a parameter k, find the best k-partition so as to minimize the maximum expansion. This can be thought of as an extension of the sparsest cut problem. It is actually a very tricky problem, because any of the standard SDP formulations that you try can have an unbounded integrality gap; the gap can be as large as you want. In the standard formulation you would have something like k indicator vectors for each vertex, to indicate which set the vertex belongs to, and you can show that this has an integrality gap as big as you want. I did not say this in much detail, but [inaudible] came up with an interesting SDP formulation which inherently takes care of the problems that we seem to have with the standard formulation, and using a similar rounding algorithm we can get an algorithm that outputs one minus epsilon times k sets, each having expansion at most on the order of square root of log n log k times the optimum. As [inaudible] noted, an approximation algorithm for this does not imply anything for the min-sum partitioning problem. Okay? So yeah, that's all I have to say. Probably the most pressing open question is whether you can get rid of the epsilon loss in the number of sets: can you get exactly k non-empty disjoint subsets such that each has expansion on the order of square root of lambda k log k, right?
>>: Your [inaudible] gap is the gap you show [inaudible] slides [inaudible] example.
>> Anand Louis: No, there that was for a partition; right now I'm just asking for k non-empty disjoint subsets. I'm not insisting on a partition, so this could just be k non-empty disjoint subsets. If you remember the example that I showed, the k cliques would themselves be these k sets that you want. Another related problem of interest is the small set expansion problem, where given a parameter k you want to find a set of size n over k which has the least expansion. What I showed you today also implies an upper bound for small set expansion, right? If you find a k-partition, then one of those sets is going to be smaller than n over k. For small values of k that's fine, but for large values of k it can be improved. Arora, Barak and Steurer showed that you can find a set of size n over k whose expansion is at most something like lambda sub k to the c, where c is some constant (something like 65), times the square root of log n to the base k. So here's an interesting regime: when k is something like n to the epsilon, you essentially lose the log n to the base k factor. So if you are interested in subexponential time algorithms for unique games, the k that you are going to plug in is something like n to the epsilon, and in that regime this is more interesting than the theorem I showed you. And that's it. Thanks for coming, and I would be glad to take questions.
>>: Here you always examine the number of edges from Si to the rest of the graph. Are there any results in which you care about the edges between Si and Sj, for all of the pairs i, j? That would be much more…
>> Anand Louis: That would be like some regularity partition.
>>: I don't know what you mean by that.
>> Anand Louis: So that's getting at something like Szemerédi regularity and such. Yeah, I haven't seen any problems like the ones that you describe.
>>: [inaudible] what you want is small expansion into each which is a much [inaudible]
condition.
>>: So you want to look at the maximum case where there's pairs and minimize that.
>>: For example. [inaudible] I take some normal whatever [inaudible]
>>: What you want in this case is take [inaudible]
>>: Yes, yes exactly. Even just taking, having [inaudible] is much harder.
>> Anand Louis: Yeah, I doubt if it has any connection to the eigenvalues, but…
>>: Why the obsession with eigenvalues? I mean, it's nice to have, okay, but why? What does it mean to connect the k eigenvalues to partitions, or to learn about partitions from them?
>> Anand Louis: Well the reason why we started looking…
>>: [inaudible]
>> Anand Louis: Hum?
>>: I thought your aim was to learn about partitions, so [inaudible]
>> Anand Louis: Right, so the reasons why we started looking at this problem…
>>: [inaudible] one method [inaudible] eigenvalues.
>> Anand Louis: Yes. I'm saying that.
>>: It's not compulsory.
>> Anand Louis: Right. But yeah, the reason why we started looking at this is because of its connection to unique games; like I said, if you get better bounds for this, you could get slightly better algorithms for unique games. That is why this expansion problem gained fame in the first place.
>>: Also, would say if you are interested in [inaudible] eigenvalues [inaudible] so that's what
people [inaudible] and practice [inaudible] to partition data [inaudible]
>>: And so the same [inaudible] in place for the methods [inaudible]
>>: [inaudible] do things that seem to be computationally [laughter]
>>: I know. I know, but it's, the answer is we do it because we can.
>>: Yes. [inaudible] because maybe [inaudible] cures these algorithms really [inaudible] so well
that it becomes [inaudible] something [inaudible]
>>: More questions? Maybe more suggestions for your problems? Is that even better?
[inaudible] let's thank the speaker again. [applause]