>> Anand Louis: So I'll talk about higher-order Cheeger inequalities for graph multipartitioning problems. So first, what is graph partitioning? You are given a graph and you want to partition it into some pieces, and generally the measure of how good a partition is is some function of the edges that go between the pieces and the sizes of the sets. One of the most well studied problems is the sparsest cut problem, where you are given a graph and for any set S we define its expansion as the ratio of the number of edges that leave the set to the size of the set. We denote this by phi of S, and the expansion of the graph is the minimum value of phi of S over all sets S of size at most n over 2. Right, clear? So this is a fundamental NP-hard problem, and it has been of interest to the Markov chain community because if you were to do something like a random walk, then the mixing time of the random walk depends on the expansion of the graph, and many divide-and-conquer algorithms go: first find the sparsest cut, recurse on each piece, and so on. So computing the sparsest cut is a very important NP-hard problem. There's this thing called Cheeger's inequality, which is a very fundamental way of estimating the expansion of a graph. Cheeger's inequality says this. You look at the Laplacian matrix, which for a d-regular graph is the identity matrix minus the adjacency matrix divided by d. This matrix is symmetric and diagonally dominant, and it's easy to see that its smallest eigenvalue is zero: the all-ones vector is an eigenvector with eigenvalue zero. A very simple exercise is to show that if your graph has k connected components, then the first k eigenvalues are all zero. Okay. So what Cheeger's inequality says is that if you look at lambda two, the second smallest eigenvalue, then the expansion of the graph is at least lambda two and at most two times the square root of lambda two, and the proof of Cheeger's inequality also gives you an algorithm to find a set which satisfies this upper bound. You sort the entries of the second eigenvector in decreasing order, say x1 to xn, x1 being the largest and xn the smallest, and you look at the cuts defined by this ordering: Si is the set consisting of the first i vertices in this ordering. Then the proof essentially says that the minimum of phi of Si over i is at most 2 times the square root of lambda 2. So it gives you a simple algorithm to find a set that satisfies this upper bound. Okay? So in this talk I'll talk about two problems. You are given a parameter k. The first problem is to partition the graph into k pieces S1 through Sk so as to minimize the max over i of phi of Si. The second, seemingly easier, problem is to find a k-partition that minimizes the total fraction of edges that you cut. Okay? It's easy to see that whatever upper bound you prove for the first problem is also an upper bound for the second problem, so the second problem is an easier problem; let's do that first. >>: [inaudible] fraction [inaudible] cut? >> Anand Louis: Yeah. >>: But is it easier? >> Anand Louis: I mean, whatever upper bound you prove for the first problem is an upper bound for the second problem, right? For the first problem I want you to find a k-partition S1 to Sk so as to minimize the max over i of phi of Si.
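To make the sweep-cut step concrete, here is a rough sketch in NumPy (my own illustration, not code from the talk). It assumes a d-regular, unweighted graph given as a dense adjacency matrix and the normalization phi(S) = |E(S, V\S)| / (d |S|), under which the Cheeger bounds above apply; the name sweep_cut is mine.

```python
import numpy as np

def sweep_cut(A, d):
    """Cheeger sweep cut for a d-regular, unweighted graph given by its
    adjacency matrix A (self loops, if any, sit on the diagonal).
    Expansion is normalized as phi(S) = |E(S, V\\S)| / (d * |S|)."""
    n = A.shape[0]
    L = np.eye(n) - A / d                  # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues in increasing order
    x = eigvecs[:, 1]                      # second eigenvector
    order = np.argsort(-x)                 # sort entries in decreasing order
    best_set, best_phi = None, np.inf
    for i in range(1, n // 2 + 1):         # only prefixes of size at most n/2
        mask = np.zeros(n, dtype=bool)
        mask[order[:i]] = True
        cut = A[mask][:, ~mask].sum()      # weight of edges leaving the prefix set
        phi = cut / (d * i)
        if phi < best_phi:
            best_set, best_phi = order[:i], phi
    return best_set, best_phi
```

By Cheeger's inequality, the set returned by this sweep has expansion at most roughly 2 times the square root of lambda 2.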
>>: Really, I mean, in one case you partition [inaudible] say you maximize [inaudible] cut about the single number [inaudible] care about k number. Not in this particular case, but there are many questions [inaudible] where you want to [inaudible] always around that same [inaudible] would like to know how many edges would remain [inaudible] much harder. >>: I know the upper bound is not in terms of [inaudible]… >> Anand Louis: Right, right. I'm just saying [inaudible] upper bound, something like a function of k or lambda k or something. >>: The objective, you mean, or the [inaudible]? When you say upper bound, is it a solution you're talking about or the number you are talking about? >> Anand Louis: The number. So suppose you prove that this term is at most, say, some alpha; then I'm saying that the upper bound here is also at most alpha. >>: [inaudible] [multiple speakers] >> Anand Louis: Uh-huh. I'm just talking about absolute numbers, not approximation [inaudible]. >>: [inaudible] >> Anand Louis: So for this min-sum problem there's a very simple recursive algorithm, right? Cheeger's inequality gives a way to find one cut, so you find that cut, remove those edges and add self loops in their place: for each edge that you remove, you add a self loop at each of its endpoints. And then you repeat: now you have two pieces, look at which piece has the smaller second eigenvalue and cut it again, and so on, until you get k pieces. I'll show you that this very simple algorithm cuts at most a square-root-of-lambda-k times log k fraction of the edges. And this is an absolute upper bound; it's not an approximation factor or anything. Okay? So to prove this, the first interesting observation is that if you remove edges from the graph in this way, the eigenvalues are only going to decrease. I'll prove this for you, and it's a very simple thing. First note the following: let C be the set of edges that you will be removing. Then the adjacency matrix of G minus C minus the adjacency matrix of G is a diagonally dominant matrix. >>: [inaudible] second eigenvalue, this is the Laplacian? >> Anand Louis: Yes, of course. So do you see why the adjacency matrix of G minus C minus the adjacency matrix of G is a diagonally dominant matrix? In which entries do these two matrices differ? Only the entries corresponding to the edges in C, right? Everything else is the same. So in this difference you'll get a -1 for every edge that you removed, but for every edge that you remove you also added self loops at its endpoints, so you get a plus one in the diagonal entries corresponding to them. Right? So this is a diagonally dominant matrix, and all diagonally dominant matrices are also… >>: Sorry, silly question, but a diagonally dominant matrix is something where the values on the diagonal beat the sums of things? >> Anand Louis: Yeah, so for every… >>: [inaudible] you have degrees on the diagonal? >> Anand Louis: Yeah. By diagonally dominant I mean that each diagonal entry is at least the sum of the magnitudes of the other entries in its row, and at least the sum in its column. >>: [inaudible] >> Anand Louis: Yeah. >>: So when you normalize [inaudible] after you [inaudible] no longer… >> Anand Louis: No, I'm adding self loops in their place, so it's still d-regular. When I am counting the degree I am counting the self loops also. >>: [inaudible] >> Anand Louis: So it's d-regular, so it's diagonally dominant and positive semi-definite. Okay.
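Here is a rough sketch of the recursion just described (again my own illustration, reusing the hypothetical sweep_cut routine above); removed edges are replaced by self loops by moving their weight onto the diagonal, so each piece stays d-regular.

```python
import numpy as np

def recursive_partition(A, d, k):
    """Recursive spectral partitioning as described in the talk: repeatedly
    sweep-cut the piece with the smallest second eigenvalue, replacing every
    removed edge by self loops at its endpoints so each piece stays d-regular."""
    n = A.shape[0]
    A = A.astype(float)                   # work on a copy with the self-loop bookkeeping
    pieces = [np.arange(n)]

    def lambda2(idx):
        if len(idx) < 2:
            return np.inf
        L = np.eye(len(idx)) - A[np.ix_(idx, idx)] / d
        return np.linalg.eigvalsh(L)[1]

    while len(pieces) < k:
        j = min(range(len(pieces)), key=lambda t: lambda2(pieces[t]))
        idx = pieces[j]
        S_local, _ = sweep_cut(A[np.ix_(idx, idx)], d)
        S = idx[S_local]
        T = np.setdiff1d(idx, S)
        cut_block = A[np.ix_(S, T)]
        # remove the cut edges and add self loops at their endpoints
        A[np.ix_(S, S)] += np.diag(cut_block.sum(axis=1))
        A[np.ix_(T, T)] += np.diag(cut_block.sum(axis=0))
        A[np.ix_(S, T)] = 0.0
        A[np.ix_(T, S)] = 0.0
        pieces[j:j + 1] = [S, T]
    return pieces
```

Each recursive cut can then be charged against the square root of lambda k of the original graph, because, as argued next in the talk, replacing removed edges by self loops never increases any eigenvalue.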
So let k be an arbitrary index. The k-th eigenvalue of G minus C is given by the following: you minimize over subspaces of rank k, and within that subspace you maximize x transpose times the Laplacian of G minus C times x, divided by x transpose x. Right? So let me just rearrange and write it this way. The Laplacians of G and of G minus C differ in exactly this matrix, and this is a PSD matrix, right? So this term is always going to be non-negative, so let me just throw it away and I get a less-than-or-equal-to. And what is the remaining term? That's exactly lambda k of G, right? So I just proved to you that if you remove any set of edges in this way, each eigenvalue can only decrease; it cannot increase. So what does this mean? When I'm making the i-th cut in my recursive algorithm, the eigenvalue I'm using can be bounded by lambda i of the original graph, which is at most lambda k of the original graph. So each cut I can upper bound by the square root of lambda k times the size of the smaller side. So the total fraction of edges that I'm cutting is at most 2 times the square root of lambda k times the summation over i of the size of Si. I should have said this before: the notation I'm going to use in this talk is that Si is the smaller side of the i-th cut and Si complement is the larger side. So the total fraction of edges that I'm cutting is the square root of lambda k times the summation over i of the cardinality of Si, and I just need to show that the summation over i of the cardinality of Si is at most n times log k, where n is the size of the graph. If I show that, then I'm done. So this is a simple counting argument. To show this I'll construct a tree on k nodes, of height order log k, with the property that if you look at any level, the sum of the set sizes on that level is at most order n. So I'm going to construct a tree on k nodes; on each of these nodes I'm going to put one of the Si's, and I want the tree to have height order log k and the summation of the weights on each level to be at most order n. If I can construct such a tree then I'm done, because that shows that the summation of the Si's is at most n times log k. So as a first attempt, let me use the cuts themselves to construct a tree. I'll make Si a child of Sj if Si is obtained by cutting Sj, the smaller side of a previous cut; otherwise, if Si is obtained by cutting Sj's complement, I'll make it a sibling of Sj. Let me illustrate this with a picture over here. In this graph I'll put V as the root of the tree, and the first cut I make is a trivial case, so let me just make S1 a child of V. Now the second cut that I make is obtained by cutting S1 complement, so using the second rule I'll make it a sibling of S1. The third cut is obtained by cutting S2, so by the first rule I'll make it a child of S2. The fourth cut is obtained by cutting S1, so again by the first rule I'll make it a child of S1. At this point you can probably see where I'm going with this, right? If you look at any level, the sets that I'm putting there are disjoint, so the sum of the weights at each level is at most n by construction. But as you can probably imagine, the height of this tree could be huge; it could be as large as k if each time you were cutting the smaller side. Note, though, that the tree I constructed has the property that every node is at most half the weight of its parent. So I can use this to shrink the height of the tree. So suppose my tree looks like this; suppose my tree has a long path in it.
I know that each of these nodes is at most half its parent, because it was obtained by cutting the parent and I'm only putting the smaller side of each cut in the tree. So suppose I were to shift all of these things up; how much is any level going to increase by? The summation of all these blue things is at most the largest blue thing over there, so at any level the summation is at most going to double. It's still going to be at most two times the size of the graph. And now the tree also has the property that every non-leaf node has at least two children, so its height is at most log k. Yeah, so therefore we are done: the summation of the sizes of all the Si's is at most n times log k, and as I showed you before, the total fraction of edges we cut is then on the order of the square root of lambda k times log k. Any questions? >>: [inaudible] >> Anand Louis: Yeah, it's not that. >>: What is the lower bound [inaudible] because doesn't it need a lower bound? >> Anand Louis: Yeah, so the only lower bound you can prove is lambda 2, unfortunately. So this is not tight at all. >>: But you know if this sum is zero then lambda k is also zero, right? >> Anand Louis: Yeah. >>: So can't you put some function of lambda k over there? >> Anand Louis: Well, you could put something like lambda k over k if you wanted, but that would be very small. So you cannot put a matching lower bound, but there exists a family of graphs for which this value is the square root of lambda k log k, with the log k under the square root. >>: Okay so… >> Anand Louis: So this is not tight. >>: So in a sense, apart from whether the log k sits inside the square root, everything else is tight. You can't control that part. >> Anand Louis: Yes. You cannot improve that part. So yeah, another thing is that the individual sets in the partition you obtain here could themselves have very bad expansion; in general you can't say anything about their expansion. So that's what I'm going to talk about next. So in a graph G, if you pick any k disjoint subsets, then at least one of them will have expansion at least lambda k. Okay? This is similar to the easier part of Cheeger's inequality, where you show that the expansion of the graph is at least lambda 2. And probably the more interesting part is that there exists a partition into (1 minus epsilon) k sets, S1, S2 up to S_(1 minus epsilon)k, such that each set has expansion at most poly(1 over epsilon) times the square root of lambda k log k. This is similar to the harder part of Cheeger's inequality, where you show that the expansion of some set is at most the square root of lambda 2. And yeah, in this case these two bounds are tight. The lower bound is tight if you take the Boolean hypercube, and the upper bound is also tight for -- oh, I should also say that this theorem was independently proven by Lee, Oveis Gharan and Trevisan. Okay. So the upper bound is tight for what is known as the noisy hypercube. Basically, you take the vertices of a k-dimensional Boolean hypercube, put a complete graph on them, and the weight of an edge x, y is epsilon to the Hamming distance between x and y. At this point it's easy to show that the first k eigenvalues are at most order epsilon, and with a little bit of work you can also show that any set of measure at most 1 over k will have expansion at least the square root of epsilon log k. This follows from some Fourier-analytic tools, most notably what is known as the reverse Bonami-Beckner inequality.
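As a quick numerical illustration of the eigenvalue claim (my own sketch, not from the talk): the code below builds the noisy hypercube and, for concreteness, also keeps the weight-1 term at y = x as a self loop, which makes the graph regular. Under that assumed convention the nontrivial normalized Laplacian eigenvalues work out to 1 - ((1-eps)/(1+eps))^|S| over coordinate subsets S, so the first k nontrivial ones are about 2*eps.

```python
import numpy as np
from itertools import product

def noisy_hypercube_laplacian(k, eps):
    """Weighted graph on {0,1}^k with w(x, y) = eps ** Hamming(x, y).
    The y = x term (weight 1) is kept as a self loop, making every weighted
    degree equal to (1 + eps)**k; the normalized Laplacian eigenvalues are
    then 1 - ((1-eps)/(1+eps))**|S| over subsets S of coordinates."""
    verts = list(product([0, 1], repeat=k))
    W = np.array([[eps ** sum(a != b for a, b in zip(x, y)) for y in verts]
                  for x in verts])
    deg = W.sum(axis=1)
    return np.eye(len(verts)) - W / deg[:, None]

k, eps = 4, 0.05
lam = np.linalg.eigvalsh(noisy_hypercube_laplacian(k, eps))
print(lam[:k + 1])   # 0 followed by k eigenvalues equal to 2*eps/(1+eps) ~ 0.095
print(lam[k + 1])    # the next eigenvalue is already roughly twice as large
```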
So if you find a k-partition, then at least one of these sets is going to have size at most n over k; therefore this is really the best you can prove for a k-partition. Okay? So first let's do the lower bound; the lower bound is the easy part. Let S1 to Sk be any k disjoint subsets and, again, recall that lambda k is obtained by minimizing over rank-k subspaces and, within that subspace, maximizing x transpose L x divided by x transpose x. So what is the subspace T that you're going to plug in over here? Well, there is only one thing you can do, right? You take T to be the span of the characteristic vectors of S1 through Sk; this has rank k, and at this point you can essentially show that the vector which maximizes this ratio has to look like one of the characteristic vectors of the sets, or something very similar to that. So that is roughly the main idea of the proof. Okay? Okay. So for the k-partition I told you that you can obtain a (1 minus epsilon) k partition such that the expansion is like poly(1 over epsilon) times the square root of lambda k log k. You really cannot do away with the poly(1 over epsilon) over here. You really need that thing over there, because in general, if you wanted an exact k-partition, then there exists a family of graphs for which the max over i of phi of Si can be much, much larger than the square root of lambda k -- you know, larger by a factor of something like k squared divided by the square root of n. The family of graphs essentially looks like this. You start with k cliques and a central vertex, and then you add edges between the central vertex and every other vertex independently with probability p. So I get an upper bound on lambda k from the inequality I showed you on the previous slide: the cliques are k disjoint sets with small expansion, therefore lambda k has to be small. >>: [inaudible] did you say? >> Anand Louis: Huh? >>: What are the… >> Anand Louis: The Si's are cliques, complete graphs. >>: Then what's the point of adding edges [inaudible] >> Anand Louis: No. I'm adding edges between this central vertex and every other vertex with probability p. >>: Yes, but why not just add a fixed number of edges? What is the difference? >> Anand Louis: Okay. So you can add -- well, I wanted to make it… >>: [inaudible] random edges to [inaudible] >> Anand Louis: No. I wanted to make it an unweighted graph. If you are fine with weighted graphs then you can just put weights of p on the edges over here. >>: Okay. >> Anand Louis: No. I just wanted to make it an unweighted graph; that was the point behind this. >>: The point is that [inaudible] identical, right, so [inaudible] random you can just add the [inaudible] edges [inaudible] first. >> Anand Louis: First? >>: [inaudible] >> Anand Louis: No. If you add edges to one vertex over here… >>: No, no. To the [inaudible]. >> Anand Louis: Over here? >>: Yeah, because all of them are the same, so you can [inaudible] whatever. >> Anand Louis: Yeah. Well, so if you add edges -- I really want this to be uniform, because if you do something like that, add edges to some particular vertex over here, then -- if you do it like this, it's easy to argue what this value is going to be; it's easy to prove a lower bound on it. I mean, it should work if you do it that way as well. I'm just saying, you know, regular things look nicer. >>: Like, are you thinking of them as weights? >> Anand Louis: Yes. Think of them as weights; think of all these edges as having weight p. >>: No. No.
The [inaudible] that's a different [inaudible] [multiple speakers] >>: That's a different graph, right? Than having a… >>: It's not a graph. You're giving weights to the edges. That's… >>: So that's a single graph, but [inaudible] graphs, but all of those graphs are kind of more or less isomorphic. >> Anand Louis: Okay. So assume these edges have weight p, if you're fine with weighted graphs. >>: [inaudible] >> Anand Louis: Okay. So in that case we can show the following. In the k-partition that you're going to produce you will have to put the central vertex in some piece, right? And whichever piece you put it in, you are going to pay a huge expansion, because the central vertex has a lot of edges incident on it. So when you put it on some side, phi of that Si is going to be like k times the expansion of the rest of the pieces, and this fact is enough to get a huge gap over here by appropriately choosing the value of p. And so the point that I wanted to make is that you really need this poly(1 over epsilon) here. You cannot place some small constant over there; as the number of parts gets closer and closer to k, that factor has to blow up. Okay? Okay. So let's see how to prove this. So Cheeger's analysis essentially shows that you don't need to start with the second eigenvector only. You can start with any vector X which has small support, and you can obtain a set S which lies in the support of X whose expansion is at most this quantity over here. Okay? I would like to think of this quantity as the average distortion of the l1 embedding given by this vector. The embedding is the canonical one, right, where you map vertex i to the value Xi. Then the term |Xi minus Xj| is like the distortion of this edge, and, under some normalization, this whole quantity is like the average distortion of this embedding. So Cheeger's analysis essentially says that you can start with any vector and find a set S which lies in the support of this vector and whose expansion is at most the average distortion. So if I can find some constant times k such vectors, each having disjoint support and each having small average distortion, then I'm done, right? Because I apply this lemma to each of these vectors and each of them will give me a set. So that's going to be my main goal for the rest of this talk: to come up with order k such vectors which have disjoint support and small average distortion. Okay? So I don't know how to start with such vectors, but the top k eigenvectors give me an embedding into R to the k, the canonical embedding where you map vertex i to the vector consisting of the i-th coordinate of each of the k eigenvectors. Okay? This embedding satisfies some very simple properties. The first one says that the summation of the squared lengths of these vectors is equal to k; this just follows from the fact that you have k eigenvectors, each of length one. And probably the most interesting property is this: the sum over all pairs of the squared inner products is equal to k. This again follows from the fact that the k eigenvectors are orthogonal to each other, but the reason why I want you to notice this is that it is a very strong property. It is not satisfied by, say, some random vectors in R to the k: if you were to pick random vectors in R to the k and normalize them in this way, you would get that this sum of squared inner products is like k squared.
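As a sanity check on these two properties of the spectral embedding (my own sketch; for a d-regular graph the degree weighting used later in the talk is just a constant factor and is dropped here), both follow from the fact that the n x k matrix of eigenvectors has orthonormal columns:

```python
import numpy as np

def spectral_embedding(A, d, k):
    """Rows v_1, ..., v_n of the returned n x k matrix are the embedding:
    vertex i is mapped to the i-th coordinates of the eigenvectors of the
    k smallest eigenvalues of the normalized Laplacian of a d-regular graph."""
    n = A.shape[0]
    L = np.eye(n) - A / d
    _, vecs = np.linalg.eigh(L)
    return vecs[:, :k]                    # orthonormal columns

# Both properties follow from U having orthonormal columns:
#   sum_i ||v_i||^2       = trace(U U^T)  = k
#   sum_{i,j} <v_i,v_j>^2 = ||U U^T||_F^2 = k
U = spectral_embedding(np.ones((6, 6)) - np.eye(6), 5, 3)   # toy example: K_6, d = 5, k = 3
print(np.sum(np.linalg.norm(U, axis=1) ** 2))               # ~ 3
print(np.sum((U @ U.T) ** 2))                               # ~ 3
```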
So this sum being equal to k indicates somehow that there are already some clusters among these vectors, which we will crucially exploit later on. It is also easy to see that the average distortion of this embedding is at most lambda k, right? This follows from the fact that each coordinate has distortion at most lambda k, because these are the first k eigenvalues. All right? Any questions? Okay. So the rounding algorithm is very simple. You pick k random Gaussian vectors, g1 through gk. Let X1 be the projections of all the vi's onto g1, let X2 be the projections of all the vi's onto g2, and so on, and let Xk be the projections of all the vi's onto gk. And we want these vectors to have disjoint support, right? So what's the most obvious thing you would do? Go to each row and zero out all but the largest value. Okay? So let me look at the first row. Let's say the first value is the largest, so I keep that and zero everything else in the first row. Then I go to the second row, keep whatever is the largest value and zero out everything else. And I do that for each row. So now I have k vectors with pairwise disjoint supports, and I need to show that all of them, or some constant fraction of them, have low distortion. So is the algorithm clear to everyone, or any questions? Okay. So why does this algorithm work? An intuitive reason is this: if you think of the blue vectors as the vi's and the brown vectors as the Gaussian vectors that you picked, then essentially what the algorithm is doing is that the support of X1 is going to look like this cluster around g1, roughly or mostly, and the support of X2 is going to look like the cluster around g2, and so on. >>: [inaudible] the square distances. >> Anand Louis: Yeah. So I'm just saying that the algorithm I described is going to find these clusters, roughly, or some large fraction of them. And that is roughly what we want to do, because I want to find these clusters, and if the clusters are kind of close together then they would give us good sets, and so on. And that's what I'll prove. Okay? Okay. So roughly the analysis is going to go like this. Recall that this is the average-distortion term, and I want to show that it is small for a constant fraction of the indices. I'll show you that the expected value of the denominator for each of these vectors is like log k, and the expected value of the numerator is like the square root of lambda k log cubed k. Okay? So the ratio is like what we want. But is this sufficient for us? Probably not, right? So I'll also show you that, for a constant fraction of the indices, the expected value of the ratio is bounded by some constant times the ratio of the expected values, which is like the square root of lambda k log k. Okay? So let's do the denominator first; that is the easy part. Again, keep this picture in mind. This is the first entry; I did not zero this out, and I zeroed out everything else in the first row. So what is the probability that I did not zero this out? Well, all of the gi's are independently chosen, so any of those projections could have been the largest; the probability that this one is the largest is 1 over k, right? Okay. So let's say that I did not zero this out, which happens with probability one over k, and suppose I know that this is the largest in this row: what does its expected value look like?
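Before continuing with the expectations, here is a rough sketch of the rounding step just described (my own code, not the authors'; the embedding matrix is the one from the earlier sketch, and keeping the entry of largest magnitude in each row is an assumption on my part).

```python
import numpy as np

def gaussian_rounding(U, rng=None):
    """Given the n x k embedding (rows v_i), project onto k random Gaussian
    directions and, in each row, keep only the largest-magnitude projection.
    Returns an n x k matrix whose columns are the disjointly supported
    vectors X_1, ..., X_k."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = U.shape
    G = rng.standard_normal((k, k))          # columns are g_1, ..., g_k
    X = U @ G                                # X[i, j] = <v_i, g_j>
    keep = np.argmax(np.abs(X), axis=1)      # winner in each row (by magnitude)
    mask = np.zeros_like(X, dtype=bool)
    mask[np.arange(n), keep] = True
    return np.where(mask, X, 0.0)            # zero out everything else
```

Each column of the result then plays the role of a disjointly supported, small-support vector that can be fed into the Cheeger-type rounding lemma from the previous slide.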
So this is a simple exercise: if you were to pick k standard normal random variables and look at the expected value of the maximum of their squares, it is -- I'm sorry -- at most about 2 log k. Okay? So keeping this in mind, I get that the expected value of the i-th entry in the first column is about the norm of vi squared times log k, conditioned on it being nonzero. Okay. So its overall expected value is exactly this term divided by k: the expected value of the i-th term is the norm of vi squared times log k, divided by k. To compute the expected value of the denominator I just sum this over all entries, right? And recall that I have normalized the vi's in such a way that the summation over i of di times the norm of vi squared is exactly equal to k. So the denominator is, in expectation, like log k. Okay? Any questions? So this was the easy part; the numerator requires a bit of work. The numerator consists of terms which look like |Xi minus Xj|, right? And as before, I can calculate the expected value of Xi assuming Xi is nonzero, and I can calculate the expected value of Xj when Xj is nonzero. So Xi is like the norm of vi squared times the inner product of vi tilde with g1, squared, if this projection is the largest in its row. Again, throughout all of this I want you to keep this picture in mind: these are the projections, and this is the thing that I did not zero out. Okay? So Xi is like the norm of vi squared times this random variable if it is the largest in its row, and otherwise it's zero; and the same for Xj. So there are four cases here. Case one: both of them are zero, in which case I don't care. Case two: both of them are nonzero. Well, I'd ask you to believe me on this, but the probability of that happening is small, so that is not something you should worry about. The most interesting case is when one of them is zero and the other is nonzero, so we need to bound the probability of that event happening. So this is what it looks like. There's this unit sphere, vi tilde and vj tilde are vectors on this unit sphere, and I want to calculate the probability that Xj is greater than zero and Xi is zero, right? So suppose g1 were to look something like this. When would Xj be nonzero? Roughly, when vj tilde aligns very well with g1, right? So keep this picture in mind; this was my rounding algorithm. When would the projection onto g1 be the largest in its row? When it's the largest among all of these projections, which means the vector is aligning very well with g1, or at least better than it aligns with all of the other gi's, right? So if Xj is going to be nonzero, then most probably vj tilde lies in this shaded region around g1 over here, because if it does not lie in this shaded region then with high probability it aligns much better with some other gi, okay? So to calculate the probability that Xj is greater than zero and Xi is zero -- that should depend on the distance between these two vectors, right? If these two vectors are already close to each other then these two random variables would behave the same; if these two vectors are orthogonal then they just behave independently, right?
So using this you can show that the probability of a cut being made -- in the sense that one of them is zero and the other is nonzero -- is upper bounded by 1 over k times the distance between these two vectors times the square root of log k. The square root of log k comes in because it is like the size of a cap on the unit sphere. Okay? Oh, and this was formally proven by Charikar, Makarychev and Makarychev. Any questions? Okay. So moving on, now I'm ready to calculate the expected value of |Xi minus Xj|. When Xi is zero and Xj is nonzero, the expected value of Xj looks like this, the norm of vj squared times log k, and the probability of that event happening is this; and when the opposite event happens, Xi is nonzero and Xj is zero, the expected value looks like this, and again the probability of that happening is this, right? So, well, I ask you to bear with me on this slide; this is the only slide that has this much math on it. By doing some simple rearrangement I can upper bound it by this term: 1 over k times the sum of the lengths times the distance between the two vectors, times log k to the three halves -- I'm collecting the square root of log k from the probability and the log k from the expected value. Okay? So at this point, whenever I see a sum that looks like this, it's sort of asking me to plug in the Cauchy-Schwarz inequality, right? You sum this over all edges and apply Cauchy-Schwarz. So what do you get? The one over k stays as is, and you get something like the square root of the sum over all edges of the norm of vi minus vj squared, times the square root of the summation over i of the norm of vi squared. And what were these values? If you remember the picture that I showed you before, the summation over the edges of the norm of vi minus vj squared is lambda k times this normalization, which is equal to k; so the summation over the edges of the norm of vi minus vj squared is lambda k times k. Let me plug that in over here: this is lambda k times k, and this other thing is just k, so the k's cancel against the one over k and you are left with the square root of lambda k log cubed k. Okay? Any questions? Okay. So let's do a quick overview. We bounded the probability that one of them is zero and the other one is nonzero, which was this, and we know how to calculate the expected value of each of them separately. I didn't show you what to do when both of them are nonzero, but that is a very simple case and it happens with very small probability, so you can ignore it. Once you know how to do this, you just sum over all edges, apply Cauchy-Schwarz, and you are done. Okay? Okay. Until now I showed you that the denominator is log k and the numerator is the square root of lambda k log cubed k, and now I'll show you that for a constant fraction of the indices, the expected value of the ratio is bounded by some constant times the ratio of the expected values, which is the square root of lambda k log k. Okay? So again, in this part we will crucially use the property that we have for our embedding, that the total squared inner product is small -- the sum over all pairs is only k -- and this already indicates that the vectors are highly correlated, that they fall into clusters. So as a thought experiment, consider the following case, where any two of the vi's are either the same or orthogonal to each other. So any two vi and vj are either equal or orthogonal. In this case I claim that the rounding algorithm is like throwing k balls into k bins. So what are the balls and bins over here?
So the k bins are the k Gaussian directions that you picked, and the k balls are the k clusters of the vi's that you have. And why is this like throwing balls into bins? If you recall what our rounding algorithm did, you took the projections of some vi onto the gi's and, whichever one was the largest, you sort of assigned vi to that direction, right? So the Gaussian directions are like the bins, and I say that a ball has gone into some particular bin if the vi that I picked projects the largest onto that gi, right? So the rounding algorithm is sort of like throwing k balls into k bins, and what happens when you throw k balls into k bins? You would expect that something like a 1 minus 1 over e fraction of the bins are nonempty, right? >>: [inaudible] you don't have independence, right? You need independence. >> Anand Louis: No, that's why I say: assume, for this thought experiment, that any two vi's are either equal or orthogonal to each other. In this case, when one vertex goes, its whole cluster goes with it, and since the clusters are orthogonal they behave independently. This is just intuition. And so in this case you get that a constant fraction of the denominators are large -- you know, whenever a ball goes into a bin, that bin is large. So this is the intuition behind trying to show that the denominator is large, but unfortunately this argument does not work directly; we could not make it work. It probably works, I don't know. We have to go through the boring way of trying to bound the variance of the denominator. Okay? So what does the variance look like? The variance involves the summation over pairs of the expected values of Xi times Xj. And what does this term look like? We know the expected value of Xi when it's nonzero; it's something like the norm of vi squared times log k. We know the expected value of Xj when it's nonzero; it's something like the norm of vj squared times log k. And what we need to bound is the probability that both of them are nonzero, right? So again, recall that Xi was this random variable if the projection onto g1 was the largest in its row, and similarly for Xj. I'll show you that the probability that both of them are nonzero can be upper bounded by the inner product of vi tilde and vj tilde, squared, divided by k, plus 1 over k squared. Okay? So again, keep this picture in mind, and let's consider two boundary cases. First, suppose vi tilde and vj tilde are orthogonal to each other. In that case, what is the value of this probability? If vi tilde and vj tilde are orthogonal, then these two random variables are independent, right? Xi and Xj are totally independent random variables, so the probability that both of them are greater than zero is 1 over k squared. The other boundary case to look at is when vi tilde is equal to vj tilde. In that case, what is the probability that both of them are greater than zero? It's just one over k, right, which we have over here. So essentially what this statement is saying is that the general case is like a convex combination of the two boundary cases: one where they are completely independent and one where they are completely correlated. Proving this requires a bit of work and I'm not going to prove it for you; you just have to believe me. Roughly, the high-level idea is that you can think of each Xi as a function from Gaussian space to the real line, because each Xi is a function of vi and all of g1 through gk.
And you can write this function in terms of its Hermite basis; the Hermite polynomials form a complete basis for Gaussian space. So this, plus some Fourier-analytic tools on top of it, gives you this statement, essentially. But yeah, proving that requires some amount of work. Anyway, let's assume this and try to finish the computation of the variance. So I have the norm of vi squared times the norm of vj squared times log squared k, times this probability term over here. Rearranging the summation, I can write this as log squared k times the sum over i, j of the inner product of vi and vj squared, divided by k, plus log squared k times the square of the summation over i of the norm of vi squared, divided by k squared. Right? And what are these two values? Again, going back to this picture, note that the summation over i, j of the inner product of vi and vj squared is equal to k, and the summation over i of the norm of vi squared is equal to k. So if I plug these in, what do I get? This term is equal to one, and this term is also equal to one, so the variance is at most 2 log squared k. Any questions at this point? I guess the most crucial part here is that this is where we crucially use the fact that the total squared inner product is small, which indicates that your vectors are strongly correlated, and this is where we are using this correlation. If we were to pick n random vectors in k dimensions, they would not satisfy this bound, and that is what we are crucially using here. So at this point we are sort of done, right? I showed you that the expected value of the numerator is the square root of lambda k log cubed k, so using Markov's inequality, something like 99% of the numerators are smaller than, say, a hundred times this. And I can bound the denominator by using Chebyshev's inequality: the expected value of the denominator is log k and the variance is log squared k, so the probability that the denominator is at least log k over 2 is something like one fourth. So a constant fraction of your denominators are large; therefore, taking a union bound over these two events, there are at least something like k over 8 indices for which the numerator is small and the denominator is large, and that's it. So we are done. Any questions? >>: [inaudible] take care of all the [inaudible] constant [inaudible] >> Anand Louis: Right, so I've actually proved a weaker theorem for you. I told you that you can get (1 minus epsilon) times k sets; to do that you need to do this much more carefully, but right now this proof only shows that you get some constant times k sets. That requires pretty much the same idea but a much more careful analysis. Okay? So you can also ask the computational question: given a parameter k, find the best k-partition so as to minimize the maximum expansion. This can be thought of as an extension of the sparsest cut problem. It is actually a very tricky problem because, you know, any of the standard SDP formulations that you try can have an unbounded integrality gap; it can be as large as you want it to be. In the standard formulation you would have indicator vectors -- k vectors for each vertex to indicate which set the vertex belongs to -- and you can really show that this has an integrality gap as big as you want.
And so, I did not say this in much detail, but [inaudible] came up with an interesting SDP formulation which inherently takes care of all of the problems that we seem to have with the standard SDP formulation, and using a similar rounding algorithm we can get an algorithm that outputs (1 minus epsilon) times k sets, each having expansion at most something like the square root of log n log k times the optimum. And, as [inaudible] noted, an approximation algorithm for this does not imply anything for the min-sum partitioning. Okay? So yeah, that's all I have to say. Probably the most pressing question is: can you get rid of this epsilon loss in the guarantee that we had? Can you get exactly k non-empty disjoint subsets such that each has expansion at most the square root of lambda k log k, right? >>: Your [inaudible] gap is the gap you showed [inaudible] slides [inaudible] example. >> Anand Louis: No. There, that was for a partition; right now I'm just saying that there should be k non-empty disjoint subsets. I'm not insisting on a partition, so this could just be k non-empty disjoint subsets. If you remember the example that I showed, the k cliques would themselves be these k sets that you want. Another problem of interest related to this is the small set expansion problem, where given a parameter k you want to find a set of size n over k which has the least expansion. What I showed you today also implies an upper bound for small set expansion, right? If you find a k-partition, then one of those sets is going to be smaller than n over k. For small values of k that's fine, but for large values of k it can be improved. Arora, Barak and Steurer showed that you can find a set of size n over k whose expansion is at most lambda sub k to the c, where c is some constant like 65, times the square root of log n to the base k. So this is interesting when k is like n to the epsilon, where you essentially lose the log n to the base k factor. So if you're interested in subexponential time algorithms for unique games, then the k that you are going to plug in is going to be something like n to the epsilon, and in that regime this is more interesting than the theorem I showed you. And that's it. Thanks for coming, and I would be glad to take questions. >>: Here you always examine the number of edges from Si to the rest of the graph. Are there any results in which you care about the edges between Si and Sj, over all pairs i, j? That would be much more… >> Anand Louis: That would be like some regularity partition. >>: I don't know what you mean by that. >> Anand Louis: So that's getting into things like Szemerédi regularity and so on. Yeah, I haven't seen any results for problems like the ones that you describe. >>: [inaudible] what you want is small expansion into each, which is a much [inaudible] condition. >>: So you want to look at the maximum over pairs and minimize that. >>: For example. [inaudible] I take some normal whatever [inaudible] >>: What you want in this case is take [inaudible] >>: Yes, yes, exactly. Even just taking, having [inaudible] is much harder. >> Anand Louis: Yeah, I doubt it has any connection to the eigenvalues, but… >>: Why the obsession with eigenvalues? I mean, it's nice to have, okay, but why? What does it mean to connect the k eigenvalues to partitions, or to learn about partitions?
>> Anand Louis: Well, the reason why we started looking… >>: [inaudible] >> Anand Louis: Hmm? >>: I thought your aim was to learn about partitions, so [inaudible] >> Anand Louis: Right, so the reason why we started looking at this problem… >>: [inaudible] one method [inaudible] eigenvalues. >> Anand Louis: Yes, I'm saying that. >>: It's not compulsory. >> Anand Louis: Right. But yeah, the reason why we started looking at this is because of its connection to unique games: like I said, if you get better bounds for this, you could get slightly better algorithms for unique games. That is why this expansion problem gained fame in the first place. >>: Also, I would say, if you are interested in [inaudible] eigenvalues [inaudible] so that's what people [inaudible] in practice [inaudible] to partition data [inaudible] >>: And so the same [inaudible] in place for the methods [inaudible] >>: [inaudible] do things that seem to be computationally [laughter] >>: I know. I know, but the answer is we do it because we can. >>: Yes. [inaudible] because maybe [inaudible] cures these algorithms really [inaudible] so well that it becomes [inaudible] something [inaudible] >>: More questions? Maybe more suggestions for your problems? Is that even better? [inaudible] Let's thank the speaker again. [applause]