>> Yuval Peres: Laurent Massoulie from the MSR-Inria Joint Centre in Paris will tell us about community detection thresholds.

>> Laurent Massoulie: Thank you, Yuval. It's a pleasure to give this talk this morning. So what is it about? Community detection is basically the same as clustering: identifying groups of objects with similar characteristics within a global population. Embedding is a closely related objective: you may want to embed your individuals in some space and then do clustering after having performed the embedding. If I had to state one application, I would say this is a useful primitive, for instance, for recommending contacts on online social networks: you might process the friendship graph and recommend that users connect to the people who constitute their implicit community. So much for motivation; here is another application, but I will not dwell on it.

The main character in this talk is the stochastic block model. It is a random graph model with n nodes, n being large, in which each node belongs to some community. Specifically, we sample spins for the nodes in an i.i.d. fashion, each node picking one of k types, and then, conditionally on those spins, we decide independently whether to create an edge between each pair of nodes, with a probability that depends on the underlying types. The edge probability is a function B(σ_u, σ_v) of the types of the two endpoints u and v, scaled by s/n, where n is the population size and s should be thought of as the signal strength of the observation: it sets the scale of the average degree of the graph. What we observe is a realization of this random graph, and we would like to recover the underlying communities as well as possible.

Before stating more precisely what I want to cover in the talk, let me recall classical spectral clustering. It consists of processing the adjacency matrix, extracting the eigenvectors corresponding to the largest eigenvalues, and doing an embedding with these eigenvectors after normalizing them: if you pick r eigenvectors, you can embed the nodes in r-dimensional Euclidean space and then run k-means clustering or the like on the embedded points.

What I really want to tell you about today is a phase transition that occurs when you try to infer the underlying communities from a very sparse observed graph, with average degree of order one. But before getting to that, I will take a detour through the case where we have more edges to work with, because it lets me introduce some notions of spectral separation that will be helpful in understanding the remainder.

So let us assume to start with that we have what I call a rich signal, that is, the average degree parameter s is at least logarithmic in the system size. In that case, modulo the assumption that the clusters are statistically distinct, which is not a real restriction, because clusters that are statistically indistinguishable should be regarded as a single cluster, classical spectral clustering will work, essentially because the spectrum of the matrix A consists of a small number of eigenvalues that stand out, their number r being at most the number of underlying communities.
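To make the model and the baseline algorithm concrete, here is a minimal numpy sketch of sampling a stochastic block model and running classical spectral clustering. The function names and parameter values are illustrative assumptions, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, s, B):
    """Spins are i.i.d. over k types; an edge {u, v} is then drawn
    independently with probability (s / n) * B[sigma_u, sigma_v]."""
    k = B.shape[0]
    sigma = rng.integers(0, k, size=n)           # hidden community labels
    P = (s / n) * B[np.ix_(sigma, sigma)]        # conditional edge probabilities
    U = rng.random((n, n))
    A = np.triu((U < P).astype(float), k=1)      # independent edges, upper triangle
    return sigma, A + A.T                        # symmetric adjacency matrix

def spectral_embedding(A, r):
    """Classical spectral clustering step: embed nodes using the r
    eigenvectors whose eigenvalues have largest magnitude."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:r]
    return vecs[:, top]

# Illustrative "rich signal" run: s on the order of log n, two communities.
n, s = 2000, 20.0
B = np.array([[1.5, 0.5],
              [0.5, 1.5]])                       # intra > inter connectivity
sigma, A = sample_sbm(n, s, B)
X = spectral_embedding(A, 2)
guess = (X[:, 1] > 0).astype(int)                # threshold the 2nd coordinate
```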
These outstanding eigenvalues have magnitude at least of order s, while the remaining eigenvalues are negligible, on the order of the square root of s. So in principle, by putting up a threshold, you can extract the right eigenvalues and do the embedding; moreover, what is true in this regime is that the node representatives in this embedding cluster according to the underlying communities, except perhaps for a vanishingly small fraction of the nodes. Here is a simulation of this setup with four blocks, where n of course does not go to infinity: there is some scattering around the points where the node representatives are supposed to concentrate, but you can already see the phenomenon emerging.

Let me say a few words about why this holds. Recall that the adjacency matrix we work with decomposes into its expectation, which has a nice block structure, rank on the order of the number of blocks, and eigenvalues on the order of the signal strength s, plus the centered adjacency matrix, a random perturbation added to the expected matrix. What we want to leverage is a form of spectral separation, essentially saying that the random matrix has a very small spectral radius, which entails that the spectral structure of the perturbed matrix is very close to that of the unperturbed one.

The prototype of this spectral separation is the so-called Ramanujan property, introduced in the '80s by Lubotzky et al. It is defined for s-regular graphs and says that the second-largest eigenvalue should be as small as possible, namely on the order of the square root of s. That is the definition of a Ramanujan graph, and we know from Friedman's work in 2008 that a random regular graph is, with high probability, almost Ramanujan, in that its second-largest eigenvalue is close to the bound in the definition of Ramanujan graphs.

What we need in order to establish what I just described is something else, on Erdős–Rényi graphs this time, not regular graphs. Feige and Ofek established the analogous property in 2005, as long as the average degree is at least logarithmic: the second-largest eigenvalue of an Erdős–Rényi graph is still on the order of the square root of the average degree, and a related result is that if you center the adjacency matrix, you get a perturbation matrix whose spectral radius is on the order of the square root of the average degree. These are all relaxations of the original Ramanujan definition: you could say we have almost-Ramanujan graphs, somewhat-Ramanujan graphs, and we will see further weakenings of the definition that are useful for our purposes.

With this result in hand, you can go back to the stochastic block model and consider the adjacency matrix after centering. Using Feige–Ofek, you know its spectral radius is on the order of the square root of the signal strength, which is small compared to the leading eigenvalues, of order s, of the expected matrix, so we can indeed say that the spectral structure of the adjacency matrix is close to that of the expected adjacency matrix. But we can also see that this breaks down if we let the signal strength go down: the spectral radius of the noise matrix is dominated by the largest degrees, and for order-one average degrees the spectral radius is of order sqrt(log n / log log n), whereas s may be as low as one.
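A small numerical illustration of this spectral separation, in the same spirit (a sketch under assumed parameters, not the speaker's code): for an Erdős–Rényi graph with average degree d around log n, the centered adjacency matrix should have spectral radius within a constant factor of sqrt(d), far below the leading eigenvalue of order d.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1500, 15.0                                   # d is roughly log n here
p = d / n
U = rng.random((n, n))
A = np.triu((U < p).astype(float), k=1)
A = A + A.T                                         # Erdos-Renyi adjacency matrix

centered = A - p * (np.ones((n, n)) - np.eye(n))    # subtract the expectation
rho = np.max(np.abs(np.linalg.eigvalsh(centered)))  # spectral radius of the noise
lead = np.max(np.linalg.eigvalsh(A))                # leading eigenvalue, about d
print(f"noise radius {rho:.2f} vs sqrt(d) {np.sqrt(d):.2f}; leading {lead:.2f}")
```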
So classical spectral clustering has to break down once the signal strength is small compared to sqrt(log n / log log n): the noise eigenvalues then swamp the signal. But we may still do things other than classical spectral clustering, and that is what I want to get to, focusing now on weak signal strength and on an interesting phase transition phenomenon in that regime.

So let us now assume the signal strength is of order one. We then know we cannot exactly recover the underlying clusters, because the graph contains isolated nodes, for instance, and there is no way to tell which community an isolated node belongs to. So we have to settle for a less ambitious objective: achieving a good overlap. We guess community labels and try to make the agreement between the estimated labels and the true underlying labels as large as possible. That is what the overlap metric measures: it counts the fraction of nodes for which we guessed right, with an offset subtracted to account for the fact that you could assign every node the same type, which would not be meaningful in any way.

With this definition at hand, the naive guess is that as you reduce the signal strength, the best achievable overlap is positive, somewhat less than one, and decreases continuously down to the point where the giant component disappears. But it turns out that something more interesting happens: there is a transition point, prior to the disappearance of the giant component, below which the overlap has to be zero even though a giant component still exists. This is the intermediate phase I am illustrating here, and it is what I want to look at now.

Let me specialize further to the simplest nontrivial stochastic block model you can imagine: two communities of roughly equal sizes, so the spins are now plus or minus signs, and there are just two parameters. The parameter a characterizes the probability of an intra-community edge, which is a/n, and the parameter b characterizes the probability of an inter-community edge, which is b/n.

In that context, physicists, Decelle et al. in 2011, conjectured that there is a threshold parameter τ depending on a and b, namely τ = (a-b)^2 / (2(a+b)), such that for τ less than one the overlap has to be zero: you cannot make any meaningful inference about the underlying communities; the signal is simply not useful to that end. This part of the conjecture was proven in 2012 by Mossel, Neeman and Sly. The other half of the conjecture of Decelle et al. is that if τ is above one, then positive overlap can be achieved; in the original paper they expected this to be achievable using belief propagation, that is, message passing algorithms, and they gave numerical evidence that this is indeed the case. There is also a more recent conjecture, in the 2013 paper of Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová and Zhang (KMMNSZZ), where they come up with a spectral algorithm they call spectral redemption and conjecture that it achieves positive overlap whenever τ is larger than one.
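In symbols, for two communities the overlap and the threshold parameter look as follows. This is one standard normalization of the overlap, consistent with the description above but not pinned down exactly in the talk; sigma and guess are ±1 arrays.

```python
import numpy as np

def overlap(sigma, guess):
    """Fraction of correct labels, up to a global sign flip, with the 1/2
    achievable by constant guessing subtracted and rescaled to [0, 1]."""
    n = len(sigma)
    agree = max(np.sum(sigma == guess), np.sum(sigma == -guess)) / n
    return 2.0 * (agree - 0.5)          # 0 for blind guessing, 1 if exact

def tau(a, b):
    """Decelle et al. threshold: with intra/inter edge probabilities a/n
    and b/n, positive overlap is achievable if and only if tau > 1."""
    return (a - b) ** 2 / (2.0 * (a + b))
```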
But until November 2013 there was no proof that the positive part of the conjecture held. Now we are in a much better state of affairs, because we have two proofs: I came up with one, and then a week later Mossel, Neeman and Sly posted yet another, so we have two different methods for achieving positive overlap above the transition point.

Let me now tell you how my proof works and what method achieves positive overlap above the transition point. It is a modified spectral method, and the key is to introduce the right matrix on which to do the spectral analysis. We no longer work directly with the original adjacency matrix; instead we construct a matrix that counts adjacency at a distance. More precisely, I take a path length parameter ℓ and, for each pair of nodes i and j, I count the number of self-avoiding paths of length ℓ in the graph between i and j. That is what the matrix B is. The typical situation is that a node i has a tree-like local neighborhood, in which case B_ij is one if the graph distance between i and j is precisely ℓ. That is the typical case; in the presence of cycles it can be different. For instance, in the second case here you may have two self-avoiding paths of length ℓ between i and j. You may ignore the third case for now.

The main result is about the spectral structure of this matrix, which then implies that we can do clustering and achieve positive overlap. If we pick the path length to be logarithmic in the system size, the spectral structure is as follows. There is a leading eigenvalue of order α^ℓ, where α, in this two-community model, is (a+b)/2. There is a second eigenvalue of order β^ℓ, where β, the other key parameter of the model, is (a-b)/2. We also know that the corresponding eigenvectors are aligned with vectors we know quite well: for the first eigenvalue, the eigenvector is aligned with the vector obtained by applying the matrix B to the all-ones vector; for the second eigenvalue, with the vector obtained by applying the matrix B to the spin vector. And the third and remaining eigenvalues are O of essentially the square root of α^ℓ. So -- yes?

>> What is the dependence on the constant c if one chooses ℓ as c times log n? Is this true for --

>> Laurent Massoulie: I need c to be positive, but I need c log α to be less than one fourth. So there is a constraint on c here, which has to do with the presence of cycles in the neighborhoods of the nodes. But that is the range of parameter values I can cope with.

The fact that the third eigenvalue is, up to a factor n^ε for an arbitrary positive ε, of the order of the square root of the first is what I call a weak Ramanujan property. And the final statement in the main result is that the second eigenvector correlates with the underlying communities, so I can threshold the second eigenvector and achieve community detection.

All right. What I want to do in the remainder of the talk is describe the key ingredients of the proof and then conclude. Here is just an illustration that it seems to work in practice: I tried it out, and here you should see the overlap become positive. It is not completely obvious; maybe I should do more simulations to confirm the theory. But at least it does not disprove it.

>> Confirm the simulations?
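For reference, here is a brute-force sketch of the matrix B just defined, together with the thresholding of the second eigenvector. Enumerating self-avoiding paths is exponential in ℓ, so this only illustrates the definition on tiny graphs; it is not the speaker's implementation.

```python
import numpy as np

def path_matrix(adj, l):
    """B[i, j] = number of self-avoiding paths with exactly l edges
    between i and j (brute-force enumeration; tiny graphs only)."""
    n = adj.shape[0]
    B = np.zeros((n, n))

    def extend(path):
        u = path[-1]
        if len(path) == l + 1:             # the path uses exactly l edges
            B[path[0], u] += 1
            return
        for v in range(n):
            if adj[u, v] and v not in path:  # self-avoiding: no repeated node
                extend(path + [v])

    for i in range(n):
        extend([i])
    return B

def community_guess(B):
    """Threshold the eigenvector of the second-largest eigenvalue of B."""
    vals, vecs = np.linalg.eigh(B)
    second = np.argsort(-vals)[1]
    return np.where(vecs[:, second] > 0, 1, -1)
```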
>> Laurent Massoulie: So the key step is a matrix expansion. To set it up, I introduce the expectation of the adjacency matrix conditional on the spins. This is a simple matrix, essentially of rank two if you ignore the diagonal terms, and it can be expressed in terms of the all-ones vector and the vector of spins. Based on this, it is useful to introduce what I call a centered path matrix, constructed exactly as the matrix B was from the original adjacency matrix, but now from the centered adjacency matrix Ā: its (i, j) entry is the sum, over self-avoiding paths between i and j, of the products of the centered entries along the path.

Once I have this at hand, I can write an expansion: I expand those products and group the terms according to the position at which the last Ā factor appears. What you see by doing so is that after this last Ā factor only A factors remain, and since we are considering self-avoiding paths, these contribute terms that correspond, more or less, to the matrix B for a shorter path length. So you find that your matrix B is this perturbation matrix plus an expansion involving matrices B of lower index and perturbation matrices of lower index.

This is the first step, and it is crucial, because we can do some work on these perturbation matrices, whereas working with the B matrix directly is hard. Indeed, we can use classical tools for controlling the spectral radius of random matrices, namely the trace method: we look at the trace of the matrix raised to some power and use combinatorial arguments, leveraging Füredi and Komlós's work from the '80s; there is a paper by Füredi and Komlós that does exactly this kind of control. We throw in the additional ingredient that we are considering self-avoiding paths, and in the end this gives us control of the spectral radius: essentially, the perturbation matrix, ignoring the first term, has spectral radius on the order of the square root of the average degree raised to ℓ. So that is one key ingredient.

The second ingredient is what you could think of as a local analysis: we now work with the local structure of the neighborhoods in the graph. The point is to show that the sizes of the neighborhoods at distance t, as well as the sums of spins at distance t, follow a quasi-deterministic growth pattern. That is what is written here in red: the number of neighbors at distance t is roughly the number at distance t-1 scaled by a constant, plus some perturbation, and similarly for the sum of spins. This is an intermediate step; I am nearly done, and I will put it together with the control of the spectral radius on the next slide.
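The quantities in this local analysis are easy to observe numerically. Here is a sketch, assuming an adjacency matrix and a ±1 spin vector as in the earlier sketches, of the neighborhood sizes and spin sums whose quasi-deterministic growth, roughly a factor α per step for the sizes and β per step for the spin sums, is being claimed.

```python
import numpy as np
from collections import deque

def level_statistics(adj, sigma, root, tmax):
    """Return, for t = 1..tmax, the number of nodes at graph distance
    exactly t from root and the sum of their spins."""
    n = adj.shape[0]
    dist = np.full(n, -1)                  # -1 marks unvisited nodes
    dist[root] = 0
    queue = deque([root])
    while queue:                           # plain BFS out to depth tmax
        u = queue.popleft()
        if dist[u] == tmax:
            continue
        for v in np.nonzero(adj[u])[0]:
            if dist[v] < 0:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [(int(np.sum(dist == t)), int(np.sum(sigma[dist == t])))
            for t in range(1, tmax + 1)]
```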
Now, if you look at the supremum, over norm-one vectors x orthogonal to the candidate eigenvectors, of quantities like x applied to the lower-index matrices and the all-ones vector, you can control that: you get a square-root term, but you also get an α^(m/2) term. The reason this holds is that the candidate vectors are, in a precise sense, close to the vectors of neighborhood sizes and spin sums at given distances. So if you force x to be orthogonal to these two vectors, it is orthogonal to the leading terms here, and you are left with the perturbation terms only.

Putting everything together, we can show that if we restrict ourselves to unit vectors orthogonal to the candidate eigenvectors, then Bx has a norm of the order we are after, namely the square root of the average degree raised to ℓ. This is basically a combination of the ingredients I have just given: the expansion, the controls on the spectral radii of the coefficients in the expansion, and this last fact; putting it all together gives the result.

The rest is much easier; this was the most onerous step of the proof. You conclude by leveraging the Courant–Fischer variational theorem and controlling the norms of the B matrix applied to the candidate eigenvectors, and this lets you conclude about the spectral structure. There is more work to be done to show that the second eigenvector correlates positively with the underlying community structure, but this is again the local analysis: working on the local neighborhoods of nodes, characterizing how they behave, and relating them to the random tree model that is the natural model for those neighborhoods.

So I am done; let me conclude now and mention some outlook. The key point of the method is that you can recover this Ramanujan-like spectral separation by using these path expansion techniques, working with the matrix B rather than with the original adjacency matrix. This may have consequences beyond this highly stylized model. For instance, we have a generalization of the conjecture for a model with not just spins on the nodes but also labels on the edges, something we introduced motivated by the Netflix prize dataset, where an edge joins a movie and a user and is labeled with the number of stars the user gave the movie as a rating. We have a generalization of the threshold phenomenon for those labeled stochastic block models, and work remains to be done to see whether this path expansion can be generalized to prove the generalized conjecture.

The technique may also be used to prove the other conjecture, made more recently in the KMMNSZZ paper. I have not said what spectral redemption is, but in a nutshell: the way they propose to identify communities is to form an edge-to-edge matrix on directed edges. You connect one edge to another if they share an endpoint in the right way: the head of the input edge must be the tail of the output edge, and backtracking is forbidden, so you cannot go back along the edge you came from. This defines a non-backtracking edge-to-edge matrix, and their conjecture is that above the threshold it has a second eigenvalue that stands out, whose eigenvector can be leveraged much as I have been describing. There is hope that the same method, the matrix expansion and the trace bound, could establish this.
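To make "spectral redemption" concrete, here is a sketch of that non-backtracking edge-to-edge matrix, following the construction just described; anything beyond the construction itself (names, the dense representation) is an illustrative choice, not from their paper.

```python
import numpy as np

def nonbacktracking_matrix(adj):
    """Entry ((u, v), (v, w)) is 1 exactly when w != u: the head of the
    input directed edge is the tail of the output edge, no backtracking."""
    n = adj.shape[0]
    edges = [(u, int(v)) for u in range(n) for v in np.nonzero(adj[u])[0]]
    index = {e: i for i, e in enumerate(edges)}   # directed edge -> row index
    m = len(edges)
    B_nb = np.zeros((m, m))
    for (u, v), i in index.items():
        for w in np.nonzero(adj[v])[0]:
            if int(w) != u:                       # forbid going back along (v, u)
                B_nb[i, index[(v, int(w))]] = 1.0
    return B_nb, edges
```

One would then take the eigenvector of the second-largest eigenvalue of this (non-symmetric) matrix and, for instance, sum its entries over the edges pointing into each node to get a per-node score to threshold.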
And then there are a bunch of questions. For one, I am not entirely sure that stochastic block models are a good model for all the applications I briefly mentioned, and I would be interested to know whether they are. You certainly need to allow more flexibility, such as general degree distributions, which Fan Chung has done to some extent in a recent paper with Kamalika Chaudhuri, but it is not clear the model fits the data we are interested in. And there are plenty of other questions: speed of convergence, embedding dimension, and so forth. With this I will stop; here are the references. This is the paper with the results I was describing, and here is the other available proof of the conjecture, by Mossel, Neeman and Sly. Thank you.

[applause]

>> Questions?

>> On a similar model with regular graphs: can one define a block model on regular graphs, and can similar questions be answered?

>> Laurent Massoulie: Yes, you can.

>> You don't have the independence then --

>> Laurent Massoulie: For instance, you could take versions of the configuration model: fix the degrees, sample the numbers of edges that are intra- and inter-community -- there may be different models for doing that; it could be done independently, by coin flips -- and then, once you have decided which edges are inter and which are intra, do a random matching of the half-edges. Here there may not be this intermediate phase, because the regularity may help, but I am not entirely sure. Yuval seems skeptical that the intermediate phase would disappear.

>> I haven't worked with this one. My guess is there would be an intermediate phase.

>> Laurent Massoulie: There is one version where you say each node has, say, three internal neighbors and two external ones, and for that one I think maybe there is no such intermediate phase; if you randomize, maybe the intermediate phase appears.

Another thing we have been thinking about is what can be said with more than two communities, because as you increase the number of communities, things become even more interesting. Something happens at five or more communities: the physicists tell us there is a phase, below a transition point, where all spectral methods will fail, but some non-polynomial methods should still work. So we could try to prove that spectral methods work all the way up to that point. These things are better understood in the case of reconstruction on the tree, on which Yuval has done a lot of very interesting work, but translating what is known on trees to the stochastic block model is already a challenge, and this intermediate phase is quite a mystery.

>> Is the number five specific? Was it known to be five before you --

>> Laurent Massoulie: I think that up to four communities included, this additional phase does not exist, at least in the symmetric situation where you have, say, four communities and two parameters a and b for intra and inter edges. Then there would be no such additional phase.

>> [indiscernible] intermediate or the result -- two immediate [indiscernible].

>> Laurent Massoulie: Right now it is written for four. I am confident that it could be extended, but I have not done that, so I will be cautious.

>> Even in the case of trees, there was a big gap between the case of two [indiscernible], the number of colors, even from two to three. It turned out that the spectral methods were sharp, but it was much, much easier for two than for three. And then the fact that they break down at five [indiscernible] should have ended up here.

>> Laurent Massoulie: Yes. Yes.
So, discussing this with Lenka Zdeborová, one of the authors of the original conjecture, she tells me she thinks this will have an analogue, and that in this hard regime no polynomial-time algorithm should be able to achieve positive overlap, whereas maximum likelihood somehow would. But I do not know the basis for the guess they are making.

>> They believe maximum likelihood would work and no spectral method would; does that generalize to all polynomial [indiscernible]?

>> Laurent Massoulie: Yes, yes, I guess this has to do with --

>> Sophisticated polynomial.

>> Laurent Massoulie: Yes. I asked her, and I guess it has to do with the work they have been doing on the energy landscape of -- I forget the name, but there was an attempt to make message passing work even beyond the known limit, and they came up with the survey propagation method, and in doing so they developed an understanding of the attractors of those iterative schemes. So I guess the intuition comes from there. But anyhow...

>> Yuval Peres: Okay. Thank you again.

[applause]