
>> Yuval Peres: Laurent Massoulie from the MSR-Inria Centre in Paris will tell us about community detection thresholds.
>> Laurent Massoulie: Thank you, Yuval. It's a pleasure to give this talk this morning. So what is it about? Community detection is basically the same as clustering: it's about identifying groups of objects with similar characteristics within a global population. Embedding is a closely related objective: you may want to embed your individuals in a space and then do clustering after having performed the embedding.
If I had to state one application, I would say this is a useful primitive, for instance, for recommending contacts in online social networks: you might process the friendship graph and recommend that users connect to the people that constitute their implicit community.
So much for motivation. Here's another one, but I will not dwell on it. The main character in this talk will be the stochastic block model. It's a random graph model in which we have N nodes, N being large, and each node belongs to some community. Specifically, we sample spins for the nodes in an i.i.d. fashion, each node picking a type out of K different types, and then, conditional on those spins, we decide independently whether to create an edge between two nodes, with a probability that depends on the underlying types: the probability of an edge between nodes u and v is B(sigma_u, sigma_v), a function of the two types, scaled by S over N, where N is the population size and S should be thought of as the signal strength in the observation. It sets the scale of the average degree of the graph.
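As a concrete illustration, here is a minimal sketch of sampling from this model; the function name and argument layout are my own choices, not from the talk:

```python
import numpy as np

def sample_sbm(n, s, B, pi, rng=None):
    """Sample (adjacency matrix, types) from the stochastic block model:
    i.i.d. types with law pi, then independent edges, the edge probability
    between a type-k node and a type-l node being B[k, l] * s / n."""
    rng = np.random.default_rng() if rng is None else rng
    types = rng.choice(len(pi), size=n, p=pi)       # i.i.d. spins
    P = np.asarray(B)[types][:, types] * s / n      # conditional edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)    # independent coins, i < j only
    A = (upper | upper.T).astype(int)               # symmetric adjacency, empty diagonal
    return A, types
```

With K = 2, pi = [0.5, 0.5] and s * B = [[a, b], [b, a]], this is exactly the two-parameter model discussed later in the talk.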
And so what we observe is a realization of this random graph from the stochastic block model, and we'd like to recover the underlying communities as well as possible. Before I say more precisely what I want to cover in the talk, let me recall classical spectral clustering. It consists of processing the adjacency matrix, extracting the eigenvectors corresponding to the largest eigenvalues, and doing an embedding using these eigenvectors after normalizing them: if you pick R eigenvectors, you can embed the nodes in R-dimensional Euclidean space and then do k-means clustering, or whatnot, on that basis.
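A minimal sketch of that procedure, assuming numpy, scipy, and scikit-learn are available (the helper name is illustrative):

```python
import numpy as np
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def classical_spectral_clustering(A, r, k):
    """Embed nodes via the r leading eigenvectors of the adjacency matrix A,
    normalize the rows, then run k-means to produce k communities."""
    _, vecs = eigsh(np.asarray(A, dtype=float), k=r, which='LM')  # r largest-magnitude eigenpairs
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    embedding = vecs / np.maximum(norms, 1e-12)                   # normalized r-dimensional embedding
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```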
And what I really want to tell you about today is a phase transition that occurs when you want to do inference about these underlying communities from a very sparse observed graph, that is, with order-one average degree. But before getting to that, I will take a detour and say a few things about the case where we have a few more edges to work with, because this will help me introduce some notions of spectral separation which will be helpful in understanding the remainder.
So let's assume, to start with, that we have what I call a rich signal, that is, the average degree parameter S is at least logarithmic in the system size. In that case, modulo an assumption that you can distinguish the different clusters, that they are statistically distinct (which is not really a big deal, because if they are not, then they should be considered the same cluster), classical spectral clustering will work. This is essentially because the spectrum of the matrix A will consist of a small number of eigenvalues that stand out. Their number R is at most the number of underlying communities, their magnitude is of order S at least, and the remainder of the spectrum is negligible, on the order of the square root of S. So in principle, by putting up a threshold, you can extract the right eigenvalues and do the embedding. Moreover, what is true in this regime is that the representatives of the nodes in this embedding will cluster according to the underlying communities, except perhaps for a vanishingly small fraction of the nodes. Here's a simulation of this setup with four blocks. Since N is finite rather than tending to infinity, there is some scattering around the points where the node representatives are supposed to cluster, but you can still see the phenomenon starting to emerge.
So let me say a few words about why this holds. Recall that the adjacency matrix we work with can be decomposed into its expectation, which has a nice block structure, has low rank (on the order of the number of blocks), and has eigenvalues on the order of the signal strength S, plus the random matrix obtained by centering A, which is added to this expected value matrix.
And what we want to leverage is some kind of spectral separation, essentially saying that the random matrix has a very small spectral radius, which entails that the spectral structure of the perturbed matrix is very close to that of the expected matrix. The prototype for this spectral separation is the so-called Ramanujan property, introduced in the '80s by Lubotzky et al. It is defined for S-regular graphs and says that the second largest eigenvalue ought to be as small as possible, namely on the order of the square root of S. That's the definition of a Ramanujan graph, and we know from Friedman's work in 2008 that the random S-regular graph is, with high probability, almost Ramanujan, in that its second largest eigenvalue is close to the bound in the definition of Ramanujan graphs. What we need in order to establish what I just described is something similar for Erdős–Rényi graphs this time, so not regular graphs. Feige and Ofek in 2005 established such a property so long as the average degree is at least logarithmic: we still have the behavior that the second largest eigenvalue of an Erdős–Rényi graph is on the order of the square root of the average degree. A related result is that if you center the adjacency matrix, then you obtain a perturbation matrix whose spectral radius is on the order of the square root of the average degree. These are all relaxations of the original Ramanujan graph definition, so you could say we have almost Ramanujan graphs, somewhat Ramanujan graphs, and we'll see further weakenings of this definition that will be useful for our purposes.
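For reference, the bounds just described can be written out as follows (standard statements of the Ramanujan, Friedman, and Feige–Ofek results, in the notation of this talk):

$$\lambda_2 \le 2\sqrt{S-1} \ \text{(Ramanujan, $S$-regular)}, \qquad \lambda_2 \le 2\sqrt{S-1} + o(1) \ \text{w.h.p. (Friedman)},$$
$$\|A - \mathbb{E}A\| = O(\sqrt{d}\,) \ \text{w.h.p. for Erdős–Rényi graphs with average degree } d \ge C\log N \ \text{(Feige–Ofek)}.$$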
So with this result in hand, you can go back to the stochastic block model and consider the adjacency matrix after centering. Using Feige–Ofek, you ensure it has spectral radius on the order of the square root of the signal strength. This is small compared to the leading eigenvalues of the expected matrix, which are of order S, so we can indeed say that the spectral structure of the adjacency matrix is close to that of the expected adjacency matrix.
But what we can also say is that if we let the signal strength go down, this breaks down, because the spectral radius of the noise matrix is dominated by the largest degrees, and for order-one average degrees the spectral radius is on the order of the square root of log N over log log N, whereas we may have S as low as one. So classical spectral clustering has to break down for signal strength below the order of log N over log log N. But we may still do things other than classical spectral clustering, and this is what I want to get to, focusing now on weak signal strength and on an interesting phase transition phenomenon in that regime.
So let's now assume that the signal strength is of order one. We know then that we cannot exactly recover the underlying clusters, because there will be isolated nodes, for instance, and there is no way to tell which community an isolated node belongs to. So we have to settle for a less ambitious objective: achieving a good overlap. That is, we guess community labels and make sure that the agreement between the estimated labels and the true underlying labels is as large as possible.
And this is what the overlap metric measures. It counts the fraction of nodes for which we guessed right, with an offset removed to take into account the fact that you could assign the same type to every node, which would not be meaningful anyhow.
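For concreteness, here is one common way to compute such an overlap score (a sketch; the exact offset convention in the talk's definition may differ slightly):

```python
import numpy as np
from itertools import permutations

def overlap(sigma_hat, sigma, K):
    """Fraction of correctly labeled nodes, maximized over relabelings of the
    K communities, with the trivial 1/K baseline subtracted and rescaled so
    that blind guessing scores 0 and perfect recovery scores 1."""
    sigma_hat, sigma = np.asarray(sigma_hat), np.asarray(sigma)
    best = max(np.mean(np.take(perm, sigma_hat) == sigma)
               for perm in permutations(range(K)))
    return (best - 1.0 / K) / (1.0 - 1.0 / K)
```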
So with this definition at hand, what you might expect is that as you reduce the signal strength, the best you can do is achieve some positive overlap, slightly less than one, which decreases continuously until the point where you no longer have a giant component. That would be the naive guess. But it turns out that something more interesting happens: there is a transition point, prior to the disappearance of the giant component, below which the overlap has to be zero even though you still have a giant component. This is the intermediate phase I'm illustrating here. So this is what I want to look at now.
Specializing now to the simplest nontrivial stochastic block model you can imagine: two communities of roughly equal sizes, so the spins are plus or minus signs. And we have just two parameters: A characterizes the probability of an edge internal to a community, so this probability is A over N, and B characterizes the probability of an inter-community edge, which is B over N. In that context, the physicists Decelle et al. in 2011 conjectured that there is a threshold tau, depending on the parameters A and B, such that for tau less than one the overlap has to be zero: you cannot make any meaningful inference about the underlying communities; the signal is simply not useful to that end. This half of the conjecture was proven in 2012 by Mossel, Neeman and Sly. The other half of the conjecture made by Decelle et al. is that if the tau parameter is above one, then positive overlap can be achieved; in the original paper they claimed this could be achieved using belief propagation, so message passing algorithms, and they had numerical evidence that this is indeed the case. There is also a more recent conjecture, by some of the same authors together with Mossel, Neeman and Sly (the Krzakala, Moore, Mossel, Neeman, Sly, Zdeborová and Zhang 2013 paper), where they come up with a spectral algorithm they call spectral redemption, and they conjecture that it can achieve positive overlap when tau is larger than one.
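Although it is not displayed at this point in the talk, for this symmetric two-community model the conjectured threshold has a standard closed form, consistent with the parameters alpha and beta introduced later:

$$\tau = \frac{\beta^2}{\alpha} = \frac{(a-b)^2}{2(a+b)}, \qquad \alpha = \frac{a+b}{2}, \quad \beta = \frac{a-b}{2},$$

so positive overlap is achievable precisely when $(a-b)^2 > 2(a+b)$.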
But until November 2013 there was no proof that the positive part of the conjecture held. Now we are in a much better state of affairs, because we have two proofs: I came up with one, and a week later Mossel, Neeman and Sly posted another, so we have two different methods to achieve positive overlap above the transition point. So let me now tell you how my proof works and what method is used to achieve positive overlap above the transition point. This is done using a modified spectral method, and the key is to introduce the right matrix on which to do the spectral clustering. We no longer work directly with the original adjacency matrix; instead we construct a matrix which, somehow, counts adjacency at a distance. More precisely, I take a path length parameter L, and for each pair of nodes i and j I count the number of self-avoiding paths of length L in the graph between i and j. This is what the matrix B is about. The typical situation is that a node i has a tree-like local neighborhood, in which case B_ij is one if and only if the graph distance between i and j is precisely L. That is the typical case. But if there are cycles it can be different: in the second case here, for instance, you may have two self-avoiding paths of length L between i and j. And you may ignore the third case for now.
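Here is a direct, brute-force sketch of this matrix; it enumerates paths explicitly and is exponential in L, so it is only workable on small graphs (the talk takes L logarithmic in N), but it pins down the definition:

```python
import numpy as np

def self_avoiding_path_matrix(G, L):
    """B[i, j] = number of self-avoiding paths with exactly L edges between
    nodes i and j, for a networkx-style graph G (brute-force enumeration)."""
    nodes = list(G.nodes)
    idx = {v: k for k, v in enumerate(nodes)}
    B = np.zeros((len(nodes), len(nodes)))

    def extend(path):
        if len(path) == L + 1:                 # path already has L edges
            B[idx[path[0]], idx[path[-1]]] += 1
            return
        for w in G.neighbors(path[-1]):
            if w not in path:                  # self-avoidance: never revisit a node
                extend(path + [w])

    for v in nodes:
        extend([v])
    return B
```

On a tree-like neighborhood, B[i, j] is just the indicator that i and j are at graph distance L, as described above.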
So the main result is about the spectral structure of this matrix, which then implies that we can do some clustering and achieve a positive overlap. If we pick the path length L to be logarithmic in the system size, then the spectral structure of this matrix is as follows. There is a leading eigenvalue of order alpha to the L, where alpha is the key parameter (A plus B) over 2 of this model. There is a second eigenvalue of order beta to the L, where beta is the other key parameter, (A minus B) over 2. We also know that the corresponding eigenvectors are aligned with vectors we know quite well: for the first eigenvalue, the eigenvector is aligned with the vector obtained by applying the matrix B to the all-ones vector; for the second eigenvalue, the eigenvector is aligned with the vector obtained by applying the matrix B to the spin vector. And the third and remaining eigenvalues are essentially O of the square root of alpha to the L. So -- yes?
>> What's the dependence on the constant C, if you choose L as C times log N? Is this true for all C?
>> Laurent Massoulie: I need C to be positive, but I need C times log alpha to be less than one-fourth. So there is a constraint on C here, which has to do with the presence of cycles in the neighborhoods of the nodes.
But within that range of parameter values, I can cope. And the fact that the third eigenvalue, up to N to the epsilon factors for an arbitrary positive epsilon, is of the order of the square root of the first is what I call a weak Ramanujan property. The final statement in the main result is that the second eigenvector correlates with the underlying communities, so I can do thresholding on the second eigenvector and achieve my community detection. All right. So what I want to do in the remainder of the talk is describe the key ingredients of the proof and then conclude. This slide is just an illustration that it seems to work in practice: I tried it out, and here you should see the overlap become positive. It's not completely obvious; maybe I should do more simulations to confirm the theory. But at least it does not disprove it.
>> Confirm the simulations?
>> Laurent Massoulie: So the key step is to introduce a matrix expansion. In order to do that, I introduce the expectation of the adjacency matrix conditional on the spins. It's a simple matrix, essentially of rank two if you ignore the diagonal terms, so it can be expressed in terms of the all-ones vector and the vector of spins, in this way.
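Concretely, with spins $\sigma_u \in \{\pm 1\}$ and intra/inter edge probabilities $a/N$ and $b/N$, the conditional expectation is, up to diagonal terms,

$$\mathbb{E}[A \mid \sigma] = \frac{\alpha}{N}\,\mathbf{1}\mathbf{1}^{\top} + \frac{\beta}{N}\,\sigma\sigma^{\top}, \qquad \alpha = \frac{a+b}{2}, \quad \beta = \frac{a-b}{2},$$

a rank-two matrix spanned by the all-ones vector and the spin vector.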
Based on this, it is useful to introduce what I call a centered path matrix, constructed just as the matrix B was from the original adjacency matrix, but now from the centered adjacency matrix: its (i, j) entry is the sum, over self-avoiding paths between i and j, of the products of the centered edge terms. Once I have this at hand, I can write an expansion: I expand those products and group them according to the place at which the last leading centered factor appears. What you see by doing that is that after this last centered factor you have only plain adjacency terms, and since we are considering self-avoiding paths, these give essentially terms corresponding to the matrix B but with a shorter path length. So you essentially find that your matrix B is this perturbation matrix plus an expansion involving matrices B with lower indices and perturbation matrices with lower indices.
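Schematically, and suppressing the corrections needed to keep the paths self-avoiding (this is my rendering of the shape of the expansion, not an exact quote of the paper), writing $A = \bar{A} + \overline{E}$ with $\overline{E} = \mathbb{E}[A \mid \sigma]$ along each edge of a path and grouping the resulting products gives something like

$$B_\ell \;=\; \Delta_\ell \;+\; \sum_{m=1}^{\ell} \Delta_{m-1}\,\overline{E}\,B_{\ell-m} \;+\; \text{corrections},$$

where $\Delta_m$ denotes the centered path matrix with path length $m$.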
So this is the first step, and it is crucial, because we can do some work on this Delta matrix, whereas working with the B matrix directly is hard. Indeed, we can use classical tools for controlling the spectral radii of random matrices to control this Delta matrix, basically the trace method: we look at the trace of the matrix raised to some power and use combinatorial arguments, leveraging Füredi and Komlós's work from the '80s. There is a paper by Füredi and Komlós which does this kind of control; we throw in the additional ingredient that we are considering self-avoiding paths, and this gives us control of the spectral radius in the end. Essentially, ignoring the first term, this perturbation matrix has spectral radius on the order of the square root of the average degree raised to the L. So this is one key ingredient.
And then the second ingredient is what you could think of as a local analysis; we now need to work with the local structure of the neighborhoods in this graph. The point here is to show that the sizes of the neighborhoods at distance t, as well as the sums of spins at distance t, follow some kind of quasi-deterministic growth pattern. This is what is written here in red: the number of neighbors at distance t is roughly the number at distance t minus one scaled by a constant, plus some perturbation, and similarly for the sum of spins.
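In the spirit of the branching-process approximation of local neighborhoods (my paraphrase of the formulas in red on the slide), writing $\partial_t(i)$ for the set of nodes at distance $t$ from $i$:

$$|\partial_t(i)| \approx \alpha\,|\partial_{t-1}(i)| + \text{noise}, \qquad \sum_{j \in \partial_t(i)} \sigma_j \approx \beta \sum_{j \in \partial_{t-1}(i)} \sigma_j + \text{noise},$$

so neighborhood sizes grow roughly like $\alpha^t$ while spin sums grow like $\beta^t$, which is what aligns $B\mathbf{1}$ with the $\alpha^L$ eigenvalue and $B\sigma$ with the $\beta^L$ eigenvalue.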
So this is an intermediate step; I'm nearly done. I'll put it together with the control of the spectral radius on the next slide. Now, if you look at the supremum, over norm-one vectors that are also orthogonal to the candidate eigenvectors B applied to the all-ones vector and B applied to the spin vector, of the terms appearing in the expansion applied to such vectors, then you can control it: you get a square-root term, but also an alpha to the m over 2 term. The reason this holds is that these candidate vectors are, in a precise sense, close to the vectors of neighborhood sizes and spin sums at given distances. So if you force x to be orthogonal to these two things, then it is orthogonal to the first terms here, and you are left with the perturbation terms only. And so let me put everything together now. We can show that if we restrict ourselves to unit vectors orthogonal to the candidate eigenvectors, then B times x has a norm on the order we are interested in, that is, on the order of the square root of the average degree raised to the L. Basically this is a combination of the ingredients I have just given: the expansion, the controls on the spectral radii of the coefficients in the expansion, and this last fact. Putting it all together gives the result.
The rest is much easier. This was the most laborious step in the proof; you can conclude by leveraging standard eigenvalue perturbation results and controlling the norms of the candidate eigenvectors applied to this B matrix, and this allows us to conclude about the spectral structure. There is more work to be done to show that the second eigenvector correlates positively with the underlying community structure, but this is again the local analysis: working on the local neighborhoods of nodes, characterizing how they behave, and relating this to the random tree model that is the natural model for those neighborhoods.
So I'm done; let me conclude now and mention some outlook. The key message is that you can recover this Ramanujan-like spectral separation by using these kinds of path expansion techniques, working with the matrix B rather than with the original adjacency matrix. This may have consequences beyond this highly stylized model. For instance, we have a generalization of the conjecture to a model with not just spins but also labels on the edges, something we introduced motivated by the Netflix prize dataset, where the edges would be between a movie and a user and would be labeled with the number of stars that the user gave as a rating to that movie.
So we have a generalization of this threshold phenomenon for those labeled stochastic block models, and work needs to be done to see whether we can generalize this path expansion to prove the generalized conjecture. This technique may also be used to prove the other conjecture, made more recently in the Krzakala et al. paper. I've not said what spectral redemption is, but I can describe it in a nutshell. The way they propose to identify the communities is to form an edge-to-edge matrix: you connect an edge to another edge in this matrix if they have a common endpoint. These are oriented edges, so the head of the input edge should be the tail of the output edge, and you prevent backtracking: you cannot go back along the edge you came from. So this defines an edge-to-edge, non-backtracking matrix.
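A minimal sketch of that edge-to-edge construction, often called the non-backtracking matrix (the naming and data layout here are my own):

```python
import numpy as np

def nonbacktracking_matrix(edges):
    """Matrix on oriented edges (darts): entry ((u, v), (x, w)) is 1 when
    v == x and w != u, i.e. the head of the input edge is the tail of the
    output edge and we never go straight back along the same edge."""
    darts = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    index = {d: k for k, d in enumerate(darts)}
    m = len(darts)
    NB = np.zeros((m, m))
    for (u, v) in darts:
        for (x, w) in darts:
            if x == v and w != u:              # chain u -> v -> w, no backtracking
                NB[index[(u, v)], index[(x, w)]] = 1.0
    return NB
```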
Their conjecture is that, above the threshold, this matrix has a second eigenvalue that stands out, and its eigenvector can be leveraged similarly to what I've been describing. So there is hope that we could use the same method I used, with the matrix expansion and the trace bound, to establish this. And then there are a bunch of questions, like: I'm not entirely sure that stochastic block models are a good model for all the applications I briefly mentioned.
I would be interested in knowing whether this is the case. For one thing, you certainly need to allow more flexibility, such as general degree distributions, which Fan Chung has done to some extent in a recent paper with Kamalika Chaudhuri, but it's not clear this is a good model for the data we're interested in. And then there are plenty of other questions: speed of convergence, embedding dimension, and so forth. So with this I will stop, and here are the references. This is the paper with the results I was describing, and here is the other available proof of this conjecture, by Mossel, Neeman and Sly. Thank you. [applause]
>> Questions.
>> So, a similar model with regular degrees: can one define some block model on a regular graph, and are there similar questions that can be answered?
>> Laurent Massoulie: Yes, you can.
>> You don't have this independence property --
>> Laurent Massoulie: For instance, you could have some versions of the configuration model: fix the degrees, then sample the numbers of intra- and inter-community edges. There may be different models for doing that; it could be done by throwing coins independently to determine this. And then, once you have decided which edges are inter and which are intra, you do a random matching of the half-edges. And here there may not be this intermediate phase, because the regularity may help, but I'm not entirely sure. Yuval seems to be skeptical that there would be no such intermediate phase.
>> I haven't worked this one out. My guess is there would be an intermediate phase.
>> Laurent Massoulie: So there is one version where you say each node will have three internal neighbors and two external ones, and for that one I think maybe there is no such intermediate phase. If you randomize, maybe that makes the intermediate phase appear.
Another thing we have been thinking about is what can be said when you have more than two communities, because as you increase the number of communities, things become even more interesting. Something happens when you have five or more communities: there is a phase where the physicists tell us that below a transition point all spectral methods will fail, but still some nonpolynomial methods should work. So we could try to prove that we have spectral methods working all the way up to that point. These things are better understood in the case of reconstruction on the tree, on which Yuval has done lots of very interesting work. But translating what is known on trees to the stochastic block model is already a challenge, and this intermediate phase is quite a mystery.
>> Is the number five specific, or was it known before you --
>> Laurent Massoulie: I think that up to four communities included, this additional phase does not exist, at least in the symmetric situation where you have, say, four communities and two parameters A and B for intra and inter edges. Then there would not be such an additional phase.
>> [indiscernible] intermediate or the result -- two [indiscernible].
>> Laurent Massoulie: Right now it is written for four. I'm confident that it could be extended, but I have not done that, so I'll be cautious.
>> Even in the case of trees, there was a big gap between the case of two [indiscernible] and higher numbers of colors. Even from two to three: it turned out that the spectral methods were sharp, but it was much, much easier for two than for three. And then there's the fact that they break down at five. [indiscernible] should have ended up here.
>> Laurent Massoulie: Yes. Yes. Discussing with Lenka Zdeborová, one of the authors of the original conjecture: she tells me she thinks this will have an analog, and that in this hard range no polynomial-time algorithm should be able to achieve positive overlap, whereas maximum likelihood somehow would. But I don't know what the basis for this guess of theirs is.
>> They believe that maximum likelihood would work and that no spectral methods would, generalized to all polynomial [indiscernible].
>> Laurent Massoulie: Yes, yes, I guess this has to do with --
>> Sophisticated polynomial [methods].
>> Laurent Massoulie: Yes, I asked her, and I guess it has to do with the work on the energy landscape that they have been doing on the -- I forget the name, but there was an attempt to make message passing work even beyond the known limit, and they came up with the survey propagation method, and so they developed an understanding of the attractors of those iterative schemes. So I guess this comes from there. But anyhow....
>> Yuval Peres: Okay. Thank you again.
[applause]