Community detection algorithms: a comparative analysis
Santo Fortunato
“Communities”
More links “inside” than “outside”
Graphs are “sparse”
[Figure: example networks: metabolic, protein-protein, social, economic]
Problems
• Confusion about the main concepts: community,
partition, null models
• (Too) Many algorithms around
• How shall we test them?
Testing a method means applying it to graphs with
known community structure (benchmarks)
Benchmarks are then based on an implicit definition of
community
Ideally algorithms have to be based on the same
definition/principle, otherwise there is inconsistency
The planted l-partition model
(Condon & Karp, 1999)
n nodes, l equal-sized groups with g=n/l nodes
p = probability that two nodes in the same group are
connected
q = probability that two nodes in different groups are
connected
If p>q, communities are there!
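A minimal sketch of how such a graph could be generated, using only the Python standard library. The function name `planted_partition` and the `seed` argument are illustrative assumptions, not part of the original model description.

```python
import itertools
import random


def planted_partition(n, l, p, q, seed=None):
    """Sample a planted l-partition graph.

    Returns (edges, group), where group[node] is the index of the
    planted group of that node. Hypothetical helper for illustration.
    """
    if n % l != 0:
        raise ValueError("n must be divisible by l (equal-sized groups)")
    rng = random.Random(seed)
    g = n // l  # nodes per group
    group = [node // g for node in range(n)]
    edges = []
    for u, v in itertools.combinations(range(n), 2):
        # Same group: link with probability p; different groups: q.
        prob = p if group[u] == group[v] else q
        if rng.random() < prob:
            edges.append((u, v))
    return edges, group
```

With p = 1 and q = 0 the result is l disconnected cliques, the extreme case of "communities are there".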
Benchmark of Girvan & Newman
128 nodes, 4 groups, average degree 16
All nodes have the same degree
Special case of planted l-partition model, with n=128,
l=4, g=32
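Read as a planted l-partition model, the average degree fixes one relation between p and q: p(g-1) + q(n-g) = 16, that is 31p + 96q = 16. The sketch below works out p and q from an expected external degree. The name `z_out` and the function name are assumptions introduced for illustration.

```python
def gn_link_probabilities(z_out, n=128, l=4, average_degree=16):
    """Return (p, q) for a GN-style benchmark.

    z_out is the expected number of links from a node to other groups
    (hypothetical parameter name); the remaining degree is internal.
    """
    if not 0 <= z_out <= average_degree:
        raise ValueError("z_out must lie between 0 and the average degree")
    g = n // l                            # 32 nodes per group
    z_in = average_degree - z_out         # expected internal degree
    p = z_in / (g - 1)                    # 31 possible internal partners
    q = z_out / (n - g)                   # 96 possible external partners
    return p, q
```

For example, `gn_link_probabilities(8)` gives p = 8/31 and q = 8/96, so p > q still holds even when half of each node's links point outside its group.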
Problems with GN benchmark
• All nodes have the same degree
• All communities have equal size
In real networks the distributions of degree and
community size are highly heterogeneous!
New benchmark (A. Lancichinetti, S. F., F.
Radicchi, Phys. Rev. E 78, 046110, 2008)
• Power law distribution of degree
• Power law distribution of community size
• A mixing parameter μt sets the ratio between the
external and the total degree of each node
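To make the mixing parameter concrete, here is a small sketch that measures it on a given graph: for each node, the number of links to other communities divided by its total degree. The function name and the input layout (an edge list plus a node-to-community dict) are assumptions for illustration; this is not the benchmark generator itself.

```python
def mixing_parameters(edges, community):
    """Return {node: external degree / total degree}.

    community maps each node to its community label.
    Isolated nodes are skipped because the ratio is undefined.
    """
    external = {node: 0 for node in community}
    total = {node: 0 for node in community}
    for u, v in edges:
        total[u] += 1
        total[v] += 1
        if community[u] != community[v]:
            external[u] += 1
            external[v] += 1
    return {node: external[node] / total[node]
            for node in community if total[node] > 0}
```

On a path 0-1-2-3 split into {0, 1} and {2, 3}, the two end nodes have ratio 0 and the two middle nodes have ratio 0.5.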
The benchmark can be extended to directed and
weighted networks with overlapping communities
(A. Lancichinetti, S. F., Phys. Rev. E 80, 016118, 2009)
The software to produce all new benchmarks is here:
http://santo.fortunato.googlepages.com/inthepress2
Computer time
Comparing partitions:
normalized mutual information
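x_i, y_i: community assignments of node i in partitions X and Y
P(X=x) = n_x/n, P(Y=y) = n_y/n
Joint distribution: P(X=x, Y=y) = n_{xy}/n
Shannon entropy of X:
$H(X) = -\sum_x P(x) \log P(x)$
Shannon conditional entropy of X given Y:
$H(X|Y) = -\sum_{x,y} P(x,y) \log P(x|y)$
Mutual information:
$I(X,Y) = H(X) - H(X|Y)$
Problem: the mutual information is identical for all Y
which are subpartitions of X (then H(X|Y) = 0 and I(X,Y) = H(X))
To avoid that: normalized mutual information
$I_{norm}(X,Y) = \frac{2 I(X,Y)}{H(X) + H(Y)}$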
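A short sketch of the comparison in Python, assuming the common normalization 2 I(X,Y) / (H(X) + H(Y)) and using I(X,Y) = sum over (x,y) of P(x,y) log[P(x,y) / (P(x) P(y))], which equals H(X) - H(X|Y). Function names are illustrative.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy of a partition given as a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """I(X,Y) from the counts n_x, n_y and n_xy."""
    n = len(x)
    nx, ny, nxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (nx[a] * ny[b]))
               for (a, b), c in nxy.items())


def normalized_mutual_information(x, y):
    """2 I(X,Y) / (H(X) + H(Y)); 1.0 for identical partitions."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 1.0  # both partitions put all nodes in one community
    return 2 * mutual_information(x, y) / (hx + hy)
```

With x = [0, 0, 1, 1] and the subpartition y = [0, 1, 2, 3], the plain mutual information equals H(X), exactly as for y = x, while the normalized value drops from 1 to 2/3.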
What is the best algorithm? A comparative analysis
(A. Lancichinetti, S.F., Phys. Rev. E 80, 056117, 2009)
Divisive algorithms
Principle: one removes the links that connect the clusters,
until the latter are isolated
How to identify intercommunity links?
1) Edge-betweenness (M. Girvan & M. E. J. Newman, PNAS 99, 7821-7826, 2002)
2) Edge clustering coefficient (F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, PNAS 101, 2658, 2004)
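As an illustration of the second criterion, the sketch below computes a triangle-based edge clustering coefficient, assuming the form (z_ij + 1) / min(k_i - 1, k_j - 1), where z_ij is the number of triangles containing the link and k_i, k_j are the endpoint degrees. Links between communities tend to sit in few triangles and so get low values. The function name is hypothetical, and the input is assumed to be a simple graph (no self-loops, no repeated links).

```python
from collections import defaultdict


def edge_clustering_coefficients(edges):
    """Return {(u, v): coefficient} for every link of a simple graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    coefficients = {}
    for u, v in edges:
        # Triangles through (u, v) = common neighbors of u and v.
        triangles = len(neighbors[u] & neighbors[v])
        # Largest number of triangles the link could belong to.
        possible = min(len(neighbors[u]) - 1, len(neighbors[v]) - 1)
        if possible > 0:
            coefficients[(u, v)] = (triangles + 1) / possible
        else:
            coefficients[(u, v)] = float("inf")  # link to a leaf node
    return coefficients
```

A divisive procedure along the lines described above would repeatedly remove the link with the lowest coefficient and recompute; that loop is left out here.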
Modularity
Newman & Girvan, Phys. Rev. E 69, 026113, 2004
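$Q = \frac{1}{m} \sum_i \left( l_i - \frac{d_i^2}{4m} \right)$
$l_i$ = # links in module i
$d_i^2/4m$ = expected # of links in module i
(m = total # of links, d_i = total degree of module i)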
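A small sketch of the computation, assuming the usual form of modularity in which the expected number of links in module i is d_i^2 / 4m (d_i = sum of the degrees in module i, m = total number of links), so that Q = (1/m) sum_i (l_i - d_i^2 / 4m). The symbols d_i and m and the function name are assumptions introduced here.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity of a partition; community maps node -> module label."""
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph without links")
    links_inside = defaultdict(int)  # l_i
    degree_sum = defaultdict(int)    # d_i
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            links_inside[community[u]] += 1
    return sum(links_inside[c] - degree_sum[c] ** 2 / (4 * m)
               for c in degree_sum) / m
```

For two triangles joined by a single link (m = 7), splitting into the two triangles gives l_i = 3 and d_i = 7 for each module, hence Q = 2.5/7, while putting all nodes in one module gives Q = 0.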
Infomap
(Rosvall & Bergstrom, PNAS 105, 1118, 2008)
Best partition: the one with minimum description length.
The optimization can be carried out with simulated annealing, greedy
methods, etc.
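To show what is being minimized, here is a sketch of the two-level description length (the map equation) for an undirected, unweighted graph. It assumes the standard form L = q log q - 2 sum_i q_i log q_i - sum_a p_a log p_a + sum_i (q_i + p_i) log(q_i + p_i), with p_a = k_a / 2m the visit rate of node a, q_i the exit rate of module i (links leaving module i divided by 2m), q = sum_i q_i, and p_i the total visit rate of module i. Only the evaluation of a given partition is shown, not the optimization; all names are illustrative.

```python
from collections import defaultdict
from math import log2


def plogp(p):
    """p * log2(p), with the convention 0 * log 0 = 0."""
    return p * log2(p) if p > 0 else 0.0


def description_length(edges, module):
    """Two-level description length (bits per step) of a partition."""
    two_m = 2 * len(edges)
    visit = defaultdict(float)      # p_a: node visit rates
    exit_rate = defaultdict(float)  # q_i: module exit rates
    for u, v in edges:
        visit[u] += 1 / two_m
        visit[v] += 1 / two_m
        if module[u] != module[v]:
            exit_rate[module[u]] += 1 / two_m
            exit_rate[module[v]] += 1 / two_m
    module_visit = defaultdict(float)  # p_i: module visit rates
    for node, p in visit.items():
        module_visit[module[node]] += p
    q_total = sum(exit_rate.values())
    return (plogp(q_total)
            - 2 * sum(plogp(q) for q in exit_rate.values())
            - sum(plogp(p) for p in visit.values())
            + sum(plogp(exit_rate[i] + p) for i, p in module_visit.items()))
```

On two triangles joined by one link, the partition into the two triangles has a shorter description length than the single-module partition, so it would be preferred by any of the optimization strategies mentioned above.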
Clique Percolation Method
Palla, Derényi, Farkas & Vicsek, Nature 435, 814, (2005)
Principle: in a graph with community structure there
are many cliques within the clusters
Cliques can be used as probes to explore the graph:
1) Two k-cliques are neighbors if they share a
(k-1)-clique
2) One can travel along paths of neighboring cliques
Cliques may be trapped within clusters, which can
then be identified
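The steps above can be sketched for small graphs as follows: enumerate all k-cliques by brute force, join two k-cliques when they share k-1 nodes, and report the node set of each connected group of cliques. This is an illustrative, exponential-time sketch with a hypothetical function name, not the authors' implementation.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Return a list of node sets, one per chain of adjacent k-cliques."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Brute-force enumeration of k-cliques (fine for tiny graphs only).
    cliques = [frozenset(c) for c in combinations(sorted(neighbors), k)
               if all(b in neighbors[a] for a, b in combinations(c, 2))]
    parent = list(range(len(cliques)))  # union-find over cliques

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        # Two k-cliques are neighbors if they share k-1 nodes.
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return list(communities.values())
```

For triangles {0,1,2} and {1,2,3} sharing a link, plus a triangle {3,4,5} that shares only node 3, k = 3 yields the communities {0,1,2,3} and {3,4,5}; node 3 belongs to both, which illustrates how the method produces overlapping communities.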
Clique percolation method
What is the best algorithm? A comparative analysis
(A. Lancichinetti, S.F., Phys. Rev. E 80, 056117, 2009)
Tests on GN benchmark
Tests on LFR benchmark
(undirected, unweighted)
Tests on random graphs
Outlook
Agreement on how to test algorithms is more crucial than
designing algorithms!
• New benchmark graphs based on planted l-partition
model (true community definition?): weighted/unweighted,
directed/undirected and with overlapping communities
• Comparative analysis of existing methods on new
benchmarks: the method by Rosvall and Bergstrom
(PNAS 105, 1118, 2008) is the best. It is very good on the new
benchmarks, it also recognizes random graphs (if the
average degree is not too small), and it is fast as well!
• Warning: benchmarks are characterized by “flat”
clustering, there is no hierarchy! Low clustering
coefficient too (work in progress)
• Crucial issue for the future: proper definition of
hierarchical community structure and corresponding tests!
S. F., arXiv:0906.0612, Physics Reports 486, 75-174 (2010)