Community Detection - UBC Department of Computer Science

advertisement
Community Detection
Laks V.S. Lakshmanan
(based on Girvan & Newman. Finding and evaluating community
structure in networks. Physical Review E 69, 026113 (2004).
M. E. J. Newman. Fast algorithm for detecting community structure
in networks. Physical Review E 69, 066133 (2004).
The Problem


Can we partition the network into groups
s.t. the inter-group edges are sparse while
the intra-group edges are dense?
Why is it interesting/useful?
◦ Understanding comm. structure – means to
understanding n/w structure.
◦ Graph partitioning – similar problem; graph of
processes, edges=communication; assign subgraphs to processors to minimize inter-processor
comm. & balance processor load. (NP-hard in
general.)
◦ Diff. w/ graph partitioning.
An Example with Three
Communities
A Hierarchical Clustering Approach
1.
2.
3.
4.
5.
6.
Define a notion of similarity or affinity between
nodes.
E.g.: 𝑠𝑖𝑚(𝑥, 𝑦) := #node-disjoint paths between
𝑥 and 𝑦.
𝑠𝑖𝑚(𝑥, 𝑦) := #edge-disjoint paths between
𝑥 and 𝑦.
𝑠𝑖𝑚(𝑥, 𝑦) := weighted sum of all paths, with
longer paths weighted down, e.g., Katz!
Qn: how can we compute #2, 3 fast?
(Efficient algorithms for Katz have been
developed.)
Community detection via
hierarchical clustering
Compute all pairwise node similarities for
every edge present.
 Repeatedly add edges with greatest
similarity.
  leads to a tree (called dendrogram).
 A slice throguh the dendogram
represents a clustering or comm.
structure.

Dendrogram example
Limitations of HC approach
“Misplaces” nodes in the periphery.
1
 E.g.:
5

2
4
3
Which community should 5 belong to?
 Alternative approach based on “edge
betweenness”.
Key Intuition
An inter-comm. edge has a higher
“betweenness” compared to an intracomm. edge, i.e., more paths between
node pairs pass through it.
 Start with G.
 Repeatedly remove edges with highest
betweenness until <some stopping
criterion>.
 Communities = resulting components.

Basic Algorithm

repeat {
◦ Calculate betweenness of all edges;
◦ Remove one with highest betweenness, breaking
ties arbitrarily; }


Until no edges left.
Remarks:
◦ Which betweenness score?
◦ Calculate upfront and reuse or recalculate?
◦ Can we incrementally recalculate after each edge
removal?
◦ Related algorithms for node betweenness by
Newman and Brandes.
A Real Example (Zachary’s Karate Club)
With recalculation of betweenness.
Without recalculation of betweenness.
Scalability Issues
Edge betweenness for all edges can be
computed in time 𝑂(𝑚𝑛) (𝑚=#edges,
𝑛=#nodes). [Newman 2001] – details
soon.
 Recalculation makes algorithm 𝑂(𝑚2𝑛),
so not feasible for large networks.

Computing edge betweenness

An Example
b
d
a
g
c
e
Compute
#geodesics from
every node to g.
f
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
b
d
d=0
w=1
a
c
g
e
f
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
b
d=1
w=1
d
d=0
w=1
a
c
e
d=1
w=1
g
f
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
d=2
w=2
b
d=1
w=1
d
d=0
w=1
a
c
d=2
w=2
e
d=1
w=1
f
g
d=2
w=2
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
d=2
w=2
d=3
w=4
b
d=1
w=1
d
d=0
w=1
a
c
d=2
w=2
e
d=1
w=1
f
g
Have all info.
we need for
edge
betweenness
now.
d=2
w=2
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
d=2
w=2
d=3 2/4
w=4
b
d=1
w=1
d
1/2
a
2/4
d=0
w=1
c
d=2
w=2
e
d=1
w=1
1/2
f
g
Note: a and f
are like leaves:
no geodesic
to g from other
nodes passes
through them.
d=2
w=2
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
d=2
w=2
d=3 2/4
w=4
b
d=1
w=1
½(1+2/4) d
1/2
a
2/4
d=0
w=1
c
½(1+2/4)
d=2
w=2
e
d=1
w=1
1/2
f
g
Note: a and f
are like leaves:
no geodesic
to g from other
nodes passes
through them.
d=2
w=2
Breadth-first search – means for doing many
things.
Computing edge betweenness

An Example
d=2
w=2
d=3 2/4
w=4
b
1/1[ 1+½(1+2/4)+1/2(1+2/4)+1/2]
d=1
w=1
½(1+2/4) d
1/2
a
2/4
d=0
w=1
c
½(1+2/4)
d=2
w=2
e
d=1
w=1
1/2
f
g
Note: a and f
are like leaves:
no geodesic
to g from other
nodes passes
through them.
d=2
w=2
Breadth-first search – means for doing many
things.
EB Computation summary
For any one target node, compute
weights of nodes by BFS; 𝑤(𝑥) =
#geodesics from 𝑥 to target.
 Suppose 𝑥 𝑦 rest of 𝐺 (containing
target).
 Then intuitively, 𝑤(𝑦)/𝑤(𝑥) of the
geodesics from 𝑥 to the target node go
through 𝑦.

EB Computation summary (contd.)
For any edge 𝑥𝑦 (𝑥 further from target than 𝑦),
𝑏𝑒𝑡(𝑥𝑦) = 𝑤(𝑦)/𝑤(𝑥)[1 +
𝑠𝑢𝑚 𝑜𝑓 𝑏𝑒𝑡 𝑜𝑓 𝑎𝑙𝑙 𝑒𝑑𝑔𝑒𝑠 “𝑏𝑒𝑓𝑜𝑟𝑒” 𝑡ℎ𝑖𝑠 𝑒𝑑𝑔𝑒].
 The above is wrt a specific target node.
 Overall bet for any edge = sum of bet wrt every
node treated as target node.

EB computation – complexity
analysis
For any one target node, BFS gives bet of
every edge w.r.t. that target node, in
𝑂(𝑚) time.
 Doing so for every node treated as target
node  𝑂(𝑚𝑛) time for final
betweenness score for every edge.
 Quite elegant, but recalculation bumps up
complexity to 𝑂(𝑚2𝑛).
 Need more scalable approaches for CD.

On scaling up CD algorithm

determine intelligently which edges need
their bet recalculated, when an edge is
removed.
◦ When 𝑒 is removed, 𝑏𝑒𝑡(𝑒’) needs to be
recalculated only if 𝑒’ is in the same connected
component as 𝑒.
 For a very large component, doesn’t prune much. 
◦ Perhaps it’s only important to determine the edge
with the next highest bet.

can we maintain enough “state” so that
when 𝑒 is removed, we can recalculate
𝑏𝑒𝑡(𝑒’) incrementally, i.e., not from scratch?
Point to ponder!
Closing Remarks 1/2
Newman also proposed other bases for
defining edge betweenness.
 Electrical current flow through the edge
where every edge is viewed as unit
resistance and we consider all source-sink
pairs.
 Based on random walks.
 Both less effective and more expensive
than geodesics (see paper for details).
 What about directed and weighted cases?

Closing Remarks 2/2
Goodness metric of community division.
Helpful when we don’t know the ground truth.
Q = ∑i (eii – ai2 ), where
Ekxk= matrix of community division: eij = fraction of
edges linking comm. i to comm. j; ai = ∑j eij .
Q measures fraction of intra-comm. edges over
what is expected by chance (assuming uniform
distribution).
See paper for details of experimental results.
 Turns out study of influence/information
propagation can suggest new ways of detecting
communities: will revisit this issue after we study
influence propagation.



Recommended Reading
J. Ruan and W. Zhang. An Efcient Spectral Algorithm for
Network Community Discovery and Its Applications to
Biological and Social Networks. ICDM 2007.
 M. E. J. Newman "Modularity and community structure in
networks", physics/0602124 = Proceedings of the National
Academy of Sciences (USA) 103 (2006): 87577—8582.
 Jure Leskovec, Kevin J. Lang, and Michael W. Mahoney.
Empirical Comparison of Algorithms for Network
Community Detection. WWW 2010.
 M. E. J. Newman. Communities, modules and large-scale
structure in networks. Nature Physics 8, 25–31 (2012)
doi:10.1038/nphys2162 Received 23 September 2011
Accepted 04 November 2011 Published online 22 December
2011.

Download