Full Talk

Community Detection Algorithm and Community Quality Metric
Mingming Chen & Boleslaw K. Szymanski
Department of Computer Science
Rensselaer Polytechnic Institute
Community Structure
• Many networks display community structure
  - Groups of nodes within which connections are denser than between them
Community detection algorithms
Community quality metrics
Two Related Community Detection Topics
• Community detection algorithm
  - LabelRank: a stabilized label propagation community detection algorithm (Xie and Szymanski, 2013).
  - LabelRankT: an extension of LabelRank to dynamic networks (Xie, Chen, and Szymanski, 2013).
• A new community quality metric that solves two problems of Modularity (M. E. J. Newman, 2006; Newman and Girvan, 2004).
LabelRank Algorithm
• Four operators applied to the labels
  - Label propagation operator
  - Inflation operator
  - Cutoff operator
  - Conditional update operator
[Figure: nodes 1 to 4 each hold an answer to the question "NP = P?". Node 1: No; Node 2: No; Node 3: No; Node 4: Yes. With equal weights, P1(No) = 3/4 and P1(Yes) = 1/4, so node 1 answers No. When the edge from node 4 carries weight 97, P1(No) = 3/100 and P1(Yes) = 97/100, so node 1 answers Yes.]
Label Propagation Operator
$$W \times P$$
• where $W$ is the n × n weighted adjacency matrix and $P$ is the n × n label probability distribution matrix, composed of n (1 × n) row vectors $P_i$, one for each node
• Each element $P_i(c)$ holds the current estimate of the probability of node i observing label $c \in C$, where $C$ is the set of labels (here, suppose C = {1, 2, …, n})
  - Ex. $P_i$ = (0.1, 0.2, …, 0.05, …)
• To initialize $P$, each node is assigned a probability distribution over its incoming edges
$$P_i(c) = \frac{w_{ic}}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C \ \text{s.t.}\ w_{ic} \neq 0.$$
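The initialization above amounts to normalizing each row of W. A minimal Python sketch, assuming W is a dense list of lists in which a zero entry means "no edge" and the diagonal holds any self-loop weight; the function name and the 3-node graph are illustrative and not taken from the slides.

```python
def init_label_distribution(W):
    """Row-normalize W so that P[i][c] = w_ic / sum_k w_ik."""
    P = []
    for row in W:
        total = sum(row)
        if total == 0:
            raise ValueError("every node needs at least one weighted edge")
        P.append([w / total for w in row])
    return P


# Hypothetical 3-node weighted graph; diagonal entries are self-loops.
W = [
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 2.0],
    [0.0, 2.0, 1.0],
]
P = init_label_distribution(W)
```

Each row of P then sums to 1, and labels with no edge from node i start with probability 0.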
Label Propagation Operator
• Each node receives the label probability distribution from its neighbors and computes the new distribution
$$P'_i(c) = \frac{\sum_{j \in Nb(i)} w_{ij} P_j(c)}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C.$$
Before propagation:
P1 = (0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0, 0)
P2 = (0.25, 0.25, 0, 0, 0.25, 0.25, 0, 0, 0, 0)
P3 = (0.25, 0, 0.25, 0, 0, 0, 0.25, 0.25, 0, 0)
P4 = (0.25, 0, 0, 0.25, 0, 0, 0, 0, 0.25, 0.25)
After propagation:
P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)
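The update of P1 in this example can be checked with a few lines of Python. This is a sketch under two assumptions that the extracted text does not state explicitly: node 1 has neighbors 2, 3, and 4 plus a self-loop, and all four weights equal 1. The helper name `propagate_node` is illustrative.

```python
def propagate_node(weights, P):
    """New distribution for one node.

    weights: {neighbor j: w_ij} for the node being updated.
    P: {node: list of label probabilities}.
    """
    total = sum(weights.values())
    n_labels = len(next(iter(P.values())))
    return [
        sum(w * P[j][c] for j, w in weights.items()) / total
        for c in range(n_labels)
    ]


P = {
    1: [0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0, 0],
    2: [0.25, 0.25, 0, 0, 0.25, 0.25, 0, 0, 0, 0],
    3: [0.25, 0, 0.25, 0, 0, 0, 0.25, 0.25, 0, 0],
    4: [0.25, 0, 0, 0.25, 0, 0, 0, 0, 0.25, 0.25],
}
# Assumed neighborhood of node 1: itself and nodes 2-4, unit weights.
weights_1 = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
new_p1 = propagate_node(weights_1, P)
```

Under these assumptions `new_p1` equals the updated P1 listed above: 0.25 for label 1, 0.125 for labels 2 to 4, and 0.0625 for labels 5 to 10.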
Inflation Operator
• Each element $P_i(c)$ is raised to the in-th power:
$$\Gamma_{in} P_i(c) = \frac{P_i(c)^{in}}{\sum_{j \in C} P_i(j)^{in}}$$
• It increases the probabilities of high-probability labels and decreases those of low-probability labels during label propagation.
P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)
$\Gamma_{in}$ (in = 2)
P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)
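A minimal Python sketch of the inflation formula as written, applied to the P1 of this example with in = 2; the function name `inflate` is illustrative. Because the formula divides by the sum of the powered entries, the output sums to 1, so the leading entry comes out near 0.471. The 4:1 ratios between successive distinct entries match the values printed on the slide, which appear to be on a different scale.

```python
def inflate(p, power):
    """Raise each probability to `power`, then renormalize to sum to 1."""
    raised = [x ** power for x in p]
    total = sum(raised)
    return [x / total for x in raised]


p1 = [0.25, 0.125, 0.125, 0.125] + [0.0625] * 6
inflated = inflate(p1, 2)
```

A larger exponent sharpens the distribution further, which is what lets the cutoff step discard weak labels.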
Cutoff Operator
• The cutoff operator $\Phi_r$ on $P$ removes labels whose probability is below the threshold $r \in [0, 1]$. It is helped by the inflation operator, which decreases the probabilities of low-probability labels during propagation.
• $\Phi_r$ efficiently reduces the space complexity from quadratic to linear.
P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)
$\Phi_r$ (r = 0.1)
P1 = (0.129)
With r = 0.1, the average number of labels in each node is less than 3.
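The cutoff step can be sketched in Python as a filter that keeps only the surviving labels in a sparse dictionary, which is where the space saving comes from. The function name `cutoff` and the 1-based label numbering are choices made for this illustration.

```python
def cutoff(p, r):
    """Keep labels with probability >= r as a sparse {label: probability} dict."""
    return {label: prob for label, prob in enumerate(p, start=1) if prob >= r}


p1 = [0.129, 0.0323, 0.0323, 0.0323] + [0.00806] * 6
kept = cutoff(p1, 0.1)
```

With the slide's values and r = 0.1, only label 1 survives, so node 1 stores one entry instead of ten.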
Conditional Update Operator
• At each iteration, it updates a node i only when it is significantly different from its incoming neighbors in terms of labels:
$$\sum_{j \in Nb(i)} isSubset(C_i^*, C_j^*) \le q k_i,$$
where $C_i^*$ is the set of maximum-probability labels at node i at the previous step, $isSubset(s_1, s_2)$ returns 1 if $s_1 \subseteq s_2$ and 0 otherwise, $k_i$ is the degree of node i, and $q \in [0, 1]$.
• isSubset can be viewed as a measure of similarity between two nodes.
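The update condition can be sketched in Python as follows. The data layout (sparse `{label: probability}` dictionaries per node and a neighbor list per node) and the function names are assumptions of this illustration, and the degree k_i is taken to be the length of the neighbor list.

```python
def max_labels(dist):
    """Set of labels that hold the maximum probability in a {label: prob} dict."""
    top = max(dist.values())
    return {label for label, prob in dist.items() if prob == top}


def should_update(i, neighbors, prev_P, q):
    """True when few enough neighbors already agree with node i's top labels."""
    ci = max_labels(prev_P[i])
    matches = sum(1 for j in neighbors[i] if ci <= max_labels(prev_P[j]))
    return matches <= q * len(neighbors[i])


# Hypothetical previous-step distributions for four nodes.
prev_P = {
    1: {"a": 0.6, "b": 0.4},
    2: {"a": 0.7, "b": 0.3},
    3: {"b": 0.9, "a": 0.1},
    4: {"b": 0.8, "c": 0.2},
}
neighbors = {1: [2, 3, 4]}
```

Here only neighbor 2 shares node 1's top label, so node 1 is updated for q = 0.5 (1 ≤ 1.5) but left unchanged for q = 0.2 (1 > 0.6).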
Effect of Conditional Update Operator
Running time of LabelRank
• O(Tm): m is the number of edges and T is the number of iterations.
LabelRank is a linear algorithm
Performance of LabelRank
LabelRankT
• It is LabelRank with one extra conditional update rule, by which only nodes involved in changes are updated. Changes are detected by comparing the neighbors of node i at two consecutive steps, $Nb_{t-1}(i)$ and $Nb_t(i)$.
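One way to sketch the change detection in Python is to compare neighbor sets between two snapshots. The snapshot format (a dictionary from node to neighbor set) and the function name are assumptions of this illustration, not details given on the slide.

```python
def changed_nodes(nb_prev, nb_curr):
    """Nodes whose neighbor set differs between two consecutive snapshots."""
    nodes = set(nb_prev) | set(nb_curr)
    return {
        i for i in nodes
        if set(nb_prev.get(i, ())) != set(nb_curr.get(i, ()))
    }


# Hypothetical snapshots: edge 1-3 disappears and edge 3-4 appears.
nb_prev = {1: {2, 3}, 2: {1}, 3: {1}, 4: set()}
nb_curr = {1: {2}, 2: {1}, 3: {4}, 4: {3}}
to_update = changed_nodes(nb_prev, nb_curr)
```

Only the nodes in `to_update` (here 1, 3, and 4) would have their label distributions recomputed at the new time step; node 2 keeps its previous labels.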
Two Problems of Modularity Maximization
• Split large communities
  - Favor small communities
• Resolution limit problem
  - Modularity optimization may fail to discover communities smaller than a scale even in cases where communities are unambiguously defined.
  - This scale depends on the total number of edges in the network and the degree of interconnectedness of the communities.
  - Favor large communities
Fortunato et al., 2008; Li et al., 2008; Arenas et al., 2008; Berry et al., 2009; Good et al., 2010; Ronhovde et al., 2010; Fortunato, 2010; Lancichinetti et al., 2011; Traag et al., 2011; Darst et al., 2013.
Modularity
• Modularity (Q): the fraction of edges falling within communities minus the expected value in an equivalent network with edges placed at random (M. E. J. Newman, 2006)
$$Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j}, \qquad \delta_{c_i, c_j} = \begin{cases} 1 & \text{if nodes } i \text{ and } j \text{ are in the same community,} \\ 0 & \text{otherwise.} \end{cases}$$
• Equivalent definition (Newman and Girvan, 2004)
$$Q = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} \right)^2 \right],$$
$|E_{c_i}^{in}|$: the number of intra edges of community $c_i$;
$|E_{c_i}^{out}|$: the number of inter edges of community $c_i$.
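The agreement between the two definitions of Q can be checked numerically. The Python sketch below assumes a simple undirected, unweighted graph given as an edge list; the function names and the test graph (two 4-node cliques joined by two edges) are chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def modularity_pairwise(edges, community):
    """Q from the sum over ordered node pairs (i, j)."""
    m = len(edges)
    degree = defaultdict(int)
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    total = 0.0
    for i in community:
        for j in community:
            if community[i] == community[j]:
                a_ij = 1 if (i, j) in adjacent else 0
                total += a_ij - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


def modularity_by_community(edges, community):
    """Q from per-community intra and inter edge counts."""
    m = len(edges)
    e_in = defaultdict(int)
    e_out = defaultdict(int)
    for u, v in edges:
        if community[u] == community[v]:
            e_in[community[u]] += 1
        else:
            e_out[community[u]] += 1
            e_out[community[v]] += 1
    return sum(
        e_in[c] / m - ((2 * e_in[c] + e_out[c]) / (2 * m)) ** 2
        for c in set(community.values())
    )


left, right = range(0, 4), range(4, 8)
edges = list(combinations(left, 2)) + list(combinations(right, 2)) + [(0, 4), (1, 5)]
community = {n: 0 for n in left}
community.update({n: 1 for n in right})
q_pair = modularity_pairwise(edges, community)
q_comm = modularity_by_community(edges, community)
```

Both functions return 5/14 ≈ 0.357 for this graph. The community-level form needs only one pass over the edges, while the pairwise form is quadratic in the number of nodes.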
Modularity with Split Penalty
• Modularity (Q): the modularity of the community detection result
$$Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j}.$$
• Split penalty (SP): the fraction of edges that connect nodes of different communities
$$SP = \frac{1}{2|E|} \sum_{ij} A_{ij} (1 - \delta_{c_i, c_j}).$$
• Qs = Q – SP: solves Modularity's problem of favoring small communities
$$Q_s = Q - SP = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij} (1 - \delta_{c_i, c_j}).$$
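Since each inter-community edge appears twice in the sum over ordered pairs, SP reduces to the number of inter-community edges divided by |E|. A minimal Python sketch that returns Q, SP, and Qs in one pass, assuming a simple undirected edge list; the function name and the test graph (two 4-node cliques joined by four edges) are illustrative.

```python
from collections import Counter
from itertools import combinations


def q_sp_qs(edges, community):
    """Return (Q, SP, Qs) for an undirected, unweighted edge list."""
    m = len(edges)
    intra = Counter()   # intra-community edges per community
    deg = Counter()     # sum of node degrees per community
    inter = 0           # edges between different communities
    for u, v in edges:
        deg[community[u]] += 1
        deg[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
        else:
            inter += 1
    q = sum(intra[c] / m - (deg[c] / (2 * m)) ** 2 for c in deg)
    sp = inter / m
    return q, sp, q - sp


left, right = range(0, 4), range(4, 8)
bridges = [(0, 4), (1, 5), (2, 6), (3, 7)]
edges = list(combinations(left, 2)) + list(combinations(right, 2)) + bridges
two = {n: 0 for n in left}
two.update({n: 1 for n in right})
one = {n: 0 for n in range(8)}
split_scores = q_sp_qs(edges, two)
merged_scores = q_sp_qs(edges, one)
```

For this graph the two-community split scores Q = 0.25 and SP = 0.25, so Qs = 0, the same as keeping everything in one community. Q alone would still prefer the split.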
Qs with Community Density
• Resolution limit: Modularity optimization may fail to detect communities smaller than a scale
• Intuitively, put density into Modularity and Split Penalty to solve the resolution limit problem
$$Q_{ds} = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} d_{c_i} - \frac{k_i k_j}{2|E|} d_{c_i}^2 \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij} d_{c_i, c_j} (1 - \delta_{c_i, c_j})$$
$$d_{c_i} = \frac{|E_{c_i}^{in}|}{|c_i| (|c_i| - 1) / 2}, \qquad d_{c_i, c_j} = \frac{|E_{c_i, c_j}|}{|c_i| |c_j|}$$
• Equivalent definition
$$Q_{ds} = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} d_{c_i} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} d_{c_i} \right)^2 - \sum_{c_j \in C,\, c_j \neq c_i} \frac{|E_{c_i, c_j}|}{2|E|} d_{c_i, c_j} \right]$$
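The community-level form of Qds translates directly into code. The Python sketch below assumes a simple undirected edge list and, since the slides do not cover that case, assigns density 0 to single-node communities; the function name and the test graph (two 4-node cliques joined by two edges) are illustrative.

```python
from collections import Counter
from itertools import combinations


def qds(edges, community):
    """Modularity with split penalty and community density (community-level form)."""
    m = len(edges)
    size = Counter(community.values())
    e_in = Counter()
    e_out = Counter()
    e_between = Counter()  # keyed by ordered community pair (ci, cj)
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] += 1
        else:
            e_out[cu] += 1
            e_out[cv] += 1
            e_between[(cu, cv)] += 1
            e_between[(cv, cu)] += 1
    total = 0.0
    for c, n_c in size.items():
        pairs = n_c * (n_c - 1) / 2
        d_c = e_in[c] / pairs if pairs else 0.0  # assumption for singletons
        total += e_in[c] / m * d_c
        total -= ((2 * e_in[c] + e_out[c]) / (2 * m) * d_c) ** 2
    for (ci, cj), e_ij in e_between.items():
        d_ij = e_ij / (size[ci] * size[cj])
        total -= e_ij / (2 * m) * d_ij
    return total


left, right = range(0, 4), range(4, 8)
edges = list(combinations(left, 2)) + list(combinations(right, 2)) + [(0, 4), (1, 5)]
two = {n: 0 for n in left}
two.update({n: 1 for n in right})
one = {n: 0 for n in range(8)}
qds_two = qds(edges, two)
qds_one = qds(edges, one)
```

For this graph the sketch gives about 0.339 for the two-community split and 0.25 for a single community. Unlike Q and Qs, Qds is nonzero for the single community because the density factor no longer cancels.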
Example of Two Well-Separated Communities
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.5              0                    0.5           0.5
1 community     0                0                    0             0.245
Example of Two Weakly Connected Communities
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.357            0.143                0.214         0.339
1 community     0                0                    0             0.25
Ambiguity between One and Two Communities
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.3              0.2                  0.1           0.263
1 community     0                0                    0             0.249
Ambiguity between One and Two Communities
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.25             0.25                 0             0.188
1 community     0                0                    0             0.245
Example of One Well Connected Community
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.167            0.333                -0.167        0.0417
1 community     0                0                    0             0.23
Example of One Very Well Connected Community
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.0455           0.455                -0.409        -0.239
1 community     0                0                    0             0.168
Example of One Complete Graph
Community Quality on a complete graph with 8 nodes
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   -0.0714          0.571                -0.643        -0.643
1 community     0                0                    0             0
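Modularity Has Nothing to Do with #Nodes
$$Q(\text{clique}) = Q(\text{tree}) = 2 \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 \right] = 0.4231;$$
$$Q_s(\text{clique}) = Q_s(\text{tree}) = 2 \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 - \frac{1}{26} \right] = 0.3462;$$
$$Q_{ds}(\text{clique}) = 2 \left[ \frac{12}{26} \cdot 1 - \left( \frac{13}{26} \cdot 1 \right)^2 - \frac{1}{26} \cdot \frac{1}{4 \cdot 4} \right] = 0.4183;$$
$$Q_{ds}(\text{tree}) = 2 \left[ \frac{12}{26} \cdot \frac{2}{7} - \left( \frac{13}{26} \cdot \frac{2}{7} \right)^2 - \frac{1}{26} \cdot \frac{1}{7 \cdot 7} \right] = 0.2214.$$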
5-clique Example
                 Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
30 communities   0.8758           0.09091              0.7848        0.8721
15 communities   0.8879           0.04545              0.8424        0.4305
∆Qs = (0.8424 - 0.7848) = 0.0576 > ∆Q = (0.8879 - 0.8758) = 0.0121
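The figure for this example is not part of the extracted text. Assuming it is the usual ring of 30 five-node cliques in which adjacent cliques are joined by a single edge, the table can be reproduced from per-community counts. The Python sketch below rests on that assumption; the function name and the `group` parameter (how many consecutive cliques form one community) are illustrative.

```python
def ring_of_cliques_scores(n_cliques, clique_size, group):
    """(Q, SP, Qs, Qds) when every `group` consecutive cliques form a community.

    Assumes a ring with one edge between adjacent cliques and at least
    three communities, so each community has two distinct neighbors.
    """
    clique_edges = clique_size * (clique_size - 1) // 2
    m = n_cliques * clique_edges + n_cliques      # clique edges + ring edges
    n_comm = n_cliques // group
    nodes = group * clique_size                   # nodes per community
    e_in = group * clique_edges + (group - 1)     # includes ring edges inside
    e_out = 2                                     # one edge to each neighbor
    d = e_in / (nodes * (nodes - 1) / 2)          # internal density
    d_between = 1 / (nodes * nodes)               # density of one bridging edge
    degree_share = (2 * e_in + e_out) / (2 * m)
    q = n_comm * (e_in / m - degree_share ** 2)
    sp = n_comm * e_out / (2 * m)
    qds = n_comm * (
        e_in / m * d
        - (degree_share * d) ** 2
        - e_out * d_between / (2 * m)
    )
    return q, sp, q - sp, qds


scores_30 = ring_of_cliques_scores(30, 5, group=1)
scores_15 = ring_of_cliques_scores(30, 5, group=2)
```

Under the stated assumption the sketch agrees with the table to the printed precision: Q rises when cliques are merged in pairs, while Qds drops from about 0.872 to about 0.430, so Qds prefers the 30 individual cliques.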
Thanks!
Q&A
Example of Two Weakly Connected Communities
                Modularity (Q)   Split Penalty (SP)   Qs = Q – SP   Qds
2 communities   0.309            0.25                 0.0586        0.264
1 community     -0.00586         0.125                -0.131        0.202