Community Detection Algorithm and Community Quality Metric

Mingming Chen & Boleslaw K. Szymanski
Department of Computer Science
Rensselaer Polytechnic Institute

Community Structure

- Many networks display community structure: groups of nodes within which connections are denser than between them.
- Community detection algorithms
- Community quality metrics

Two Related Community Detection Topics

- Community detection algorithm
  - LabelRank: a stabilized label propagation community detection algorithm (Xie and Szymanski, 2013).
  - LabelRankT: an extension of LabelRank for dynamic networks (Xie, Chen, and Szymanski, 2013).
- A new community quality metric that addresses two problems of Modularity (M. E. J. Newman, 2006; Newman and Girvan, 2004).

LabelRank Algorithm

Four operators are applied to the labels:

- Label propagation operator
- Inflation operator
- Cutoff operator
- Conditional update operator

[Figure: four nodes answer the question "NP = P?". Node 1: No; Node 2: No; Node 3: No; Node 4: Yes. Annotations recovered from the figure: P1(No) = 3/100 and P1(Yes) = 97/100, giving Node 1: Yes; P1(No) = 3/4 and P1(Yes) = 1/4, giving Node 1: No.]

Label Propagation Operator

The operator is $W \times P$, where:

- $W$ is the $n \times n$ weighted adjacency matrix.
- $P$ is the $n \times n$ label probability distribution matrix, composed of $n$ row vectors $P_i$ of size $1 \times n$, one for each node.
- Each element $P_i(c)$ holds the current estimate of the probability that node $i$ observes label $c \in C$, where $C$ is the set of labels (here, suppose $C = \{1, 2, \ldots, n\}$). Example: $P_i = (0.1, 0.2, \ldots, 0.05, \ldots)$.

To initialize $P$, each node is assigned a probability distribution over all of its incoming edges:

\[
P_i(c) = \frac{w_{ic}}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C \ \text{s.t.}\ w_{ic} > 0.
\]

Each node then receives the label probability distributions of its neighbors and computes its new distribution:

\[
P'_i(c) = \frac{\sum_{j \in Nb(i)} w_{ij} P_j(c)}{\sum_{k \in Nb(i)} w_{ik}}, \quad \forall c \in C.
\]
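The initialization and propagation rules above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names `init_labels` and `propagate`, the dense list-of-lists matrix, and the 10-node graph are assumptions made here. The graph (node 1 linked to nodes 2, 3, 4, each of which has two further neighbors, with a self-loop of weight 1 on every node) was chosen so that the result is consistent with the example vectors given for this operator on the slides.

```python
def init_labels(W):
    """Initial distribution: P_i(c) = w_ic / sum_k w_ik (zero where w_ic = 0)."""
    P = []
    for row in W:
        total = sum(row)
        P.append([w / total for w in row])
    return P


def propagate(W, P):
    """One propagation step: P'_i(c) = sum_j w_ij * P_j(c) / sum_k w_ik."""
    n = len(W)
    new_P = []
    for i in range(n):
        total = sum(W[i])
        new_P.append([
            sum(W[i][j] * P[j][c] for j in range(n)) / total
            for c in range(n)
        ])
    return new_P


def build_example_graph():
    """Hypothetical 10-node graph (an assumption, not taken from the slides)."""
    n = 10
    edges = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6),
             (3, 7), (3, 8), (4, 9), (4, 10)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        W[i][i] = 1.0  # assumed self-loop, so each node also keeps its own label
    for a, b in edges:
        W[a - 1][b - 1] = 1.0
        W[b - 1][a - 1] = 1.0
    return W


W = build_example_graph()
P_init = init_labels(W)
P_next = propagate(W, P_init)

print(P_init[0])  # node 1 before propagation
print(P_next[0])  # node 1 after one propagation step
```

A dense matrix keeps the sketch short; for large sparse networks an adjacency list would be the more natural choice, since each step only needs to visit the edges.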
Example (one propagation step at node 1, whose neighbors are nodes 2, 3 and 4):

  Before propagation:
    P1 = (0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0, 0, 0)
    P2 = (0.25, 0.25, 0, 0, 0.25, 0.25, 0, 0, 0, 0)
    P3 = (0.25, 0, 0.25, 0, 0, 0, 0.25, 0.25, 0, 0)
    P4 = (0.25, 0, 0, 0.25, 0, 0, 0, 0, 0.25, 0.25)
  After propagation:
    P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)

Inflation Operator

Each element $P_i(c)$ is raised to the power $in$:

\[
\Gamma_{in} P_i(c) = \frac{P_i(c)^{in}}{\sum_{j \in C} P_i(j)^{in}}.
\]

During label propagation, inflation increases the probabilities of labels that already have high probability and decreases those of labels with low probability.

Example ($in = 2$):

  P1 = (0.25, 0.125, 0.125, 0.125, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625)
  After $\Gamma_{in}$:
  P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)

Cutoff Operator

- The cutoff operator $\Phi_r$ on $P$ removes labels whose probability is below the threshold $r \in [0, 1]$. It is helped by the inflation operator, which decreases the probabilities of low-probability labels during propagation.
- $\Phi_r$ efficiently reduces the space complexity from quadratic to linear.

Example ($r = 0.1$):

  P1 = (0.129, 0.0323, 0.0323, 0.0323, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806, 0.00806)
  After $\Phi_r$:
  P1 = (0.129)

With $r = 0.1$, the average number of labels per node is less than 3.

Conditional Update Operator

At each iteration, a node $i$ is updated only when it is significantly different from its incoming neighbors in terms of labels:

\[
\sum_{j \in Nb(i)} \mathrm{isSubset}(C_i^*, C_j^*) \le q k_i,
\]

where:

- $C_i^*$ is the set of maximum-probability labels at node $i$ at the previous step.
- $\mathrm{isSubset}(s_1, s_2)$ returns 1 if $s_1 \subseteq s_2$ and 0 otherwise. It can be viewed as a measure of similarity between two nodes.
- $k_i$ is the degree of node $i$, and $q \in [0, 1]$.

Effect of Conditional Update Operator

[Figure: effect of the conditional update operator.]

Running Time of LabelRank

- $O(Tm)$, where $m$ is the number of edges and $T$ is the number of iterations.
- LabelRank is a linear-time algorithm.

Performance of LabelRank

[Figure: performance of LabelRank.]

LabelRankT

LabelRankT is LabelRank with one extra conditional update rule: only nodes involved in changes are updated.
Changes are handled by comparing the neighbors of node $i$ at two consecutive steps, $Nb_{t-1}(i)$ and $Nb_t(i)$.

Two Problems of Modularity Maximization

- Splitting large communities: Modularity maximization favors small communities.
- Resolution limit problem: Modularity maximization favors large communities. Modularity optimization may fail to discover communities smaller than a certain scale, even when the communities are unambiguously defined. This scale depends on the total number of edges in the network and on the degree of interconnectedness of the communities.

Fortunato et al, 2008; Li et al, 2008; Arenas et al, 2008; Berry et al, 2009; Good et al, 2010; Ronhovde et al, 2010; Fortunato, 2010; Lancichinetti et al, 2011; Traag et al, 2011; Darst et al, 2013.

Modularity

Modularity (Q): the fraction of edges falling within communities minus the expected value of that fraction in an equivalent network with edges placed at random (M. E. J. Newman, 2006):

\[
Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j},
\qquad
\delta_{c_i, c_j} =
\begin{cases}
1 & \text{if nodes } i \text{ and } j \text{ are in the same community,} \\
0 & \text{otherwise.}
\end{cases}
\]

Equivalent definition (Newman and Girvan, 2004):

\[
Q = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} \right)^2 \right],
\]

where $|E_{c_i}^{in}|$ is the number of intra-community edges of community $c_i$ and $|E_{c_i}^{out}|$ is the number of inter-community edges of community $c_i$.

Modularity with Split Penalty

Modularity (Q): the modularity of the community detection result:

\[
Q = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j}.
\]

Split penalty (SP): the fraction of edges that connect nodes of different communities:

\[
SP = \frac{1}{2|E|} \sum_{ij} A_{ij} \left( 1 - \delta_{c_i, c_j} \right).
\]

Qs = Q - SP addresses Modularity's problem of favoring small communities:

\[
Q_s = Q - SP = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2|E|} \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij} \left( 1 - \delta_{c_i, c_j} \right).
\]

Qs with Community Density

- Resolution limit: Modularity optimization may fail to detect communities smaller than a certain scale.
- Idea: put community density into both Modularity and Split Penalty to address the resolution limit problem.

\[
Q_{ds} = \frac{1}{2|E|} \sum_{ij} \left[ A_{ij} d_{c_i} - \frac{k_i k_j}{2|E|} d_{c_i}^2 \right] \delta_{c_i, c_j} - \frac{1}{2|E|} \sum_{ij} A_{ij} d_{c_i, c_j} \left( 1 - \delta_{c_i, c_j} \right),
\]

\[
d_{c_i} = \frac{|E_{c_i}^{in}|}{|c_i| (|c_i| - 1) / 2},
\qquad
d_{c_i, c_j} = \frac{|E_{c_i, c_j}|}{|c_i| |c_j|}.
\]

Equivalent definition:

\[
Q_{ds} = \sum_{c_i \in C} \left[ \frac{|E_{c_i}^{in}|}{|E|} d_{c_i} - \left( \frac{2|E_{c_i}^{in}| + |E_{c_i}^{out}|}{2|E|} d_{c_i} \right)^2 - \sum_{c_j \in C,\, c_j \ne c_i} \frac{|E_{c_i, c_j}|}{2|E|} d_{c_i, c_j} \right].
\]

Example of Two Well-Separated Communities

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.5              0                    0.5           0.5
  1 community      0                0                    0             0.245

Example of Two Weakly Connected Communities

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.357            0.143                0.214         0.339
  1 community      0                0                    0             0.25

Ambiguity between One and Two Communities

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.3              0.2                  0.1           0.263
  1 community      0                0                    0             0.249

Ambiguity between One and Two Communities

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.25             0.25                 0             0.188
  1 community      0                0                    0             0.245

Example of One Well Connected Community

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.167            0.333                -0.167        0.0417
  1 community      0                0                    0             0.23

Example of One Very Well Connected Community

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.0455           0.455                -0.409        -0.239
  1 community      0                0                    0             0.168

Example of One Complete Graph

Community quality on a complete graph with 8 nodes:

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    -0.0714          0.571                -0.643        -0.643
  1 community      0                0                    0             0

Modularity Has Nothing to Do with #Nodes

\[
Q(\text{clique}) = Q(\text{tree}) = 2 \times \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 \right] = 0.4231;
\]

\[
Q_s(\text{clique}) = Q_s(\text{tree}) = 2 \times \left[ \frac{12}{26} - \left( \frac{13}{26} \right)^2 - \frac{1}{26} \right] = 0.3462;
\]

\[
Q_{ds}(\text{clique}) = 2 \times \left[ \frac{12}{26} \times 1 - \left( \frac{13}{26} \times 1 \right)^2 - \frac{1}{26} \times \frac{1}{4 \times 4} \right] = 0.4183;
\]

\[
Q_{ds}(\text{tree}) = 2 \times \left[ \frac{12}{26} \times \frac{2}{7} - \left( \frac{13}{26} \times \frac{2}{7} \right)^2 - \frac{1}{26} \times \frac{1}{7 \times 7} \right] = 0.2214.
\]

5-clique Example

                    Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  30 communities    0.8758           0.09091              0.7848        0.8721
  15 communities    0.8879           0.04545              0.8424        0.4305

∆Qs = (0.8424 - 0.7848) = 0.0576 > ∆Q = (0.8879 - 0.8758) = 0.0121

Thanks! Q&A

Example of Two Weakly Connected Communities

                   Modularity (Q)   Split Penalty (SP)   Qs = Q - SP   Qds
  2 communities    0.309            0.25                 0.0586        0.264
  1 community      -0.00586         0.125                -0.131        0.202
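To make the four metrics concrete, here is a minimal sketch that evaluates Q, SP, Qs and Qds from the community-level ("equivalent") definitions given earlier in the deck. It is an illustration under stated assumptions, not the authors' code: the function name `quality_metrics` is hypothetical, the graph is assumed to be undirected, unweighted and simple (no self-loops or repeated edges), communities are assumed to be disjoint, and the density of a single-node community is taken to be 0. The two test graphs are also assumptions: a complete graph on 8 nodes, and two 4-cliques joined by a single edge, whose edge counts match the clique calculation on the slides.

```python
from itertools import combinations


def quality_metrics(edges, communities):
    """Return (Q, SP, Qs, Qds) for an undirected, unweighted simple graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    communities: list of disjoint node lists covering every node.
    """
    m = len(edges)
    label = {v: idx for idx, comm in enumerate(communities) for v in comm}
    k = len(communities)
    e_in = [0] * k    # intra-community edge counts |E_in|
    e_out = [0] * k   # inter-community edge counts |E_out|
    e_between = {}    # |E_{ci,cj}| for each unordered community pair
    for u, v in edges:
        a, b = label[u], label[v]
        if a == b:
            e_in[a] += 1
        else:
            e_out[a] += 1
            e_out[b] += 1
            key = (min(a, b), max(a, b))
            e_between[key] = e_between.get(key, 0) + 1

    q = sp = qds = 0.0
    for a in range(k):
        size = len(communities[a])
        pairs = size * (size - 1) / 2
        d_in = e_in[a] / pairs if pairs > 0 else 0.0  # assumed 0 for singletons
        frac_in = e_in[a] / m
        frac_deg = (2 * e_in[a] + e_out[a]) / (2 * m)
        q += frac_in - frac_deg ** 2
        sp += e_out[a] / (2 * m)
        qds += frac_in * d_in - (frac_deg * d_in) ** 2
    for (a, b), count in e_between.items():
        d_ab = count / (len(communities[a]) * len(communities[b]))
        # Each unordered pair appears twice in the double sum over ci and cj != ci.
        qds -= 2 * (count / (2 * m)) * d_ab
    return q, sp, q - sp, qds


# Complete graph on 8 nodes, split into two communities of 4 nodes each.
k8_edges = list(combinations(range(8), 2))
k8_split = quality_metrics(k8_edges, [[0, 1, 2, 3], [4, 5, 6, 7]])
k8_whole = quality_metrics(k8_edges, [list(range(8))])

# Two 4-cliques joined by one edge, each clique taken as a community.
clique_edges = (list(combinations(range(4), 2))
                + list(combinations(range(4, 8), 2))
                + [(3, 4)])
two_cliques = quality_metrics(clique_edges, [[0, 1, 2, 3], [4, 5, 6, 7]])

print([round(x, 4) for x in k8_split])
print([round(x, 4) for x in k8_whole])
print([round(x, 4) for x in two_cliques])
```

On these assumed graphs the results agree, to the precision shown on the slides, with the complete-graph table (-0.0714, 0.571, -0.643, -0.643 for two communities; all zeros for one community) and with the clique values 0.4231, 0.3462 and 0.4183.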