Community Detection - UBC Department of Computer Science

Community Detection
Laks V.S. Lakshmanan
(based on M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E 69, 026113 (2004); and M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E 69, 066133 (2004).)

The Problem
◦ Can we partition the network into groups s.t. the inter-group edges are sparse while the intra-group edges are dense?
◦ Why is it interesting/useful?
  ◦ Understanding community structure is a means to understanding network structure.
  ◦ Graph partitioning is a similar problem: given a graph of processes whose edges represent communication, assign subgraphs to processors so as to minimize inter-processor communication and balance processor load. (NP-hard in general.)
  ◦ How it differs from graph partitioning.

An Example with Three Communities
[Figure: example network with three communities.]

A Hierarchical Clustering Approach
1. Define a notion of similarity or affinity between nodes. E.g.:
2. 𝑠𝑖𝑚(𝑥, 𝑦) := #node-disjoint paths between 𝑥 and 𝑦.
3. 𝑠𝑖𝑚(𝑥, 𝑦) := #edge-disjoint paths between 𝑥 and 𝑦.
4. 𝑠𝑖𝑚(𝑥, 𝑦) := weighted sum of all paths, with longer paths weighted down, e.g., Katz!
5. Qn: how can we compute #2, 3 fast?
6. (Efficient algorithms for Katz have been developed.)

Community detection via hierarchical clustering
◦ Compute all pairwise node similarities for every edge present.
◦ Repeatedly add edges with greatest similarity.
◦ This leads to a tree (called a dendrogram).
◦ A slice through the dendrogram represents a clustering, i.e., a community structure.

Dendrogram example
[Figure: dendrogram.]

Limitations of HC approach
◦ “Misplaces” nodes in the periphery.
◦ E.g.: [Figure: small graph on nodes 1–5.] Which community should 5 belong to?
◦ Alternative approach based on “edge betweenness”.

Key Intuition
◦ An inter-community edge has a higher “betweenness” than an intra-community edge, i.e., more shortest paths (geodesics) between node pairs pass through it.
◦ Start with G.
◦ Repeatedly remove edges with highest betweenness until <some stopping criterion>.
◦ Communities = resulting components.
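The remove-the-highest-betweenness-edge loop from the Key Intuition slide can be sketched on a toy graph. This is only an illustration under stated assumptions: the stopping criterion (a target number of components `k`), the function names, and the brute-force betweenness that enumerates every geodesic are all choices made here, not taken from the slides, and the brute-force approach is only sensible for tiny graphs.

```python
from collections import deque
from itertools import combinations


def bfs_dist(adj, src):
    """Hop distance from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def shortest_paths(adj, s, t):
    """All geodesics from s to t, each as a list of nodes (brute force)."""
    dist = bfs_dist(adj, t)                  # distance of every node to t
    if s not in dist:
        return []
    paths, stack = [], [[s]]
    while stack:
        path = stack.pop()
        u = path[-1]
        if u == t:
            paths.append(path)
            continue
        for v in adj[u]:
            if dist.get(v) == dist[u] - 1:   # v is one step closer to t
                stack.append(path + [v])
    return paths


def edge_betweenness(adj):
    """By definition: each node pair spreads one unit evenly over its geodesics."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s, t in combinations(sorted(adj), 2):
        paths = shortest_paths(adj, s, t)
        for path in paths:
            for u, v in zip(path, path[1:]):
                bet[frozenset((u, v))] += 1.0 / len(paths)
    return bet


def components(adj):
    """Connected components as a list of node sets."""
    seen, comps = set(), []
    for s in adj:
        if s not in seen:
            comp = set(bfs_dist(adj, s))
            seen |= comp
            comps.append(comp)
    return comps


def split_into(adj, k):
    """Remove highest-betweenness edges until there are at least k components."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    while len(components(adj)) < k and any(adj.values()):
        bet = edge_betweenness(adj)               # recalculated every round
        u, v = max(bet, key=bet.get)              # ties broken arbitrarily
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Toy input (hypothetical): two triangles joined by the single edge 3-4.
two_triangles = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
                 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(split_into(two_triangles, 2))
```

On this input the bridge 3-4 carries every geodesic between the two triangles, so it is removed first and the two triangles come out as the communities.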
Basic Algorithm
repeat {
  ◦ Calculate betweenness of all edges;
  ◦ Remove one with highest betweenness, breaking ties arbitrarily;
} until no edges left.

Remarks:
◦ Which betweenness score?
◦ Calculate upfront and reuse, or recalculate?
◦ Can we incrementally recalculate after each edge removal?
◦ Related algorithms for node betweenness by Newman and Brandes.

A Real Example (Zachary’s Karate Club)
[Figures: result with recalculation of betweenness; result without recalculation of betweenness.]

Scalability Issues
◦ Edge betweenness for all edges can be computed in time 𝑂(𝑚𝑛) (𝑚 = #edges, 𝑛 = #nodes). [Newman 2001] – details soon.
◦ Recalculation makes the algorithm 𝑂(𝑚²𝑛), so it is not feasible for large networks.

Computing edge betweenness: An Example
Breadth-first search – means for doing many things.
[Figure: graph on nodes a, b, c, d, e, f, g; the labels below are added one slide at a time.]
◦ Goal: compute #geodesics from every node to g.
◦ By BFS from g, label each node with d = distance to g and w = #geodesics to g.
◦ g: d=0, w=1.
◦ d, e: d=1, w=1.
◦ b, c, f: d=2, w=2.
◦ a: d=3, w=4.
◦ Have all info. we need for edge betweenness now.
◦ Note: a and f are like leaves: no geodesic to g from other nodes passes through them.
◦ Edges at the leaves: each of the two edges at a gets 2/4; each of the two edges at f gets 1/2.
◦ One level closer to g: two edges get ½(1 + 2/4) each.
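The d and w labels in the example above can be reproduced with a single BFS from g. The figure's edges did not survive in this copy, so the edge list `EDGES` below is an assumption, chosen only because it yields the same labels as the slides; the function names are likewise illustrative.

```python
from collections import deque

# Hypothetical edge list: picked only so that BFS from g gives the slide labels.
EDGES = [("g", "d"), ("g", "e"),
         ("d", "b"), ("d", "c"), ("d", "f"),
         ("e", "b"), ("e", "c"), ("e", "f"),
         ("b", "a"), ("c", "a")]


def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def geodesic_counts(adj, target):
    """Return (d, w): hop distance to target and number of geodesics to it."""
    d = {target: 0}
    w = {target: 1}
    queue = deque([target])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in d:            # first visit: v is one hop further out
                d[v] = d[u] + 1
                w[v] = w[u]
                queue.append(v)
            elif d[v] == d[u] + 1:    # another geodesic reaches v through u
                w[v] += w[u]
    return d, w


d, w = geodesic_counts(build_adj(EDGES), "g")
for node in sorted(d, key=lambda x: (d[x], x)):
    print(node, "d =", d[node], "w =", w[node])
```

A node's w is final by the time it leaves the queue, because all nodes one hop closer to the target were dequeued before it; that is why one pass suffices.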
Computing edge betweenness: An Example (contd.)
◦ Final step: the edge into g whose far endpoint has three scored edges behind it gets 1/1 [1 + ½(1 + 2/4) + ½(1 + 2/4) + 1/2] = 3.

EB Computation summary
◦ For any one target node, compute weights of nodes by BFS; 𝑤(𝑥) = #geodesics from 𝑥 to target.
◦ Suppose 𝑥 – 𝑦 – rest of 𝐺 (containing target), i.e., 𝑦 is a neighbor of 𝑥 that is one step closer to the target.
◦ Then intuitively, a fraction 𝑤(𝑦)/𝑤(𝑥) of the geodesics from 𝑥 to the target node go through 𝑦.

EB Computation summary (contd.)
◦ For any edge 𝑥𝑦 (𝑥 further from target than 𝑦), 𝑏𝑒𝑡(𝑥𝑦) = 𝑤(𝑦)/𝑤(𝑥) · [1 + sum of 𝑏𝑒𝑡 of all edges “before” this edge], where the edges “before” 𝑥𝑦 are those joining 𝑥 to neighbors one step further from the target.
◦ The above is w.r.t. a specific target node.
◦ Overall bet for any edge = sum of bet w.r.t. every node treated as target node.

EB computation – complexity analysis
◦ For any one target node, BFS gives bet of every edge w.r.t. that target node, in 𝑂(𝑚) time.
◦ Doing so for every node treated as target node ⇒ 𝑂(𝑚𝑛) time for the final betweenness score of every edge.
◦ Quite elegant, but recalculation bumps up complexity to 𝑂(𝑚²𝑛): up to 𝑚 edge removals, each followed by an 𝑂(𝑚𝑛) recomputation.
◦ Need more scalable approaches for CD.

On scaling up CD algorithm
◦ Determine intelligently which edges need their bet recalculated when an edge is removed.
  ◦ When 𝑒 is removed, 𝑏𝑒𝑡(𝑒’) needs to be recalculated only if 𝑒’ is in the same connected component as 𝑒. For a very large component, this doesn’t prune much.
  ◦ Perhaps it’s only important to determine the edge with the next highest bet.
◦ Can we maintain enough “state” so that when 𝑒 is removed, we can recalculate 𝑏𝑒𝑡(𝑒’) incrementally, i.e., not from scratch? Point to ponder!

Closing Remarks 1/2
◦ Newman also proposed other bases for defining edge betweenness.
  ◦ Electrical current flow through the edge, where every edge is viewed as a unit resistance and we consider all source-sink pairs.
  ◦ Based on random walks.
◦ Both are less effective and more expensive than geodesics (see paper for details).
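Returning to the geodesic-based definition, the rule from the EB computation summary, 𝑏𝑒𝑡(𝑥𝑦) = 𝑤(𝑦)/𝑤(𝑥) · [1 + bet of edges “before” 𝑥𝑦], can be sketched as one BFS followed by a pass over the nodes from farthest to nearest. The edge list is again an assumption (the figure's edges are not recoverable here; this set merely agrees with the slide labels), and the names `credits_for_target`, `below`, and `edge_betweenness` are illustrative.

```python
from collections import deque

# Hypothetical edge list, consistent with the d/w labels in the example.
EDGES = [("g", "d"), ("g", "e"),
         ("d", "b"), ("d", "c"), ("d", "f"),
         ("e", "b"), ("e", "c"), ("e", "f"),
         ("b", "a"), ("c", "a")]


def build_adj(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def credits_for_target(adj, target):
    """Edge scores w.r.t. one target: bet(xy) = w(y)/w(x) * (1 + scores behind x)."""
    d, w, order = {target: 0}, {target: 1}, []
    queue = deque([target])
    while queue:                       # BFS: distances and geodesic counts
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in d:
                d[v], w[v] = d[u] + 1, w[u]
                queue.append(v)
            elif d[v] == d[u] + 1:
                w[v] += w[u]
    below = {u: 0.0 for u in order}    # total score on edges arriving at u from farther nodes
    bet = {}
    for x in reversed(order):          # farthest nodes first
        for y in adj[x]:
            if d[y] == d[x] - 1:       # y is one hop closer to the target
                score = w[y] / w[x] * (1.0 + below[x])
                bet[frozenset((x, y))] = score
                below[y] += score
    return bet


def edge_betweenness(adj):
    """Overall score of an edge: sum of its per-target scores over all targets."""
    total = {}
    for target in adj:
        for edge, score in credits_for_target(adj, target).items():
            total[edge] = total.get(edge, 0.0) + score
    return total


bet_g = credits_for_target(build_adj(EDGES), "g")
print(bet_g[frozenset(("a", "b"))])   # 2/4
print(bet_g[frozenset(("b", "d"))])   # 1/2 * (1 + 2/4)
print(bet_g[frozenset(("d", "g"))])   # 1/1 * (1 + 0.75 + 0.75 + 0.5)
```

Each call to `credits_for_target` touches every edge a constant number of times, which matches the 𝑂(𝑚) per-target cost and the 𝑂(𝑚𝑛) total stated in the complexity analysis.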
Closing Remarks 1/2 (contd.)
◦ What about directed and weighted cases?

Closing Remarks 2/2
◦ Goodness metric of community division. Helpful when we don’t know the ground truth.
◦ $Q = \sum_i (e_{ii} - a_i^2)$, where $E_{k \times k}$ = matrix of community division: $e_{ij}$ = fraction of edges linking comm. i to comm. j; $a_i = \sum_j e_{ij}$.
◦ Q measures the fraction of intra-comm. edges in excess of what is expected by chance (assuming uniform distribution).
◦ See paper for details of experimental results.
◦ Turns out study of influence/information propagation can suggest new ways of detecting communities: will revisit this issue after we study influence propagation.

Recommended Reading
◦ J. Ruan and W. Zhang. An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks. ICDM 2007.
◦ M. E. J. Newman. Modularity and community structure in networks. physics/0602124 = Proceedings of the National Academy of Sciences (USA) 103 (2006): 8577–8582.
◦ Jure Leskovec, Kevin J. Lang, and Michael W. Mahoney. Empirical Comparison of Algorithms for Network Community Detection. WWW 2010.
◦ M. E. J. Newman. Communities, modules and large-scale structure in networks. Nature Physics 8, 25–31 (2012). doi:10.1038/nphys2162. Published online 22 December 2011.
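For the goodness metric Q from Closing Remarks 2/2, a minimal sketch for an undirected edge list might look as follows. One assumption is made explicit here: an edge between two different communities i and j is split evenly between e_ij and e_ji, so that the entries of E sum to 1. The function name and the toy graph are illustrative only.

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected edge list and a node -> community map."""
    m = len(edges)
    e_in = {}   # e_ii: fraction of edges with both ends in community i
    a = {}      # a_i: row sum of E, i.e. fraction of edge ends attached to community i
    for u, v in edges:
        cu, cv = community[u], community[v]
        if cu == cv:
            e_in[cu] = e_in.get(cu, 0.0) + 1.0 / m
        # Assumption: each edge contributes half of its 1/m weight per endpoint.
        a[cu] = a.get(cu, 0.0) + 0.5 / m
        a[cv] = a.get(cv, 0.0) + 0.5 / m
    return sum(e_in.get(c, 0.0) - a[c] ** 2 for c in a)


# Toy input (hypothetical): two triangles joined by the single edge 3-4.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
split = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
lumped = {n: "A" for n in range(1, 7)}
print(modularity(edges, split))    # 2 * (3/7 - (1/2)^2) = 5/14
print(modularity(edges, lumped))   # one big community scores about 0
```

Putting every node into one community gives e_11 = a_1 = 1 and hence Q = 0, which is why Q is useful for picking a slice of the dendrogram when no ground truth is available.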
