Efficient and Effective Local Algorithms for Analyzing Massive Graphs
Yubao Wu
Case Western Reserve University
August 28, 2015

Graphs are Everywhere
• Social networking websites
• Biological networks
• Research collaboration networks
• Citation networks
• Product co-purchasing networks
• Internet peer-to-peer networks
• Road networks
• …

Primitive Tasks
Task | Global | Local (query biased)
Community detection | Graph partitioning | Local community detection
Ranking | PageRank | Top-$k$ query; random walk with restart
Densest subgraph | Global densest subgraph | Local dense subgraph near the query node
Applications:
• Recommendation
• Disease gene discovery
• Advertisement
• Disease pathway discovery

Contributions
Task | Limitations of existing works | Our contributions
Local community detection | Free rider effect | Query biased densest subgraph
Top-$k$ proximity query | Global: expensive; local: approximate; specific | Simple, unified and exact local search
Densest subgraph detection | Single network; co-dense | Dual networks; densest connected subgraph
• Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: on free rider effect and its elimination. PVLDB, 2015.
• Yubao Wu, Ruoming Jin, Xiang Zhang. Fast and unified local search for random walk based $k$-nearest-neighbor query in large graphs. SIGMOD, 2014.
• Yubao Wu, Ruoming Jin, Xiaofeng Zhu, and Xiang Zhang. Finding dense and connected subgraphs in dual networks. ICDE, 2015.

Generic Local Community Detection Problem
Input:
a) Graph $G(V, E)$
b) A set of query nodes $Q$
c) A goodness metric $f(S)$
Output: Subgraph $G[S]$ such that:
1) $S$ contains $Q$ ($Q \subseteq S$)
2) $f(S)$ is maximized
[1] M. Sozio, et al. KDD'10. [2] W. Cui, et al. SIGMOD'14. [3] L. Ma, et al. DaWaK'13. [4] B. Saha, et al. RECOMB'10. [5] C. Tsourakakis, et al. SIGMOD'14. [6] A. Clauset, PRE'05. [7] F. Luo, et al. WIAS'08. [8] R. Andersen, et al. FOCS'06.

Community Goodness Metrics
Intuition | Goodness metric | Ref. | Formula $f(S)$
Internal denseness | Classic density | [1] | $e(S)/|S|$
Internal denseness | Edge-surplus | [2] | $e(S) - \alpha h(|S|)$, with concave $h(x)$; here $h(x) = \binom{x}{2}$
Internal denseness | Minimum degree | [3,4] | $\min_{u \in S} w_S(u)$
Internal denseness & external sparseness | Subgraph modularity | [5] | $e(S)/e(S, \bar{S})$
Internal denseness & external sparseness | Density-isolation | [6] | $e(S) - \alpha e(S, \bar{S}) - \beta |S|$
Internal denseness & external sparseness | External conductance | [7] | $e(S, \bar{S}) / \min\{\phi(S), \phi(\bar{S})\}$
Boundary sharpness | Local modularity | [8] | $e(\delta S, S)/e(\delta S, V)$
[1] B. Saha, et al. RECOMB'10. [2] C. Tsourakakis, et al. SIGMOD'14. [3] M. Sozio, et al. KDD'10. [4] W. Cui, et al. SIGMOD'14. [5] F. Luo, et al. WIAS'08. [6] K. J. Lang, CIKM'07. [7] R. Andersen, et al. FOCS'06. [8] A. Clauset, PRE'05.

Free Rider Effect
Goodness metric | A | A∪B | A∪C
Classic density | 2.50 | 2.95 | 2.83
Edge-surplus | 15.3 | 26.5 | 22.8
Minimum degree | 4 | 4 | 4
Subgraph modularity | 2.0 | 3.6 | 4.6
Density-isolation | -2.6 | 3.8 | 1.5
Ext. conductance | 0.25 | 0.14 | 0.11
Local modularity | 0.63 | 0.70 | 0.78

Query Biased Node Weighting
Node weight: $\pi(u) = 1/r(u)$, where $r(u)$ is the proximity value of $u$ w.r.t. the query
Query biased density: $\rho(S) = e(S)/\pi(S)$, where $\pi(S) = \sum_{u \in S} \pi(u)$ is the sum of node weights
Subgraph A becomes the query biased densest subgraph.
Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: on free rider effect and its elimination. PVLDB, 8(7):798-809, 2015.

Query Biased Densest Connected Subgraph (QDC) Problem and Two Related Problems
Problem | Input | Output $G[S]$ | Complexity | Optimality
QDC | 1) $G(V, E)$ 2) query $Q$ | 1) $S$ contains $Q$ 2) $\rho(S)$ is maximized 3) $G[S]$ is connected | NP-hard |
QDC' | 1) $G(V, E)$ 2) query $Q$ | 1) $S$ contains $Q$ 2) $\rho(S)$ is maximized | Polynomial | Optimal if $G[S]$ is connected
QDC'' | $G(V, E)$ | $\rho(S)$ is maximized | Polynomial | Optimal if $S$ contains $Q$

Experiments: Datasets
Dataset | # Nodes | # Edges | # Communities
Amazon | 334,863 | 925,872 | 151,037
DBLP | 317,080 | 1,049,866 | 13,477
Youtube | 1,134,890 | 2,987,624 | 8,385
Orkut | 3,072,441 | 117,185,083 | 6,288,363
LiveJournal | 3,997,962 | 34,681,189 | 287,512
Friendster | 65,608,366 | 1,806,067,135 | 957,154
[1] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In ICDM, 2012. [2] snap.stanford.edu

Experiments: State-of-the-Art Methods
Class | Abbr. | Ref. | Key idea
Internal denseness | DS | [1] | Densest subgraph with query constraint
Internal denseness | OQC | [2] | Optimal quasi-clique; edge-surplus
Internal denseness | MDG | [3] | Minimum degree
Internal denseness & external sparseness | PRN | [4] | External conductance
Internal denseness & external sparseness | LS | [5] | Local spectral
Internal denseness & external sparseness | EMC | [6] | More internal edges than external edges
Internal denseness & external sparseness | SM | [7] | Subgraph modularity
Boundary sharpness | LM | [8] | Local modularity
[1] B. Saha, et al. RECOMB'10. [2] C. Tsourakakis, et al. SIGMOD'14. [3] M. Sozio, et al. KDD'10. [4] R. Andersen, et al. FOCS'06. [5] M. W. Mahoney, et al. JMLR'12. [6] G. W. Flake, KDD'00. [7] F. Luo, et al. WIAS'08. [8] A. Clauset, PRE'05.

Experiments: Effectiveness Evaluation Metrics
Metric | Formula
F-score | $F(S, T) = \frac{2 \cdot \text{precision}(S, T) \cdot \text{recall}(S, T)}{\text{precision}(S, T) + \text{recall}(S, T)}$
Community goodness metrics:
Density | $e(S)/|S|$
Cohesiveness | $\min_{S' \subset S} \frac{e(S', S \setminus S')}{\min\{\phi(S'), \phi(S \setminus S')\}}$
Separability | $e(S)/e(S, \bar{S})$
Consistency | one minus the deviation of the F-scores from $F_{\text{mean}}$, taken over query subsets $Q'$ of a fixed size
[1] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. In ICDM, pages 745-754, 2012.
[2] Ma, Lianhang, et al. GMAC: A seed-insensitive approach to local community detection. In DaWaK, pages 297-308, 2013.
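
The quantities used in this part of the talk can be made concrete with a small sketch. The code below computes the classic density $e(S)/|S|$, the query biased density $\rho(S) = e(S)/\pi(S)$ with $\pi(u) = 1/r(u)$, and the F-score. The toy graph, the set names, and the proximity values in `toy_proximity` are hypothetical and chosen by hand for illustration; in the talk the proximity values come from a random walk based measure.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def internal_edges(edges: Iterable[Edge], nodes: Set[Node]) -> int:
    """e(S): number of edges with both endpoints inside S."""
    return sum(1 for u, v in edges if u in nodes and v in nodes)


def classic_density(edges: Iterable[Edge], nodes: Set[Node]) -> float:
    """Classic density e(S) / |S|."""
    return internal_edges(edges, nodes) / len(nodes)


def query_biased_density(edges: Iterable[Edge], nodes: Set[Node],
                         proximity: Dict[Node, float]) -> float:
    """rho(S) = e(S) / pi(S), with node weight pi(u) = 1 / r(u)."""
    total_weight = sum(1.0 / proximity[u] for u in nodes)
    return internal_edges(edges, nodes) / total_weight


def f_score(found: Set[Node], truth: Set[Node]) -> float:
    """Harmonic mean of precision and recall of `found` against `truth`."""
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy graph: triangle A (contains query node 1), a 4-clique B,
# and one bridge edge (3, 4) connecting them.
toy_edges = [(1, 2), (1, 3), (2, 3), (3, 4),
             (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7)]
community_a = {1, 2, 3}
free_rider_b = {4, 5, 6, 7}
# Hand-picked proximity values w.r.t. query node 1 (assumed, not computed).
toy_proximity = {1: 1.0, 2: 0.5, 3: 0.5, 4: 0.1, 5: 0.05, 6: 0.05, 7: 0.05}
```

On this toy input the classic density of A∪B exceeds that of A (a free rider), while the query biased density prefers A because nodes far from the query carry large weights.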

Effectiveness Evaluation: F-Score
F-score | QDC | DS | OQC | MDG | PRN | LS | EMC | SM | LM
Amazon | 0.83 | 0.52 | 0.54 | 0.46 | 0.69 | 0.66 | 0.61 | 0.60 | 0.58
DBLP | 0.46 | 0.31 | 0.33 | 0.32 | 0.48 | 0.42 | 0.34 | 0.36 | 0.37
Youtube | 0.43 | 0.23 | 0.22 | 0.17 | 0.26 | 0.24 | 0.21 | 0.21 | 0.22
Orkut | 0.47 | 0.15 | 0.16 | 0.13 | 0.21 | 0.17 | 0.19 | 0.16 | 0.18
LiveJournal | 0.64 | 0.48 | 0.47 | 0.40 | 0.52 | 0.51 | 0.47 | 0.48 | 0.49
Friendster | 0.32 | -- | 0.14 | 0.12 | 0.17 | 0.16 | -- | 0.14 | 0.13
Avg. F-score | 0.53 | 0.30 | 0.31 | 0.27 | 0.39 | 0.36 | 0.33 | 0.33 | 0.33
Avg. Precision | 0.65 | 0.46 | 0.45 | 0.29 | 0.51 | 0.41 | 0.34 | 0.38 | 0.48
Avg. Recall | 0.78 | 0.61 | 0.58 | 0.69 | 0.67 | 0.64 | 0.66 | 0.63 | 0.59

Effectiveness Evaluation: Goodness Metrics
[Figure: community goodness metrics on the LiveJournal graph]

Effectiveness Evaluation: Consistency
Consistency | QDC | DS | OQC | MDG | PRN | LS | EMC | SM | LM
Amazon | 0.94 | 0.77 | 0.76 | 0.58 | 0.79 | 0.69 | 0.74 | 0.67 | 0.61
DBLP | 0.88 | 0.62 | 0.64 | 0.37 | 0.65 | 0.53 | 0.56 | 0.43 | 0.56
Youtube | 0.85 | 0.61 | 0.54 | 0.46 | 0.71 | 0.41 | 0.57 | 0.37 | 0.36
Orkut | 0.83 | 0.56 | 0.52 | 0.32 | 0.68 | 0.43 | 0.51 | 0.54 | 0.47
LiveJournal | 0.93 | 0.74 | 0.67 | 0.43 | 0.84 | 0.64 | 0.73 | 0.58 | 0.52
Friendster | 0.78 | -- | 0.56 | 0.45 | 0.65 | 0.49 | -- | 0.32 | 0.39
Average | 0.87 | 0.64 | 0.62 | 0.44 | 0.72 | 0.53 | 0.61 | 0.49 | 0.49

Top-$k$ Proximity Query in Graphs
• Which nodes are most similar to the query node?
Random walk based proximity measures:
1) Random walk with restart
2) Hitting time
3) Commute time

Computational Methods for Top-$k$ Query
Methods | Key idea | Precomputation? | Applicability
Global iteration (GI) | Power method | No | Wide
Matrix based method [2] | Matrix decomposition | Yes | RWR
Graph embedding [3] | Graph embedding | Yes | RWR / HT / CT
Disadvantages:
• Iterating over the entire graph
• Pre-computing step is expensive
Challenge: an efficient local search method that
• Guarantees the exactness
• Applies to different measures
[1] Y. Fujiwara, et al. SIGMOD'13 [2] Tong, ICDM'06; Fujiwara, KDD'12; Fujiwara, VLDB'12 [3] X. Zhao, et al. VLDB'13

Our Method: FLoS (Fast Local Search)
Contributions:
1) Exact top-$k$ nodes
2) General method (a variety of proximity measures)
3) Simple local search strategy
• no pre-processing
• no global iteration

No Local Maximum Property
[Figure: proximity values on a grid graph around the query node. Panels: no local maximum; with local maximum]

Measures With and Without Local Maximum
Abbr. | Proximity measure | Local maximum?
HT | Hitting time | No
DHT | Discounted hitting time | No
THT | Truncated hitting time | No
PHP | Penalized hitting probability | No
EI | Effective importance (degree normalized RWR) | No
RWR | Random walk with restart | Yes
CT | Commute time | Yes

Bounding the Unvisited Nodes
[Figure: grid graph with the query node, showing visited nodes, unvisited nodes, and the boundary between them. Panels: no local maximum; with local maximum]

Bounding the Visited Nodes
[Figure: upper bound, exact proximity value, and lower bound for visited and unvisited nodes around the query]

Bounding the Visited Nodes: Monotonicity
[Figure: upper bound, exact proximity value, and lower bound for visited and unvisited nodes around the query]

Running Example
Iteration | Newly visited nodes
1 | {2,3}
2 | {4}
3 | {5}
4 | {6,7}
5 | {8}
[Figure panels: toy graph; top-2 nodes; trend of the bounds]

Relationships Among Proximity Measures
• Penalized hitting probability
• Effective importance
• Discounted hitting time
Theorem: PHP, EI, and DHT give the same ranking results.
• Random walk with restart
Theorem: $\text{RWR}(i) \propto \text{degree}(i) \cdot \text{PHP}(i)$
Note: RWR has local maximum.

Experiments: Datasets
Type | Dataset | Abbr. | #nodes | #edges
Real | Amazon | AZ | 334,863 | 925,872
Real | DBLP | DP | 317,080 | 1,049,866
Real | Youtube | YT | 1,134,890 | 2,987,624
Real | LiveJournal | LJ | 3,997,962 | 34,681,189
Synthetic | In-memory | -- | varying size; varying density
Synthetic | Disk-resident | -- | varying size

Experiments: State-of-the-Art Methods
Our methods (exact): FLoS_PHP, FLoS_RWR
Compared with | Abbr. | Key idea | Ref. | Exactness
FLoS_PHP | GI_PHP | Global iteration | -- | Exact
FLoS_PHP | DNE | Local search | CIKM'12 | Approx.
FLoS_PHP | NN_EI | Local search | CIKM'13 | Exact
FLoS_PHP | LS_EI | Local search | KDD'10 | Approx.
FLoS_RWR | GI_RWR | Global iteration | -- | Exact
FLoS_RWR | Castanet | Improved GI | SIGMOD'13 | Exact
FLoS_RWR | K-dash | Matrix inversion | VLDB'12 | Exact
FLoS_RWR | GE_RWR | Graph embedding | VLDB'13 | Approx.
FLoS_RWR | LS_RWR | Local search | KDD'10 | Approx.

Experiments: PHP, Real Graphs
[Figure panels: running time (AZ); # visited nodes]
• 1-3 orders of magnitude faster
• A small portion of the nodes are visited

Experiments: RWR, Real Graphs
[Figure panels: running time (AZ); # visited nodes]
• Precomputation-based baselines have long precomputing time
• Fast
• A small portion of the nodes are visited
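
For contrast with the local search above, the global iteration (GI) baseline can be sketched as a plain power method for RWR followed by a sort. This is a minimal illustration, not the implementation evaluated in the talk: the adjacency-dictionary format, the function names, the default `c = 0.5`, and the toy path graph are all assumptions, and `c` is taken to be the probability of continuing the walk (so `1 - c` is the restart probability).

```python
from typing import Dict, Hashable, List

Node = Hashable
Graph = Dict[Node, Dict[Node, float]]  # node -> {neighbor: edge weight}


def rwr_global_iteration(graph: Graph, query: Node, c: float = 0.5,
                         tol: float = 1e-10, max_iter: int = 1000) -> Dict[Node, float]:
    """Power method for RWR: every sweep touches every edge of the graph."""
    degree = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    scores = {u: 0.0 for u in graph}
    scores[query] = 1.0
    for _ in range(max_iter):
        nxt = {u: 0.0 for u in graph}
        for j, nbrs in graph.items():
            for i, w in nbrs.items():
                # The walker at j moves to neighbor i with probability w / degree[j].
                nxt[i] += c * (w / degree[j]) * scores[j]
        nxt[query] += 1.0 - c  # restart at the query node
        delta = max(abs(nxt[u] - scores[u]) for u in graph)
        scores = nxt
        if delta < tol:
            break
    return scores


def top_k(scores: Dict[Node, float], query: Node, k: int) -> List[Node]:
    """Return the k highest-scoring nodes other than the query."""
    ranked = sorted((u for u in scores if u != query),
                    key=lambda u: scores[u], reverse=True)
    return ranked[:k]


# Hypothetical toy input: an unweighted path 1 - 2 - 3 - 4.
toy_path: Graph = {
    1: {2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0, 4: 1.0},
    4: {3: 1.0},
}
toy_scores = rwr_global_iteration(toy_path, query=1)
```

Each sweep costs time proportional to the number of edges regardless of `k`, which is the cost that a local search avoids by visiting only nodes near the query.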

Extension to More Proximity Measures
Transition probabilities:
• PHP: $p_{i,j} = w_{i,j}/w_i$
• PHP': $p_{i,j} = w_{i,j}/w_{\max}$
• PHP'': $p_{i,j} = w_{i,j}/(\lambda_i + w_i)$, with $\lambda_i$ a constant
Relationships:
• $\text{EI}(i) \propto \text{PHP}(i)$
• $\text{DHT}(i) \propto \text{PHP}(i)$
• $\text{RWR}(i) \propto w_i \cdot \text{PHP}(i)$
• $\text{RT}(i) \propto w_i^{\beta} \cdot \text{PHP}(i)$, $\beta \in [0,1]$
• $\text{KZ}(i) \propto \text{PHP}'(i)$
• $\text{AP}(i) \propto \lambda_i \cdot \text{PHP}''(i)$
PHP: Penalized Hitting Probability; RWR: Random Walk with Restart; EI: Effective Importance; DHT: Discounted Hitting Time; RT: RoundTripRank; KZ: Katz score; AP: Absorption Probability

Extension to RoundTripRank
i) How important? Reaching $v$ from $q$
ii) How specific? Returning to $q$ from $v$
RoundTripRank: $\text{RT}(i) \propto \text{RWR}(i)^{\beta} \cdot \text{EI}(i)^{1-\beta}$, $\beta \in [0,1]$
With $\text{EI}(i) \propto \text{PHP}(i)$ and $\text{RWR}(i) \propto w_i \cdot \text{PHP}(i)$, this gives $\text{RT}(i) \propto w_i^{\beta} \cdot \text{PHP}(i)$.
i) Forward direction (RWR):
$x_i = c \sum_{j \in N_i} p_{j,i} x_j + (1 - c)$ if $i = q$; $x_i = c \sum_{j \in N_i} p_{j,i} x_j$ if $i \neq q$
ii) Backward direction (EI):
$y_i = c \sum_{j \in N_i} p_{i,j} y_j + (1 - c)$ if $i = q$; $y_i = c \sum_{j \in N_i} p_{i,j} y_j$ if $i \neq q$
[1] Yuan Fang, et al. RoundTripRank: Graph-based proximity with importance and specificity. ICDE, 2013.

Extension to Katz score
Let $A$ be the adjacency matrix, $K$ be the proximity matrix, and $K_{i,j}$ be the KZ proximity value between nodes $i$ and $j$:
$K = \sigma A + \sigma^2 A^2 + \sigma^3 A^3 + \cdots$
$[A^2]_{i,j}$: # of paths of length 2; $\sigma$ ($0 < \sigma < 1$): decay factor
Recursive equations for KZ:
$r_i = c \sum_{j \in N_i} p_{i,j} r_j + \frac{1}{w_{\max}}$ if $i = q$; $r_i = c \sum_{j \in N_i} p_{i,j} r_j$ if $i \neq q$
where $c = \sigma \cdot w_{\max}$ and $p_{i,j} = w_{i,j}/w_{\max}$
Recursive equations for PHP':
$r_i = 1$ if $i = q$; $r_i = c \sum_{j \in N_i} p_{i,j} r_j$ if $i \neq q$
where $p_{i,j} = w_{i,j}/w_{\max}$ for PHP' (for PHP, $p_{i,j} = w_{i,j}/w_i$)
$\text{KZ}(i) \propto \text{PHP}'(i)$
[1] Leo Katz. A new status index derived from sociometric analysis. Psychometrika 18.1 (1953): 39-43.

Extension to Absorption Probability
[Figure: example graph (nodes 1 to 8) shown for PHP, AP, and PHP'', with self-loop transition probabilities added for AP]
PHP (original graph): $p_{i,j} = w_{i,j}/w_i$
AP and PHP'': $p'_{i,j} = w_{i,j}/(\lambda_i + w_i)$; self-loop transition probability $\lambda_i/(\lambda_i + w_i)$
Recursive equation for AP:
$r_i = \sum_{j \in N_i} p'_{i,j} r_j + \frac{\lambda_i}{\lambda_i + w_i}$ if $i = q$; $r_i = \sum_{j \in N_i} p'_{i,j} r_j$ if $i \neq q$
Recursive equation for PHP'':
$r_i = 1$ if $i = q$; $r_i = \sum_{j \in N_i} p'_{i,j} r_j$ if $i \neq q$
$\text{AP}(i) \propto \lambda_i \cdot \text{PHP}''(i)$
[1] Xiao-Ming Wu, et al. Learning with partially absorbing random walks. NIPS, 2012.

Reverse-Proximity Query Problem
Proximity matrix $R$:
• Proximity value of node $i$: $R_{q,i}$, the proximity value of node $i$ w.r.t. $q$
• Reverse-proximity value of node $i$: $R_{i,q}$, the proximity value of node $q$ w.r.t. $i$
Naïve solution: query each node, $O(Im)$ each; total running time $O(Imn)$
• Andras A. Benczur, et al. SpamRank - Fully Automatic Link Spam Detection Work in progress. In AIRWeb, 2005.
• Yuan Fang, et al. RoundTripRank: Graph-based proximity with importance and specificity. In ICDE, 2013.

Extension to Reverse-Proximity Measures
• $\text{rEI}(i) \propto \text{PHP}(i)$
• $\text{rRWR}(i) \propto \text{PHP}(i)$
• $\text{rPHP}(i) \propto \text{PHP}(i)/\text{EI}_i(i)$
• $\text{rDHT}(i) \propto \text{PHP}(i)/\text{EI}_i(i)$
• $\text{rRT}(i) \propto w_i^{1-\beta} \cdot \text{PHP}(i)$, $\beta \in [0,1]$
• $\text{rKZ}(i) \propto \text{PHP}'(i)$
• $\text{rAP}(i) \propto \text{PHP}''(i)$
$\text{EI}_i(i)$ is pre-computed.
Transition probabilities: PHP: $p_{i,j} = w_{i,j}/w_i$; PHP': $p_{i,j} = w_{i,j}/w_{\max}$; PHP'': $p_{i,j} = w_{i,j}/(\lambda_i + w_i)$, with $\lambda_i$ a constant

Experimental Results: Running Time
[Figure panels: RT; KZ; AP; rPHP]
Yubao Wu, Ruoming Jin, Xiang Zhang. Unified and Exact Local Search for Random Walk Based Top-K Proximity Query in Large Graphs. Submitted to VLDBJ, 2015.
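
Since most of the measures in this part reduce to PHP or one of its variants, a small sketch of the PHP recursion may help. The code below solves $r_q = 1$, $r_i = c \sum_{j \in N_i} p_{i,j} r_j$ by fixed-point iteration over the whole graph (not the local search of the talk), with $p_{i,j} = w_{i,j}/w_i$ for PHP and $p_{i,j} = w_{i,j}/w_{\max}$ for PHP', and then checks the no-local-maximum property. The function names, the `variant` switch, the default `c = 0.5`, and the toy graph are assumptions made for illustration.

```python
from typing import Dict, Hashable

Node = Hashable
Graph = Dict[Node, Dict[Node, float]]  # node -> {neighbor: edge weight}


def php_iteration(graph: Graph, query: Node, c: float = 0.5,
                  variant: str = "PHP", tol: float = 1e-12,
                  max_iter: int = 10000) -> Dict[Node, float]:
    """Fixed-point iteration for PHP (variant="PHP") or PHP' (variant="PHP'")."""
    w = {u: sum(nbrs.values()) for u, nbrs in graph.items()}
    w_max = max(w.values())
    scores = {u: 0.0 for u in graph}
    scores[query] = 1.0
    for _ in range(max_iter):
        nxt = {}
        for i, nbrs in graph.items():
            if i == query:
                nxt[i] = 1.0  # the query node is pinned to 1
            else:
                norm = w[i] if variant == "PHP" else w_max
                nxt[i] = c * sum(w_ij / norm * scores[j] for j, w_ij in nbrs.items())
        delta = max(abs(nxt[u] - scores[u]) for u in graph)
        scores = nxt
        if delta < tol:
            break
    return scores


def has_no_local_maximum(graph: Graph, scores: Dict[Node, float], query: Node) -> bool:
    """True if every non-query node has a neighbor with a strictly higher score."""
    return all(any(scores[j] > scores[i] for j in graph[i])
               for i in graph if i != query)


def build_graph(edges) -> Graph:
    """Build an undirected, unit-weight adjacency dictionary from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, {})[v] = 1.0
        graph.setdefault(v, {})[u] = 1.0
    return graph


# Hypothetical toy graph: two triangles joined by the edge (3, 4).
toy_graph = build_graph([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)])
php_scores = php_iteration(toy_graph, query=1)
php_prime_scores = php_iteration(toy_graph, query=1, variant="PHP'")
```

Because each non-query value is at most `c` times the largest neighboring value, some neighbor is always strictly closer to the query, which is what lets a local search bound the unvisited nodes from the boundary.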

Experimental Results: # of Visited Nodes
[Figure panels: LJ; YT]
All are local search methods.

Dual Biological Networks
(a) Protein interaction network. Edge: physical binding interaction
(b) Genetic interaction network. Edge: conceptual statistical interaction ($\chi^2$ test score)

Other Dual Networks
Dual networks | Physical network | Conceptual network
Dual social networks | Social network | Interest similarity network
Dual co-author networks | Co-author network | Research interest similarity network

Densest Connected Subgraph
Physical network: $G_a(V, E_a)$
Conceptual network: $G_b(V, E_b)$, with density $|E(S)|/|S|$
The Densest Connected Subgraph (DCS) Problem: Given dual networks $G(V, E_a, E_b)$, find $S \subseteq V$ such that:
(a) $G_a[S]$ is connected;
(b) the density of $G_b[S]$ is maximized.
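
The two conditions of the DCS problem can be checked for a given candidate set with a few lines of code. The sketch below only evaluates a candidate $S$ (it does not search for the optimum): it tests connectivity of the induced physical subgraph with a breadth-first search and computes the density of the induced conceptual subgraph. All function names and the toy edge lists are hypothetical.

```python
from collections import deque
from typing import Hashable, Iterable, Optional, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def is_connected(nodes: Set[Node], edges: Iterable[Edge]) -> bool:
    """Breadth-first search restricted to the subgraph induced by `nodes`."""
    if not nodes:
        return False
    adjacency = {u: set() for u in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adjacency[u].add(v)
            adjacency[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def density(nodes: Set[Node], edges: Iterable[Edge]) -> float:
    """Number of induced edges divided by the number of nodes."""
    return sum(1 for u, v in edges if u in nodes and v in nodes) / len(nodes)


def dcs_objective(nodes: Set[Node], physical_edges: Iterable[Edge],
                  conceptual_edges: Iterable[Edge]) -> Optional[float]:
    """Conceptual density of S, or None if S is not connected in the physical network."""
    if not is_connected(nodes, physical_edges):
        return None
    return density(nodes, conceptual_edges)


# Hypothetical toy dual network on nodes 1..4.
toy_physical = [(1, 2), (2, 3), (3, 4)]            # a path
toy_conceptual = [(1, 2), (2, 3), (1, 3), (1, 4)]  # a triangle plus one edge
```

Here `{1, 2, 3}` is feasible with conceptual density 1.0, while `{1, 3}` is rejected because nodes 1 and 3 are not adjacent in the physical network.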

DCS in Dual Networks
Dual networks | DCS | Connectivity | Density
Dual biological | Disease pathway | Signal transduction | Statistical association
Dual social | Consumer group (advertising) | Word-of-mouth | Consumer interest
Dual co-author | Research group | Collaboration pattern | Similar research interest
Theorem: The DCS problem is NP-hard.

DCS_RDS Algorithm
Step 1. Find the densest subgraph in the conceptual network.
Step 2. Make it connected in the physical network.

DCS_GND Algorithm
At each iteration, it deletes a set of nodes that:
1) have low degree in the conceptual network;
2) can be removed while keeping the physical network connected.

Experiments
Three kinds of dual networks:
Dual networks | Physical network | Conceptual network | Data sources
Biological | Protein interaction | Genetic interaction | BioGrid; WTCCC
Co-author | Co-author network | Research interest similarity | DBLP
Social | Social network | Interest similarity | Flixster / Epinions
Statistics of the dual networks:
Dual networks | Abbr. | #nodes | #edges in $G_a$ | #edges in $G_b$
Protein-Genetic | Bio | 8,468 | 25,715 | 67,744
Research-DM | DM | 7,169 | 14,526 | 30,000
Research-DB | DB | 6,131 | 17,940 | 30,000
Recom-Epinions | EP | 49,288 | 487,002 | 313,432
Recom-Flixster | FX | 786,936 | 7,058,819 | 2,713,671

DCS_$k$ from dual biological networks ($k$ = 40, WTCCC)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
MYO6, CUBN, and STK39 have been reported to be associated with hypertension.

DCS_seed from dual biological networks (WTCCC)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
Renin pathway genes are in red ellipses. NEDD4L has been reported to be associated with hypertension.

Significance evaluation
Methods | GenGen | GRASS | Plink | HYST
DCS | 2.4 × 10^-6 | 1.0 × 10^-6 | 2.3 × 10^-6 | 1.1 × 10^-6
DCS_$k$ | 5.6 × 10^-6 | 1.3 × 10^-6 | 4.6 × 10^-6 | 3.7 × 10^-6
DCS_seed | 8.5 × 10^-6 | 4.9 × 10^-6 | 1.5 × 10^-6 | 2.5 × 10^-6
DS | 0.36 | 0.47 | 0.33 | 0.17
MSCS | 0.15 | 0.13 | 0.21 | 0.12
DCSs are identified from the WTCCC dataset and tested on the ARIC dataset.
DS: densest subgraph in protein interaction network; MSCS: maximum score connected subgraph

Dual Co-Author Networks
[Figure panels: (a) subgraph in co-author network; (b) subgraph in research interest similarity network]
DCS_$k$ ($k$ = 30, data mining research community)
DCS_seed (database research community)

Dual Biological Networks: Node Weights
(a) Protein interaction network. Edge: physical binding interaction
(b) Genetic interaction network. Edge: conceptual statistical interaction
Edge weights: two-locus test statistics
Gene (marker 1) | Marker ID | Gene (marker 2) | Marker ID | $\chi^2$ test score
SLC12A3 | rs11076172 | ABCG5 | rs10495909 | 3.29
SLC12A3 | rs11076172 | THBS2 | rs9294977 | 2.89
SLC12A3 | rs11076172 | HDAC2 | rs13194921 | 2.43
Node weights: single-locus test statistics
Gene | Marker ID | $\chi^2$ test score
ESR1 | rs11155820 | 1.18
SLC12A3 | rs11076172 | 1.01
ABCG5 | rs10495909 | 0.96

Extension: Node Weights
Physical network: $G_a(V, E_a)$
Conceptual network: $G_b(V, E_b)$
• Edge weighted: density $e(S)/|S|$
• Node/edge weighted: density $(e(S) + \pi(S))/|S|$
$w_{i,j}$: edge weight of edge $(i, j)$; $\pi(i)$: node weight of node $i$
$e(S) = \sum_{i,j \in S} w_{i,j}$, $\pi(S) = \sum_{i \in S} \pi(i)$
Yubao Wu, Xiaofeng Zhu, Li Li, Wei Fan, Ruoming Jin, Xiang Zhang. Mining Dual Networks: Models, Algorithms and Applications. TKDD, 2015.

Densest Connected Subgraph: Node Weights
The Densest Connected Subgraph (DCS) Problem: Given dual networks $G(V, E_a, E_b)$ with a node/edge weighted conceptual network, find $S \subseteq V$ such that:
(a) $G_a[S]$ is connected;
(b) the density $(e(S) + \pi(S))/|S|$ of $G_b[S]$ is maximized.

DCS_RDS Algorithm
Step 1. Find the densest subgraph in the conceptual network (with node weights).
Step 2. Refine the densest subgraph in the physical network.
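
One standard way to approximate the densest-subgraph step under the node/edge weighted density $(e(S) + \pi(S))/|S|$ is greedy node deletion (peeling): repeatedly remove the node with the smallest weighted degree plus node weight and remember the best intermediate set. The sketch below illustrates that generic heuristic on the conceptual network only; it ignores the physical-network connectivity refinement and is not claimed to be the exact procedure of the talk. The function name and the toy data are hypothetical.

```python
from typing import Dict, Hashable, Set, Tuple

Node = Hashable


def greedy_peeling(edge_weights: Dict[Tuple[Node, Node], float],
                   node_weights: Dict[Node, float]) -> Tuple[Set[Node], float]:
    """Return the best set seen while peeling, under (e(S) + pi(S)) / |S|."""
    adjacency: Dict[Node, Dict[Node, float]] = {u: {} for u in node_weights}
    for (u, v), w in edge_weights.items():
        adjacency[u][v] = w
        adjacency[v][u] = w
    # Weighted degree of each node inside the current set.
    degree = {u: sum(adjacency[u].values()) for u in node_weights}
    total_edge = float(sum(edge_weights.values()))
    total_node = float(sum(node_weights.values()))
    current = set(node_weights)
    best_set = set(current)
    best_density = (total_edge + total_node) / len(current)
    while len(current) > 1:
        # Delete the node with the smallest degree plus node weight.
        u = min(current, key=lambda x: degree[x] + node_weights[x])
        current.remove(u)
        total_edge -= degree[u]
        total_node -= node_weights[u]
        for v, w in adjacency[u].items():
            if v in current:
                degree[v] -= w
        value = (total_edge + total_node) / len(current)
        if value > best_density:
            best_density, best_set = value, set(current)
    return best_set, best_density


# Hypothetical toy conceptual network: a 4-clique on {1,2,3,4} with a tail 1 - 5 - 6.
toy_edge_weights = {(1, 2): 1.0, (1, 3): 1.0, (1, 4): 1.0, (2, 3): 1.0,
                    (2, 4): 1.0, (3, 4): 1.0, (1, 5): 1.0, (5, 6): 1.0}
zero_weights = {u: 0.0 for u in range(1, 7)}
half_weights = {u: 0.5 for u in range(1, 7)}
```

With zero node weights the heuristic peels the tail and returns the 4-clique with density 1.5; adding a uniform node weight of 0.5 shifts every density by 0.5 and leaves the chosen set unchanged.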

Extension of Three Algorithms
 | Without node weights | With node weights
Density | $e(S)/|S|$ | $(e(S) + \pi(S))/|S|$
Greedy node deletion | 2-approximation | 2-approximation
Removing low degree nodes | d-core retains the densest subgraph | d-core retains the densest subgraph
Parametric maximum flow | Exact densest subgraph | Exact densest subgraph
where $e(S) = \sum_{i,j \in S} w_{i,j}$ and $\pi(S) = \sum_{i \in S} \pi(i)$

DCS_GND Algorithm
[Figure: physical and conceptual networks before and after one deletion step, marking articulation nodes and low degree nodes]
Compute the density, then delete the low degree non-articulation nodes.
Node degree:
• Without node weights: $w(u)$, the node degree
• With node weights: $w'(u) = w(u) + \pi(u)$, where $\pi(u)$ is the node weight

DCS_$k$ from dual biological networks ($k$ = 40, WTCCC, node weights)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
MYO6, CUBN, and STK39 have been reported to be associated with hypertension.

Significance evaluation
Setting | Methods | GenGen | GRASS | Plink | HYST
Without node weights | DCS | 7.2 × 10^-6 | 8.2 × 10^-6 | 7.6 × 10^-6 | 4.8 × 10^-8
Without node weights | DCS_$k$ | 8.9 × 10^-5 | 1.6 × 10^-5 | 2.2 × 10^-5 | 4.5 × 10^-7
Without node weights | DCS_seed | 6.3 × 10^-4 | 2.1 × 10^-5 | 9.3 × 10^-5 | 1.8 × 10^-5
With node weights | DCS | 5.8 × 10^-6 | 4.6 × 10^-6 | 6.7 × 10^-6 | 1.4 × 10^-6
With node weights | DCS_$k$ | 8.2 × 10^-5 | 8.7 × 10^-6 | 9.4 × 10^-6 | 1.5 × 10^-7
With node weights | DCS_seed | 4.1 × 10^-4 | 7.7 × 10^-6 | 7.4 × 10^-5 | 9.1 × 10^-6
DCSs are identified from the WTCCC dataset and tested on the ARIC dataset.

Thank You!
• Robust Local Community Detection
• Top-$k$ Query; Fast Local Search
• Dual Networks; Densest Connected Subgraph