Dual networks

Efficient and Effective Local Algorithms
for Analyzing Massive Graphs
Yubao Wu
Case Western Reserve University
August 28, 2015
Graphs are Everywhere
• Social networking websites
• Biological networks
• Research collaboration networks
• Citation networks
• Product co-purchasing networks
• Internet peer-to-peer networks
• Road networks
• ⋯
Primitive Tasks
Tasks: global vs. local (query biased)
• Community detection. Global: graph partitioning. Local: local community detection.
• Ranking. Global: PageRank. Local: top-$k$ query; random walk with restart.
• Densest subgraph. Global: global densest subgraph. Local: local dense subgraph near the query node.
Applications:
• Recommendation
• Disease gene discovery
• Advertisement
• Disease pathway discovery
Contributions
Tasks, limitations of existing works, and our contributions:
• Local community detection. Limitation: free rider effect. Our contribution: query biased densest subgraph.
• Top-$k$ proximity query. Limitations: global methods are expensive; local methods are approximate and specific. Our contribution: simple, unified and exact local search.
• Densest subgraph detection. Limitations: single network; co-dense. Our contribution: dual networks; densest connected subgraph.
• Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local community detection: on
free rider effect and its elimination. PVLDB, 2015.
• Yubao Wu, Ruoming Jin, Xiang Zhang. Fast and unified local search for random walk based
$k$-nearest-neighbor query in large graphs. SIGMOD, 2014.
• Yubao Wu, Ruoming Jin, Xiaofeng Zhu, and Xiang Zhang. Finding dense and connected
subgraphs in dual networks. ICDE, 2015.
Generic Local Community Detection Problem
Input:
a) Graph $G(V, E)$
b) A set of query nodes $Q$
c) A goodness metric $f(S)$
Output: Subgraph $G[S]$ such that:
1) $S$ contains $Q$ ($Q \subseteq S$)
2) $f(S)$ is maximized
[1] M. Sozio, et al. KDD’10.
[2] W. Cui, et al. SIGMOD’14.
[3] L. Ma, et al. DaWak’13.
[4] B. Saha, et al. RECOMB’10.
[5] C. Tsourakakis, et al. SIGMOD’14.
[6] A. Clauset, PRE’05.
[7] F. Luo, et al. WIAS’08.
[8] R. Andersen, et al. FOCS’06.
Community Goodness Metrics
Goodness metrics, grouped by intuition, with formulas $f(S)$:
Internal denseness
• Classic density [1]: $f(S) = \frac{e(S)}{|S|}$
• Edge-surplus [2]: $f(S) = e(S) - \alpha h(|S|)$, concave $h(x)$, with $h(x) = \binom{x}{2}$
• Minimum degree [3,4]: $f(S) = \min_{u \in S} w_S(u)$
Internal denseness & external sparseness
• Subgraph modularity [5]: $f(S) = \frac{e(S)}{e(S, \bar{S})}$
• Density-isolation [6]: $f(S) = e(S) - \alpha\, e(S, \bar{S}) - \beta |S|$
• External conductance [7]: $f(S) = \frac{e(S, \bar{S})}{\min\{\phi(S), \phi(\bar{S})\}}$
Boundary sharpness
• Local modularity [8]: $f(S) = \frac{e(\delta S, S)}{e(\delta S, V)}$
[1] B. Saha, et al. RECOMB’10.
[2] C. Tsourakakis, et al. SIGMOD’14.
[3] M. Sozio, et al. KDD’10.
[4] W. Cui, et al. SIGMOD’14.
[5] F. Luo, et al. WIAS’08.
[6] K. J. Lang, CIKM’07.
[7] R. Andersen, et al. FOCS’06.
[8] A. Clauset, PRE’05.
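As a rough illustration of the goodness metrics on this slide, the following Python sketch evaluates classic density, minimum degree, and subgraph modularity on a small unweighted graph. The toy graph, the function names, and the use of unit edge weights are assumptions made for this example and do not come from the slides.

```python
# Toy undirected graph (an assumption for illustration):
# a 4-clique {0, 1, 2, 3} attached to a path 3-4-5.
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def internal_edges(edges, s):
    """e(S): number of edges with both endpoints in S."""
    return sum(1 for u, v in edges if u in s and v in s)


def cut_edges(edges, s):
    """e(S, S-bar): number of edges with exactly one endpoint in S."""
    return sum(1 for u, v in edges if (u in s) != (v in s))


def classic_density(edges, s):
    """Classic density e(S) / |S|."""
    return internal_edges(edges, s) / len(s)


def minimum_degree(edges, s):
    """min over u in S of w_S(u), the degree of u inside G[S]."""
    deg = {u: 0 for u in s}
    for u, v in edges:
        if u in s and v in s:
            deg[u] += 1
            deg[v] += 1
    return min(deg.values())


def subgraph_modularity(edges, s):
    """e(S) / e(S, S-bar); undefined when S has no outgoing edge."""
    return internal_edges(edges, s) / cut_edges(edges, s)


S = {0, 1, 2, 3}
print(classic_density(EDGES, S))      # 6 internal edges over 4 nodes
print(minimum_degree(EDGES, S))       # every clique node has 3 neighbors in S
print(subgraph_modularity(EDGES, S))  # 6 internal edges over 1 cut edge
```

All three functions only look at edges touching S, which is what makes such metrics usable in a local, query biased search.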
Free Rider Effect
Goodness metrics       A       A∪B     A∪C
Classic density        2.50    2.95    2.83
Edge-surplus           15.3    26.5    22.8
Minimum degree         4       4       4
Subgraph modularity    2.0     3.6     4.6
Density-isolation      -2.6    3.8     1.5
Ext. conductance       0.25    0.14    0.11
Local modularity       0.63    0.70    0.78
Query Biased Node Weighting
Node weight: $\pi(u) = \frac{1}{r(u)}$, where $r(u)$ is the proximity value of $u$ w.r.t. the query
Query biased density: $\rho(S) = \frac{e(S)}{\pi(S)}$, where $\pi(S) = \sum_{u \in S} \pi(u)$ is the sum of node weights
Subgraph A becomes the query biased densest subgraph
Yubao Wu, Ruoming Jin, Jing Li, and Xiang Zhang. Robust local
community detection: on free rider effect and its elimination.
PVLDB, 8(7):798-809, 2015.
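The effect of query biased node weighting can be sketched in a few lines of Python. The toy graph (two triangles A and B joined through a shared hub structure) and the proximity values are invented for illustration; only the formulas $\pi(u) = 1/r(u)$ and $\rho(S) = e(S)/\pi(S)$ come from the slide.

```python
# Two triangles A = {0, 1, 2} and B = {3, 4, 5}; B is attached to A by
# two edges so that merging B raises the classic density (assumed toy data).
EDGES = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3), (1, 3)]
A = {0, 1, 2}
B = {3, 4, 5}

# Hypothetical proximity values r(u) w.r.t. a query inside A:
# nodes in A are close to the query, nodes in B are far away.
PROXIMITY = {0: 1.0, 1: 1.0, 2: 1.0, 3: 0.1, 4: 0.1, 5: 0.1}


def internal_edges(edges, s):
    """e(S): number of edges with both endpoints in S."""
    return sum(1 for u, v in edges if u in s and v in s)


def classic_density(edges, s):
    """e(S) / |S|."""
    return internal_edges(edges, s) / len(s)


def query_biased_density(edges, proximity, s):
    """rho(S) = e(S) / pi(S), with node weight pi(u) = 1 / r(u)."""
    pi_s = sum(1.0 / proximity[u] for u in s)
    return internal_edges(edges, s) / pi_s


print(classic_density(EDGES, A), classic_density(EDGES, A | B))
print(query_biased_density(EDGES, PROXIMITY, A),
      query_biased_density(EDGES, PROXIMITY, A | B))
```

Under the classic density, A∪B scores higher than A, so B rides along for free. Under the query biased density, the far-away nodes in B carry large weights $\pi(u)$, and A alone scores higher.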
Query Biased Densest Connected Subgraph
(QDC) Problem and Two Related Problems
• QDC. Input: 1) $G(V, E)$; 2) query $Q$. Output $G[S]$: 1) $S$ contains $Q$; 2) $\rho(S)$ is maximized; 3) $G[S]$ is connected. Complexity: NP-hard.
• QDC'. Input: 1) $G(V, E)$; 2) query $Q$. Output $G[S]$: 1) $S$ contains $Q$; 2) $\rho(S)$ is maximized. Complexity: polynomial.
• QDC''. Input: $G(V, E)$. Output $G[S]$: $\rho(S)$ is maximized. Complexity: polynomial.
Relationships:
• A QDC'' solution is optimal for QDC' if $S$ contains $Q$.
• A QDC' solution is optimal for QDC if $G[S]$ is connected.
Experiments: Datasets
Dataset        # Nodes       # Edges          # Communities
Amazon         334,863       925,872          151,037
DBLP           317,080       1,049,866        13,477
Youtube        1,134,890     2,987,624        8,385
Orkut          3,072,441     117,185,083      6,288,363
LiveJournal    3,997,962     34,681,189       287,512
Friendster     65,608,366    1,806,067,135    957,154
[1] J. Yang and J. Leskovec. Defining and evaluating network
communities based on ground-truth. In ICDM, 2012.
[2] snap.stanford.edu
Experiments: State-of-the-Art Methods
Classes                                     Abbr.   Ref.   Key idea
Internal denseness                          DS      [1]    Densest subgraph with query constraint
Internal denseness                          OQC     [2]    Optimal quasi-clique; edge-surplus
Internal denseness                          MDG     [3]    Minimum degree
Internal denseness & external sparseness    PRN     [4]    External conductance
Internal denseness & external sparseness    LS      [5]    Local spectral
Internal denseness & external sparseness    EMC     [6]    More internal edges than external edges
Internal denseness & external sparseness    SM      [7]    Subgraph modularity
Boundary sharpness                          LM      [8]    Local modularity
[1] B. Saha, et al. RECOMB’10.
[2] C. Tsourakakis, et al. SIGMOD’14.
[3] M. Sozio, et al. KDD’10.
[4] R. Andersen, et al. FOCS’06.
[5] M. W. Mahoney, et al. JMLR’12.
[6] G. W. Flake, KDD’00.
[7] F. Luo, et al. WIAS’08.
[8] A. Clauset, PRE’05.
Experiments: Effectiveness Evaluation Metrics
Metrics and formulas
• F-score: $F(S, T) = 2 \cdot \frac{\mathrm{precision}(S, T) \cdot \mathrm{recall}(S, T)}{\mathrm{precision}(S, T) + \mathrm{recall}(S, T)}$
Community goodness metrics:
• Density: $\frac{e(S)}{|S|}$
• Cohesiveness: $\min_{S' \subset S} \frac{e(S', S \setminus S')}{\min\{\phi_S(S'), \phi_S(S \setminus S')\}}$
• Separability: $\frac{e(S)}{e(S, \bar{S})}$
• Consistency: $1 - \frac{1}{\binom{|S|}{|Q|}} \sum_{Q' \subseteq S,\, |Q'| = |Q|} \left(F(S, S') - F_{\mathrm{mean}}\right)^2$
[1] J. Yang and J. Leskovec. Defining and evaluating network communities
based on ground-truth. In ICDM, pages 745-754, 2012.
[2] Ma, Lianhang, et al. GMAC: A seed-insensitive approach to local
community detection. In DaWak, pages 297-308, 2013.
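The F-score formula can be made concrete with a short Python sketch. It assumes the usual set-based definitions, precision $= |S \cap T| / |S|$ and recall $= |S \cap T| / |T|$, for a detected community $S$ and a ground-truth community $T$; the slide itself does not spell these out, and the example sets are invented.

```python
def f_score(found, truth):
    """F(S, T) = 2 * precision * recall / (precision + recall)."""
    found, truth = set(found), set(truth)
    overlap = len(found & truth)
    if overlap == 0:
        return 0.0  # avoid dividing by zero when the sets are disjoint
    precision = overlap / len(found)  # fraction of S that is correct
    recall = overlap / len(truth)     # fraction of T that was recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical detected community and ground-truth community.
detected = {1, 2, 3, 4}
ground_truth = {2, 3, 4, 5, 6}
print(f_score(detected, ground_truth))
```

Here three of the four detected nodes are correct and three of the five true members are recovered, so precision is 0.75 and recall is 0.6.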
Effectiveness Evaluation: F-Score
F-score          QDC    DS     OQC    MDG    PRN    LS     EMC    SM     LM
Amazon           0.83   0.52   0.54   0.46   0.69   0.66   0.61   0.60   0.58
DBLP             0.46   0.31   0.33   0.32   0.48   0.42   0.34   0.36   0.37
Youtube          0.43   0.23   0.22   0.17   0.26   0.24   0.21   0.21   0.22
Orkut            0.47   0.15   0.16   0.13   0.21   0.17   0.19   0.16   0.18
LiveJournal      0.64   0.48   0.47   0.40   0.52   0.51   0.47   0.48   0.49
Friendster       0.32   --     0.14   0.12   0.17   0.16   --     0.14   0.13
Avg. F-score     0.53   0.3    0.31   0.27   0.39   0.36   0.33   0.33   0.33
Avg. Precision   0.65   0.46   0.45   0.29   0.51   0.41   0.34   0.38   0.48
Avg. Recall      0.78   0.61   0.58   0.69   0.67   0.64   0.66   0.63   0.59
Effectiveness Evaluation——Goodness Metrics
Community goodness metrics on LiveJournal graph
Effectiveness Evaluation: Consistency
Consistency    QDC    DS     OQC    MDG    PRN    LS     EMC    SM     LM
Amazon         0.94   0.77   0.76   0.58   0.79   0.69   0.74   0.67   0.61
DBLP           0.88   0.62   0.64   0.37   0.65   0.53   0.56   0.43   0.56
Youtube        0.85   0.61   0.54   0.46   0.71   0.41   0.57   0.37   0.36
Orkut          0.83   0.56   0.52   0.32   0.68   0.43   0.51   0.54   0.47
LiveJournal    0.93   0.74   0.67   0.43   0.84   0.64   0.73   0.58   0.52
Friendster     0.78   --     0.56   0.45   0.65   0.49   --     0.32   0.39
Average        0.87   0.64   0.62   0.44   0.72   0.53   0.61   0.49   0.49
Top-$k$ Proximity Query in Graphs
• Which nodes are most similar to the query node?
Random walk based proximity measures:
1) Random walk with restart
2) Hitting time
3) Commute time
Yubao Wu, Ruoming Jin, Xiang Zhang. Fast and unified local search
for random walk based $k$-nearest-neighbor query in large graphs.
SIGMOD, 2014.
Computational Methods for Top-$k$ Query
Methods                   Key idea               Precomputation?   Applicability
Global iteration (GI)     Power method           No                Wide
Matrix based method [2]   Matrix decomposition   Yes               RWR
Graph embedding [3]       Graph embedding        Yes               RWR / HT / CT
Disadvantages:
• Iterating over the entire graph
• Pre-computing step is expensive
Challenge: An efficient local search method?
• Guarantees the exactness
• Applies to different measures
[1] Y. Fujiwara, et al. SIGMOD’13
[2] Tong’ICDM’06; Fujiwara’KDD’12; Fujiwara’VLDB’12
[3] X. Zhao, et al. VLDB’13
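To make the global iteration baseline concrete, here is a minimal Python sketch of the power method for random walk with restart. It assumes the common recurrence $x_i = c \sum_{j} p_{j,i} x_j + (1 - c)\,[i = q]$ with $p_{j,i} = w_{j,i} / w_j$; the toy graph, the function names, and the value $c = 0.5$ are illustrative assumptions, not taken from the slides.

```python
def rwr_power_iteration(adj, q, c=0.5, tol=1e-12, max_iter=10_000):
    """Random walk with restart scores for query q on an undirected graph.

    adj maps each node to a dict {neighbor: edge weight}.
    c is the probability of following an edge; 1 - c is the restart probability.
    """
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    x = {u: 0.0 for u in adj}
    x[q] = 1.0
    for _ in range(max_iter):
        new_x = {u: 0.0 for u in adj}
        for j, nbrs in adj.items():
            for i, w in nbrs.items():
                # p_{j,i} = w_{j,i} / w_j: probability of stepping from j to i
                new_x[i] += c * (w / degree[j]) * x[j]
        new_x[q] += 1.0 - c  # restart mass returns to the query node
        diff = max(abs(new_x[u] - x[u]) for u in adj)
        x = new_x
        if diff < tol:
            break
    return x


def top_k(scores, q, k):
    """The k highest-scoring nodes other than the query itself."""
    others = [u for u in scores if u != q]
    return sorted(others, key=lambda u: -scores[u])[:k]


# Toy graph (assumed): a triangle 0-1-2 with a pendant node 3 attached to 2.
GRAPH = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
scores = rwr_power_iteration(GRAPH, q=0)
print(top_k(scores, q=0, k=2))
```

Every pass of the loop touches every edge of the graph, which is the cost that a local search method tries to avoid.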
Our Method: FLoS (Fast Local Search)
Contributions:
1) Exact top-$k$ nodes
2) General method (a variety of proximity measures)
3) Simple local search strategy
• no pre-processing
• no global iteration
No Local Maximum Property
[Figure: grid graph with a query node; panels: no local maximum, with local maximum]
Measures With and Without Local Maximum
Abbr.   Proximity measure                               Local maximum?
HT      Hitting time                                    No
DHT     Discounted hitting time                         No
THT     Truncated hitting time                          No
PHP     Penalized hitting probability                   No
EI      Effective importance (degree normalized RWR)    No
RWR     Random walk with restart                        Yes
CT      Commute time                                    Yes
Bounding the Unvisited Nodes
[Figure: grid graph with a query node, showing visited nodes, unvisited nodes, and the boundary between them; panels: no local maximum, with local maximum]
Bounding the Visited Nodes
[Figure: upper bound, exact proximity value, and lower bound of the visited nodes; legend: query, visited node, unvisited node]
Bounding the Visited Nodes: Monotonicity
[Figure: upper bound, exact proximity value, and lower bound of the visited nodes; legend: query, visited node, unvisited node]
Running Example
[Figure panels: toy graph with the query node; top-2 nodes; trend of the bounds]
Iteration   Newly visited nodes
1           {2,3}
2           {4}
3           {5}
4           {6,7}
5           {8}
Relationships Among Proximity Measures
• Penalized hitting probability
• Effective importance
• Discounted hitting time
Theorem: PHP, EI, and DHT give the same ranking results.
• Random walk with restart
Theorem: $\mathrm{RWR}(i) \propto \mathrm{degree}(i) \cdot \mathrm{PHP}(i)$
Note: RWR has a local maximum
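The proportionality between RWR and degree times PHP can be checked numerically on a small graph. The sketch below assumes the recursive definitions $\mathrm{PHP}(q) = 1$, $\mathrm{PHP}(i) = c \sum_{j \in N_i} p_{i,j}\, \mathrm{PHP}(j)$ for $i \neq q$, and $x_i = c \sum_{j \in N_i} p_{j,i} x_j + (1 - c)\,[i = q]$ for RWR, on an unweighted undirected graph. The toy graph, the function names, and $c = 0.5$ are assumptions for illustration.

```python
def php_scores(adj, q, c=0.5, iters=200):
    """Penalized hitting probability: 1 at the query, decayed average elsewhere."""
    r = {u: 0.0 for u in adj}
    r[q] = 1.0
    for _ in range(iters):
        r = {
            i: 1.0 if i == q else c * sum(r[j] for j in adj[i]) / len(adj[i])
            for i in adj
        }
    return r


def rwr_scores(adj, q, c=0.5, iters=200):
    """Random walk with restart by fixed-point iteration."""
    x = {u: 0.0 for u in adj}
    x[q] = 1.0
    for _ in range(iters):
        new_x = {u: 0.0 for u in adj}
        for j in adj:
            share = c * x[j] / len(adj[j])  # mass sent from j to each neighbor
            for i in adj[j]:
                new_x[i] += share
        new_x[q] += 1.0 - c
        x = new_x
    return x


# Toy unweighted graph (assumed): triangle 0-1-2 with a tail 2-3-4.
GRAPH = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

php = php_scores(GRAPH, q=0)
rwr = rwr_scores(GRAPH, q=0)
# If RWR(i) is proportional to degree(i) * PHP(i), this ratio is the same for all i.
ratios = {i: rwr[i] / (len(GRAPH[i]) * php[i]) for i in GRAPH}
print(ratios)
```

Because the ratio is constant, a ranking by PHP can be turned into a ranking by RWR by multiplying each score by the node degree.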
Experiments: Datasets
Type        Dataset         Abbr.   #nodes      #edges
Real        Amazon          AZ      334,863     925,872
Real        DBLP            DP      317,080     1,049,866
Real        Youtube         YT      1,134,890   2,987,624
Real        LiveJournal     LJ      3,997,962   34,681,189
Synthetic   In-memory       --      Varying size; varying density
Synthetic   Disk-resident   --      Varying size
Experiments: State-of-the-Art Methods
Our methods (exact): FLoS_PHP, FLoS_RWR
State-of-the-art methods:
Abbr.      Key idea           Ref.        Exactness
GI_PHP     Global iteration   --          Exact
DNE        Local search       CIKM’12     Approx.
NN_EI      Local search       CIKM’13     Exact
LS_EI      Local search       KDD’10      Approx.
GI_RWR     Global iteration   --          Exact
Castanet   Improved GI        SIGMOD’13   Exact
K-dash     Matrix inversion   VLDB’12     Exact
GE_RWR     Graph embedding    VLDB’13     Approx.
LS_RWR     Local search       KDD’10      Approx.
Experiments: PHP, Real Graphs
[Figure panels: running time (AZ); # visited nodes]
• 1-3 orders of magnitude faster
• A small portion of the nodes are visited
Experiments: RWR, Real Graphs
[Figure panels: running time (AZ); # visited nodes; plot annotation: "Have long precomputing time"]
• Fast
• A small portion of the nodes are visited
Extension to More Proximity Measures
Relationships to PHP and its variants:
• $\mathrm{EI}(i) \propto \mathrm{PHP}(i)$
• $\mathrm{DHT}(i) \propto \mathrm{PHP}(i)$
• $\mathrm{RWR}(i) \propto w_i \cdot \mathrm{PHP}(i)$
• $\mathrm{RT}(i) \propto w_i^{\beta} \cdot \mathrm{PHP}(i)$, $\beta \in [0,1]$
• $\mathrm{KZ}(i) \propto \mathrm{PHP}'(i)$
• $\mathrm{AP}(i) \propto \lambda_i \cdot \mathrm{PHP}''(i)$
Transition probabilities:
• PHP: $p_{i,j} = \frac{w_{i,j}}{w_i}$
• PHP': $p_{i,j} = \frac{w_{i,j}}{w_{\max}}$
• PHP'': $p_{i,j} = \frac{w_{i,j}}{\lambda_i + w_i}$, $\lambda_i$: constant
Abbreviations:
PHP: Penalized Hitting Probability
RWR: Random Walk with Restart
EI: Effective Importance
DHT: Discounted Hitting Time
RT: RoundTripRank
KZ: Katz score
AP: Absorption Probability
Extension to RoundTripRank
i) How important? Reaching $v$ from $q$
ii) How specific? Returning to $q$ from $v$
RoundTripRank: $\mathrm{RT}(i) \propto \mathrm{RWR}(i)^{\beta} \cdot \mathrm{EI}(i)^{1-\beta}$, $\beta \in [0,1]$
i) Forward direction (RWR):
$$\mathbf{x}_i = \begin{cases} c \sum_{j \in N_i} p_{j,i}\, \mathbf{x}_j + (1 - c) & \text{if } i = q \\ c \sum_{j \in N_i} p_{j,i}\, \mathbf{x}_j & \text{if } i \neq q \end{cases}$$
$\mathrm{RWR}(i) \propto w_i \cdot \mathrm{PHP}(i)$
ii) Backward direction (EI):
$$\mathbf{y}_i = \begin{cases} c \sum_{j \in N_i} p_{i,j}\, \mathbf{y}_j + (1 - c) & \text{if } i = q \\ c \sum_{j \in N_i} p_{i,j}\, \mathbf{y}_j & \text{if } i \neq q \end{cases}$$
$\mathrm{EI}(i) \propto \mathrm{PHP}(i)$
Combined: $\mathrm{RT}(i) \propto w_i^{\beta} \cdot \mathrm{PHP}(i)$
[1] Yuan Fang, et al. RoundTripRank: Graph-based proximity
with importance and specificity. ICDE, 2013.
RWR: Random Walk with Restart
EI: Effective Importance
RT: RoundTripRank
PHP: Penalized Hitting Probability
Extension to Katz score
Let $\mathbf{W}$ be the adjacency matrix, $\mathbf{P}$ be the proximity matrix,
$\mathbf{P}_{i,j}$ be the KZ proximity value between node $i$ and $j$
$\mathbf{P} = \kappa \mathbf{W} + \kappa^2 \mathbf{W}^2 + \kappa^3 \mathbf{W}^3 + \cdots$
$[\mathbf{W}^2]_{i,j}$: # of paths of length 2
$\kappa$ ($0 < \kappa < 1$): decay factor
Recursive equations for KZ:
$$\mathbf{r}_i = \begin{cases} c \sum_{j \in N_i} p_{i,j}\, \mathbf{r}_j + \frac{1}{w_{\max}} & \text{if } i = q \\ c \sum_{j \in N_i} p_{i,j}\, \mathbf{r}_j & \text{if } i \neq q \end{cases}$$
where $c = \kappa \cdot w_{\max}$ and $p_{i,j} = \frac{w_{i,j}}{w_{\max}}$
Recursive equations for PHP':
$$\mathbf{r}_i = \begin{cases} 1 & \text{if } i = q \\ c \sum_{j \in N_i} p_{i,j}\, \mathbf{r}_j & \text{if } i \neq q \end{cases}$$
where $p_{i,j} = \frac{w_{i,j}}{w_{\max}}$ for PHP' (for PHP, $p_{i,j} = \frac{w_{i,j}}{w_i}$)
$\mathrm{KZ}(i) \propto \mathrm{PHP}'(i)$
[1] Leo Katz. A new status index derived from sociometric
analysis. Psychometrika 18.1 (1953): 39-43.
KZ: Katz score
PHP: Penalized Hitting Probability
PHP': variant of PHP
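The Katz series can be evaluated for a single query node by repeatedly multiplying with $\kappa \mathbf{W}$ and summing the terms. The Python sketch below truncates the series after a fixed number of terms; the toy path graph, the function name, and $\kappa = 0.1$ are assumptions for illustration. The series only converges when $\kappa$ is smaller than the reciprocal of the largest eigenvalue of $\mathbf{W}$, which the chosen $\kappa$ satisfies for this graph.

```python
def katz_scores(adj, q, kappa=0.1, terms=60):
    """Approximate column q of P = kappa*W + kappa^2*W^2 + ... by truncation.

    adj maps each node to a dict {neighbor: edge weight}.
    """
    term = {u: 0.0 for u in adj}
    term[q] = 1.0  # start from the indicator vector of the query
    total = {u: 0.0 for u in adj}
    for _ in range(terms):
        # next term = kappa * W * (previous term)
        term = {
            i: kappa * sum(w * term[j] for j, w in adj[i].items())
            for i in adj
        }
        for u in adj:
            total[u] += term[u]
    return total


# Toy path graph 0 - 1 - 2 with unit weights (assumed).
PATH = {0: {1: 1.0}, 1: {0: 1.0, 2: 1.0}, 2: {1: 1.0}}
scores = katz_scores(PATH, q=0)
print(scores)
```

On this path, the number of walks of length $2k + 1$ from node 0 to node 1 is $2^k$, so the score of node 1 has the closed form $\kappa / (1 - 2\kappa^2)$, which gives a simple way to sanity-check the truncation.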
Extension to Absorption Probability
AP and PHP''
$\mathrm{AP}(i) \propto \lambda_i \cdot \mathrm{PHP}''(i)$
Transition probability on the original graph (PHP): $p_{i,j} = \frac{w_{i,j}}{w_i}$ if $i \neq j$, and $0$ if $i = j$
Transition probability on the graph with self-loops (AP, PHP''): $p_{i,j} = \frac{w_{i,j}}{\lambda_i + w_i}$ if $i \neq j$, and $\frac{\lambda_i}{\lambda_i + w_i}$ if $i = j$
Recursive equation for PHP'': $\mathbf{r}_i = 1$ if $i = q$, and $\mathbf{r}_i = \sum_{j \in N_i} p_{i,j}\, \mathbf{r}_j$ if $i \neq q$
[Figure: original graph and the same graph with self-loops added; the slide also shows the recursive equation for AP on the self-loop graph]
[1] Xiao-Ming Wu, et al. Learning with partially absorbing
random walks. NIPS, 2012.
AP: Absorption Probability
PHP: Penalized Hitting Probability
PHP'': variant of PHP
Reverse-Proximity Query Problem
Proximity matrix $\mathbf{P}$
• Proximity value of node $i$: $\mathbf{P}_{q,i}$, the proximity value of node $i$ w.r.t. $q$
• Reverse-proximity value of node $i$: $\mathbf{P}_{i,q}$, the proximity value of node $q$ w.r.t. $i$
Naïve solution:
• Query each node: $O(\alpha m)$
• Total running time: $O(\alpha m n)$
References:
• Andras A. Benczur, et al. SpamRank - Fully Automatic Link Spam
Detection Work in progress. In AIRWeb, 2005.
• Yuan Fang, et al. RoundTripRank: Graph-based proximity with
importance and specificity. In ICDE, 2013.
Extension to Reverse-Proximity Measures
Relationships to PHP and its variants:
• $\mathrm{rEI}(i) \propto \mathrm{PHP}(i)$
• $\mathrm{rRWR}(i) \propto \mathrm{PHP}(i)$
• $\mathrm{rPHP}(i) \propto \frac{\mathrm{PHP}(i)}{\mathrm{EI}_i(i)}$
• $\mathrm{rDHT}(i) \propto \frac{\mathrm{PHP}(i)}{\mathrm{EI}_i(i)}$
• $\mathrm{rRT}(i) \propto w_i^{1-\beta} \cdot \mathrm{PHP}(i)$, $\beta \in [0,1]$
• $\mathrm{rKZ}(i) \propto \mathrm{PHP}'(i)$
• $\mathrm{rAP}(i) \propto \mathrm{PHP}''(i)$
$\mathrm{EI}_i(i)$ is pre-computed.
Transition probabilities:
• PHP: $p_{i,j} = \frac{w_{i,j}}{w_i}$
• PHP': $p_{i,j} = \frac{w_{i,j}}{w_{\max}}$
• PHP'': $p_{i,j} = \frac{w_{i,j}}{\lambda_i + w_i}$, $\lambda_i$: constant
Abbreviations:
PHP: Penalized Hitting Probability
RWR: Random Walk with Restart
EI: Effective Importance
DHT: Discounted Hitting Time
RT: RoundTripRank
KZ: Katz score
AP: Absorption Probability
Experimental Results: Running Time
[Figure panels: RT; KZ; AP; rPHP]
Yubao Wu, Ruoming Jin, Xiang Zhang. Unified and Exact Local Search
for Random Walk Based Top-K Proximity Query in Large Graphs.
Submitted to VLDBJ, 2015.
Experimental Results: # of Visited Nodes
[Figure panels: LJ; YT]
All are local search methods
Dual Biological Networks
(a) Protein interaction network
Edge: physical binding interaction
(b) Genetic interaction network
Edge: conceptual statistical interaction ($\chi^2$ test score)
Yubao Wu, Ruoming Jin, Xiaofeng Zhu, and Xiang Zhang. Finding
dense and connected subgraphs in dual networks. ICDE, 2015.
Other Dual Networks
                     Dual social networks          Dual co-author networks
Physical network     Social network                Co-author network
Conceptual network   Interest similarity network   Research interest similarity network
Densest Connected Subgraph
Dual networks: $G_a(V, E_a)$ and $G_b(V, E_b)$
Density: $\frac{|E(S)|}{|S|}$
The Densest Connected Subgraph (DCS) Problem:
Given dual networks $G(V, E_a, E_b)$, find $S \subseteq V$ such that:
(a) $G_a[S]$ is connected;
(b) the density of $G_b[S]$ is maximized.
DCS in Dual Networks
Dual networks     DCS                            Connectivity            Density
Dual biological   Disease pathway                Signal transduction     Statistical association
Dual social       Consumer group (advertising)   Word-of-mouth           Consumer interest
Dual co-author    Research group                 Collaboration pattern   Similar research interest
Theorem: The DCS problem is NP-hard.
DCS_RDS Algorithm
Step 1. Find the densest subgraph in the
conceptual network
Step 2. Make it connected in the
physical network
DCS_GND Algorithm
At each iteration, it deletes a set of nodes that
1) have low degree in the conceptual network, and
2) can be removed without breaking the connectivity of the physical network
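The greedy node deletion idea can be sketched as follows. This is a simplified illustration rather than the DCS_GND algorithm itself: it deletes one node per iteration instead of a set, tests removability with a fresh breadth-first search instead of an articulation-point algorithm, and uses unweighted graphs. The toy networks and all function names are assumptions.

```python
from collections import deque


def is_connected(adj, nodes):
    """BFS connectivity test for the subgraph induced by `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def density(adj, nodes):
    """|E(S)| / |S| in the subgraph induced by `nodes`."""
    edges = sum(1 for u in nodes for v in adj[u] if v in nodes) / 2
    return edges / len(nodes)


def greedy_dcs(phys, conc):
    """Greedy deletion; returns (best node set, its density in `conc`)."""
    current = set(phys)
    assert is_connected(phys, current), "sketch assumes a connected physical network"
    best, best_density = set(current), density(conc, current)
    while len(current) > 1:
        # Nodes whose removal keeps the physical subgraph connected.
        removable = [u for u in current if is_connected(phys, current - {u})]
        # Among them, delete the node with the lowest conceptual degree.
        victim = min(
            removable,
            key=lambda u: (sum(1 for v in conc[u] if v in current), u),
        )
        current.remove(victim)
        d = density(conc, current)
        if d > best_density:
            best, best_density = set(current), d
    return best, best_density


# Toy dual networks on the same node set (assumed data).
# Physical network: the path 0-1-2-3-4-5.
PHYS = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
# Conceptual network: a triangle on {1, 2, 3} plus the edge 0-5.
CONC = {0: {5}, 1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: set(), 5: {0}}

print(greedy_dcs(PHYS, CONC))
```

Every intermediate node set stays connected in the physical network by construction, so the returned set always satisfies constraint (a) of the DCS problem.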
Experiments
Three kinds of dual networks
Dual networks   Physical network      Conceptual network             Data sources
Biological      protein interaction   genetic interaction            BioGrid; WTCCC
Co-author       co-author network     research interest similarity   DBLP
Social          social network        interest similarity            Flixster / Epinions
Statistics of the dual networks
Dual networks     Abbr.   #nodes    #edges in $G_a$   #edges in $G_b$
Protein-Genetic   Bio     8,468     25,715            67,744
Research-DM       DM      7,169     14,526            30,000
Research-DB       DB      6,131     17,940            30,000
Recom-Epinions    EP      49,288    487,002           313,432
Recom-Flixster    FX      786,936   7,058,819         2,713,671
DCS_$k$ from dual biological networks ($k = 40$, WTCCC)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
MYO6, CUBN, and STK39 have been reported to be associated
with hypertension.
DCS_seed from dual biological networks (WTCCC)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
Renin pathway genes are in red ellipses.
NEDD4L has been reported to be associated with hypertension.
Significance evaluation
Methods    GenGen        GRASS         Plink         HYST
DCS        2.4 × 10^-6   1.0 × 10^-6   2.3 × 10^-6   1.1 × 10^-6
DCS_k      5.6 × 10^-6   1.3 × 10^-6   4.6 × 10^-6   3.7 × 10^-6
DCS_seed   8.5 × 10^-6   4.9 × 10^-6   1.5 × 10^-6   2.5 × 10^-6
DS         0.36          0.47          0.33          0.17
MSCS       0.15          0.13          0.21          0.12
DCSs are identified from the WTCCC dataset and tested on the ARIC dataset.
DS: densest subgraph in the protein interaction network
MSCS: maximum score connected subgraph
Dual Co-Author Networks
[Figure panels: (a) subgraph in co-author network; (b) subgraph in research interest similarity network]
DCS_$k$ ($k = 30$, data mining research community)
DCS_seed (database research community)
Dual Biological Networks: Node Weights
(a) Protein interaction network
Edge: physical binding interaction
(b) Genetic interaction network
Edge: conceptual statistical interaction
Edge weights: two-locus test statistics
Genetic marker 1 (gene, marker ID)   Genetic marker 2 (gene, marker ID)   $\chi^2$ test score
SLC12A3  rs11076172                  ABCG5  rs10495909                    3.29
SLC12A3  rs11076172                  THBS2  rs9294977                     2.89
SLC12A3  rs11076172                  HDAC2  rs13194921                    2.43
Node weights: single-locus test statistics
Gene      Marker ID    $\chi^2$ test score
ESR1      rs11155820   1.18
SLC12A3   rs11076172   1.01
ABCG5     rs10495909   0.96
Extension: Node Weights
Dual networks: $G_a(V, E_a)$ and $G_b(V, E_b)$
Density (edge weighted): $\frac{e(S)}{|S|}$
Density (node/edge weighted): $\frac{e(S) + r(S)}{|S|}$
$w_{i,j}$: edge weight of edge $(i, j)$
$r(i)$: node weight of node $i$
$e(S) = \sum_{i,j \in S} w_{i,j}$
$r(S) = \sum_{i \in S} r(i)$
Yubao Wu, Xiaofeng Zhu, Li Li, Wei Fan, Ruoming Jin, Xiang Zhang.
Mining Dual Networks: Models, Algorithms and Applications.
TKDD, 2015.
Densest Connected Subgraph: Node Weights
Dual networks: $G_a(V, E_a)$ and $G_b(V, E_b)$ (node/edge weighted)
Density: $\frac{e(S) + r(S)}{|S|}$
The Densest Connected Subgraph (DCS) Problem:
Given dual networks $G(V, E_a, E_b)$, find $S \subseteq V$ such that:
(a) $G_a[S]$ is connected;
(b) the density $\frac{e(S) + r(S)}{|S|}$ of $G_b[S]$ is maximized.
DCS_RDS Algorithm
• Find the densest subgraph in the conceptual network $G_b$ (with node weights)
• Refine the densest subgraph in the physical network $G_a$
Extension of Three Algorithms
Density without node weights: $\frac{e(S)}{|S|}$
Density with node weights: $\frac{e(S) + r(S)}{|S|}$
$e(S) = \sum_{i,j \in S} w_{i,j}$, $r(S) = \sum_{i \in S} r(i)$
Algorithm                   Without node weights                  With node weights
Greedy node deletion        2-approximation                       2-approximation
Removing low degree nodes   d-core retains the densest subgraph   d-core retains the densest subgraph
Parametric maximum flow     exact densest subgraph                exact densest subgraph
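For the greedy node deletion row, a minimal single-network Python sketch with node weights might look like the following. It assumes the density $(e(S) + r(S)) / |S|$ and peels the node with the smallest combined score $w(u) + r(u)$, where $w(u)$ is the weighted degree inside the current set; that score, the toy graph, and the function name are assumptions made for this illustration.

```python
def greedy_peel(adj, node_weight):
    """Greedy node deletion for the density (e(S) + r(S)) / |S|.

    adj maps node -> {neighbor: edge weight}; node_weight maps node -> r(u).
    Returns (best node set, best density) over all sets seen while peeling.
    """
    current = set(adj)
    # Combined score: weighted degree inside the current set plus node weight.
    score = {u: sum(adj[u].values()) + node_weight[u] for u in adj}
    e_total = sum(w for u in adj for w in adj[u].values()) / 2
    r_total = sum(node_weight.values())
    best, best_density = set(current), (e_total + r_total) / len(current)
    while len(current) > 1:
        victim = min(current, key=lambda u: (score[u], u))
        current.remove(victim)
        for v, w in adj[victim].items():
            if v in current:
                score[v] -= w   # the neighbor loses this edge
                e_total -= w
        r_total -= node_weight[victim]
        d = (e_total + r_total) / len(current)
        if d > best_density:
            best, best_density = set(current), d
    return best, best_density


# Toy graph (assumed): triangle 0-1-2 with a pendant node 3 attached to 2.
GRAPH = {
    0: {1: 1.0, 2: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {0: 1.0, 1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
WEIGHTS = {0: 0.5, 1: 0.5, 2: 0.5, 3: 0.0}

print(greedy_peel(GRAPH, WEIGHTS))
```

Setting every node weight to zero reduces the sketch to the unweighted case, where the score is just the node degree.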
DCS_GND Algorithm
[Figure: $G_a$ and $G_b$ before deletion, $G_a'$ and $G_b'$ after deletion; articulation nodes are marked in $G_a$ and low degree nodes in $G_b$]
Delete the low degree non-articulation nodes, then compute the density.
Node degree used for deletion:
• Without node weights: $w(u)$
• With node weights: $w'(u) = w(u) + r(u)$
$w(u)$: node degree
$r(u)$: node weight
DCS_$k$ from dual biological networks
($k = 40$, WTCCC, node weights)
[Figure panels: (a) subgraph in protein interaction network; (b) subgraph in genetic interaction network]
MYO6, CUBN, and STK39 have been reported to be associated
with hypertension.
Significance evaluation
Node weights   Methods    GenGen        GRASS         Plink         HYST
Without        DCS        7.2 × 10^-6   8.2 × 10^-6   7.6 × 10^-6   4.8 × 10^-8
Without        DCS_k      8.9 × 10^-5   1.6 × 10^-5   2.2 × 10^-5   4.5 × 10^-7
Without        DCS_seed   6.3 × 10^-4   2.1 × 10^-5   9.3 × 10^-5   1.8 × 10^-5
With           DCS        5.8 × 10^-6   4.6 × 10^-6   6.7 × 10^-6   1.4 × 10^-6
With           DCS_k      8.2 × 10^-5   8.7 × 10^-6   9.4 × 10^-6   1.5 × 10^-7
With           DCS_seed   4.1 × 10^-4   7.7 × 10^-6   7.4 × 10^-5   9.1 × 10^-6
DCSs are identified from the WTCCC dataset and tested on the ARIC dataset.
Thank You!
• Robust local community detection
• Top-$k$ query; fast local search
• Dual networks; densest connected subgraph