Graph Sparsification Lecture

Local Sparsification for Scalable
Module Identification in
Networks
Srinivasan Parthasarathy
Joint work with V. Satuluri, Y. Ruan, D. Fuhry, Y. Zhang
Data Mining Research Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
The Data
Deluge
“Every 2 days we create as much
information as we did up to 2003”
- Eric Schmidt, Google ex-CEO
Data Storage Costs are Low
$600 buys a disk drive that can store all of the world's music.
[McKinsey Global Institute Special Report, June '11]
Data does not exist in isolation.
Data almost always exists in
connection with other data.
Social networks
VLSI networks
Protein Interactions
Data dependencies
Internet
Neighborhood graphs
All this data is only useful if we can
scalably extract useful knowledge
Challenges
1. Large Scale
Billion-edge graphs commonplace
Scalable solutions are needed
Challenges
2. Noise
Noisy links on the web, noisy protein interactions
Need mechanisms to alleviate noise
Challenges
3. Novel structure
Hub nodes, small world phenomena, clusters
of varying densities and sizes, directionality
Novel algorithms or techniques are needed
Challenges
4. Domain Specific Needs
E.g. Balance, Constraints etc.
Need mechanisms to specify
Challenges
5. Network Dynamics
How do communities evolve? Which
actors have influence? How do clusters
change as a function of external factors?
Challenges
6. Cognitive Overload
Need to support guided interaction for
human in the loop
Our Vision and Approach
Application Domains
• Bioinformatics (ISMB'07, ISMB'09, ISMB'12, ACM BCB'11, BMC'12)
• Social Network and Social Media Analysis (TKDD'09, WWW'11, WebSci'12)
Graph Pre-processing
• Sparsification (SIGMOD'11, WebSci'12)
• Near Neighbor Search, for non-graph data (PVLDB'12)
• Symmetrization, for directed graphs (EDBT'10)
Core Clustering
• Consensus Clustering (KDD'06, ISMB'07)
• Viewpoint Neighborhood Analysis (KDD'09)
• Graph Clustering via Stochastic Flows (KDD'09, BCB'10)
Dynamic Analysis and Visualization
• Event Based Analysis (KDD'07, TKDD'09)
• Network Visualization (KDD'08)
• Density Plots (SIGMOD'08, ICDE'12)
Scalable Implementations and Systems Support on Modern Architectures
• Multicore Systems (VLDB'07, VLDB'09), GPUs (VLDB'11), STCI Cell (ICS'08), Clusters (ICDM'06, SC'09, PPoPP'07, ICDE'10)
Graph Sparsification for
Community Discovery
SIGMOD ’11, WebSci’12
Is there a simple pre-processing of the graph to
reduce the edge set that can “clarify” or
“simplify” its cluster structure?
Graph Clustering and Community Discovery
Given a graph, discover groups of nodes that
are strongly connected to one another but
weakly connected to the rest of the graph.
Graph Clustering : Applications
Social Network and Graph Compression
Direct Analytics on compressed
representation
Graph Clustering : Applications
Optimize VLSI layout
Graph Clustering : Applications
Protein function prediction
Graph Clustering : Applications
Data distribution to minimize
communication and balance load
Is there a simple pre-processing of the graph to
reduce the edge set that can “clarify” or
“simplify” its cluster structure?
Preview
Original graph vs. sparsified graph
[Automatically visualized using Prefuse]
The promise
Clustering algorithms can run much faster and be
more accurate on a sparsified graph.
Ditto for Network Visualization
Utopian Objective
Retain edges which are likely to be intra-cluster
edges, while discarding likely inter-cluster edges.
What we need: a way to rank edges by “strength” or similarity.
Algorithm: Global Sparsification (G-Spar)
Parameter: Sparsification ratio, s
1. For each edge <i,j>: calculate Sim(<i,j>)
2. Retain the top s% of edges in order of Sim; discard the rest
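For concreteness, a minimal Python sketch of this procedure, assuming Sim(<i,j>) is the Jaccard similarity of the two endpoints' neighbor sets (the helper names and toy graph are illustrative, not from the paper):

```python
def jaccard(adj, i, j):
    """Jaccard similarity of the neighbor sets of i and j."""
    return len(adj[i] & adj[j]) / len(adj[i] | adj[j])

def global_sparsify(adj, s):
    """G-Spar sketch: rank every edge by similarity, keep the top s fraction globally."""
    edges = [(i, j) for i in adj for j in adj[i] if i < j]   # each undirected edge once
    edges.sort(key=lambda e: jaccard(adj, *e), reverse=True)
    keep = max(1, int(s * len(edges)))
    return edges[:keep]

# Toy graph as adjacency sets; keep the top 50% of edges
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(global_sparsify(adj, 0.5))
```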
With global sparsification, dense clusters are over-represented and sparse clusters under-represented.
Works great when the goal is just to find the top communities.
Algorithm: Local Sparsification (L-Spar)
Parameter: Sparsification exponent, e (0 < e < 1)
1. For each node i of degree d_i:
   (i) For each neighbor j: calculate Sim(<i,j>)
   (ii) Retain the top (d_i)^e neighbors of i, in order of Sim
Underscoring the importance of local ranking.
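A matching sketch of the local variant, under the same assumptions as the G-Spar example above; here an undirected edge survives if either endpoint retains it among its top (d_i)^e neighbors:

```python
import math

def jaccard(adj, i, j):
    """Jaccard similarity of the neighbor sets of i and j."""
    return len(adj[i] & adj[j]) / len(adj[i] | adj[j])

def local_sparsify(adj, e):
    """L-Spar sketch: node i ranks its own neighbors and keeps the top ceil(d_i ** e)."""
    kept = set()
    for i, neighbors in adj.items():
        ranked = sorted(neighbors, key=lambda j: jaccard(adj, i, j), reverse=True)
        for j in ranked[:math.ceil(len(neighbors) ** e)]:
            kept.add((min(i, j), max(i, j)))   # edge survives if either endpoint keeps it
    return kept

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(local_sparsify(adj, 0.5))
```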
Ensures representation of clusters of varying densities
But...
Similarity computation is expensive!
A randomized, approximate solution based on
Minwise Hashing
[Broder et al., 1998]
Minwise Hashing
Universe: { dog, cat, lion, tiger, mouse }
Permutation 1: [ cat, mouse, lion, dog, tiger ]
Permutation 2: [ lion, cat, mouse, dog, tiger ]
A = { mouse, lion }
mh1(A) = first element of A under permutation 1 = mouse
mh2(A) = first element of A under permutation 2 = lion
Key Fact
For two sets A, B, and a min-hash function mh_i():
    Pr[ mh_i(A) = mh_i(B) ] = |A ∩ B| / |A ∪ B| = Sim(A, B)
Unbiased estimator for Sim using k hashes:
    Sim(A, B) ≈ (1/k) · Σ_{i=1..k} 1[ mh_i(A) = mh_i(B) ]
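A minimal sketch of this estimator, using salted hash functions in place of true random permutations (an assumption of this sketch, not a detail from the slides):

```python
import random

def minhash_signature(s, salts):
    """One min-hash value per salt: the element of s with the smallest salted hash."""
    return [min(s, key=lambda x: hash((salt, x))) for salt in salts]

def estimate_sim(sig_a, sig_b):
    """Unbiased estimate of Jaccard similarity: fraction of agreeing positions."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
salts = [random.random() for _ in range(100)]          # k = 100 hash functions
A = {"mouse", "lion"}
B = {"mouse", "lion", "tiger"}
print(estimate_sim(minhash_signature(A, salts), minhash_signature(B, salts)))
# should come out close to |A ∩ B| / |A ∪ B| = 2/3
```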
Time complexity using Minwise Hashing
O(k · |E|), where k is the number of hashes and |E| the number of edges.
Only 2 sequential passes over the input: great for disk-resident data.
Note: exact similarity is less important than the relative ranking, so a lower k suffices.
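To illustrate the two passes, a sketch of L-Spar on top of min-hashing: the first pass computes k min-hash values per adjacency list, and the second ranks each node's neighbors by the number of matching positions. This is an illustrative sketch with assumed helper names, not the paper's implementation:

```python
import math, random

def minhash_signatures(adj, k, seed=0):
    """Pass 1: k min-hash values per node, computed from its adjacency set."""
    salts = [random.Random(seed + i).random() for i in range(k)]
    return {i: [min(nbrs, key=lambda x: hash((s, x))) for s in salts]
            for i, nbrs in adj.items()}

def lspar_minhash(adj, e, k=30):
    """Pass 2: rank each node's neighbors by estimated similarity
    (matching min-hash positions) and keep the top ceil(d_i ** e)."""
    sig = minhash_signatures(adj, k)
    kept = set()
    for i, nbrs in adj.items():
        ranked = sorted(nbrs,
                        key=lambda j: sum(a == b for a, b in zip(sig[i], sig[j])),
                        reverse=True)
        for j in ranked[:math.ceil(len(nbrs) ** e)]:
            kept.add((min(i, j), max(i, j)))
    return kept

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(lspar_minhash(adj, 0.5))
```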
Theoretical Analysis of L-Spar: Main Results
Q: Why choose the top d^e edges for a node of degree d?
A: Conservatively sparsify low-degree nodes, aggressively sparsify hub nodes. Easy to control the degree of sparsification.
• Proposition: If the input graph has a power-law degree distribution with exponent α, then the sparsified graph also has a power-law degree distribution with exponent (α + e − 1) / e.
• Corollary: The sparsification ratio corresponding to exponent e is no more than (α − 2) / (α − e − 1).
• For α = 2.1 and e = 0.5, ~17% of edges will be retained.
• Higher α (steeper power laws) and/or lower e leads to more sparsification.
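A one-line numeric check of the corollary, assuming the bound stated above:

```python
def retention_bound(alpha, e):
    """Upper bound on the fraction of edges L-Spar retains (corollary above)."""
    return (alpha - 2) / (alpha - e - 1)

print(round(retention_bound(2.1, 0.5), 3))   # 0.167: roughly 17% of edges retained
```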
Experiments
Datasets
• 3 PPI networks (BioGrid, DIP, Human)
• 2 Information (Wiki, Flickr) & 2 Social (Orkut, Twitter) networks
• Largest network (Orkut), roughly a billion edges
• Ground truth available for PPI networks and Wiki
Clustering algorithms
• Metis [Karypis & Kumar '98], MLR-MCL [Satuluri & Parthasarathy '09], Metis+MQI [Lang & Rao '04], Graclus [Dhillon et al. '07], spectral methods [Shi '00], edge-based agglomerative/divisive methods [Newman '04]
Compared sparsifications
• L-Spar, G-Spar, RandomEdge and ForestFire
Results Using Metis

Dataset (n, m)           Spars. Ratio   Random (Speed / Quality)   G-Spar (Speed / Quality)   L-Spar (Speed / Quality)
Yeast_Noisy (6k, 200k)   17%            11x / -10%                 30x / -15%                 25x / +11%
Wiki (1.1M, 53M)         15%            8x / -26%                  104x / -24%                52x / +50%
Orkut (3M, 117M)         17%            13x / +20%                 30x / +60%                 36x / +60%

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
Same sparsification ratio for all 3 methods.
Random and G-Spar: good speedups, but typically a loss in quality.
L-Spar: great speedups and quality.
L-Spar: Results Using MLR-MCL

Dataset (n, m)           Spars. Ratio   L-Spar (Speed / Quality)
Yeast_Noisy (6k, 200k)   17%            17x / +4%
Wiki (1.1M, 53M)         15%            23x / -4.5%
Orkut (3M, 117M)         17%            22x / 0%

[Hardware: Quad-core Intel i5 CPU, 3.2 GHz, with 16GB RAM]
L-Spar: Qualitative Examples

Graph (Wiki article)
  Retained neighbors: Graph Theory, Adjacency list, Adjacency matrix, Model theory
  Discarded neighbors: Tessellation, Roman letters used in Mathematics, Morphism

Jack Dorsey (Twitter user, and co-founder)
  Retained neighbors: Biz Stone, Evan Williams, Jason Goldman, Sarah Lacy (Twitter executives, Silicon Valley figures)
  Discarded neighbors: Alyssa Milano, JetBlue Airways, WholeFoods, Parul Sharma

Gladiator (Flickr tag)
  Retained neighbors: colosseum, worldheritage, site, italy
  Discarded neighbors: europe, travel, canon, sky, summer
Impact of Sparsification on Noisy Data
As the graphs get noisier, L-Spar is increasingly
beneficial.
Impact of Sparsification on Spectrum: Yeast PPI, Epinion, Human PPI, Flickr
[Spectrum plots for each dataset]
Local sparsification seems to match the trends of the original graph; global sparsification results in multiple components.
Anatomy of a density plot: the vertices of the graph in a specific ordering (x-axis) against some measure of density (y-axis).
Density Overlay Plots: visual comparison between global and local sparsification.
Summary
Sparsification: simple pre-processing that makes a big difference
• Only tens of seconds to execute on multi-million-node graphs.
• Reduces clustering time from hours down to minutes.
• Improves accuracy by removing noisy edges, for several algorithms.
• Helps visualization.
Ongoing and future work
• Spectral results suggest one might be able to provide a theoretical rationale. Can we tease it out?
• Investigate other kinds of graphs, incorporating content, and novel applications (e.g. wireless sensor networks, VLSI design).
Prior Work
• Random edge sampling [Karger '94]
• Sampling in proportion to effective resistances: good guarantees, but very slow [Spielman and Srivastava '08]
• Matrix sparsification: fast, but the same as random sampling in the absence of weights [Arora et al. '06]
Topological Measures
The normalized cut of a vertex set S is defined as [18]:

Ncut(S) = ( Σ_{i∈S, j∈S̄} A(i,j) ) / ( Σ_{i∈S} degree(i) )  +  ( Σ_{i∈S, j∈S̄} A(i,j) ) / ( Σ_{j∈S̄} degree(j) )    (1)

where A is the (symmetric) adjacency matrix and S̄ = V − S is the complement of S. Intuitively, groups with low normalized cut are well connected amongst themselves but are sparsely connected to the rest of the graph.
The connection between random walks and normalized cuts is as follows [18]: Ncut(S) in Equation 1 is the same as the probability that a random walk started in the stationary distribution will transition either from a vertex in S to a vertex in S̄ or vice versa, in one step [18]:

Ncut(S) = Pr(S → S̄) / Pr(S)  +  Pr(S̄ → S) / Pr(S̄)    (2)

Using the unifying concept of random walks, Equation 2 ...
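A small Python sketch that evaluates Equation 1 directly on a dense symmetric adjacency matrix (an illustrative helper, not code from the paper):

```python
def ncut(A, S):
    """Normalized cut of vertex set S (Equation 1): cut weight normalized by
    the total degree of S and by the total degree of its complement."""
    n = len(A)
    S = set(S)
    Sbar = set(range(n)) - S
    cut = sum(A[i][j] for i in S for j in Sbar)
    deg_S = sum(A[i][j] for i in S for j in range(n))
    deg_Sbar = sum(A[i][j] for i in Sbar for j in range(n))
    return cut / deg_S + cut / deg_Sbar

# Two triangles joined by a single edge: splitting them gives a low Ncut
A = [[0, 1, 1, 1, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(ncut(A, {0, 1, 2}))   # 1/7 + 1/7 ≈ 0.286
```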
Modularity (from Wikipedia)
• Modularity is the fraction of the edges that fall
within the given groups minus the expected
such fraction if edges were distributed at
random. The value of the modularity lies in the
range [−1/2,1). It is positive if the number of
edges within groups exceeds the number
expected on the basis of chance.
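For reference, this description corresponds to the standard Newman-Girvan formula, where m is the number of edges, A the adjacency matrix, k_i the degree of node i, c_i the community of node i, and δ the Kronecker delta:

Q = (1 / 2m) · Σ_{i,j} [ A_ij − (k_i · k_j) / 2m ] · δ(c_i, c_j)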
Proposition 1. For input graphs with a power-law degree distribution with exponent α, the locally sparsified graphs obtained using Algorithm 2 also have a power-law degree distribution with exponent (α + e − 1)/e.

Proof. Let D_orig and D_sparse be random variables for the degree of the original and the sparsified graphs respectively. Since D_orig follows a power law with exponent α, we have p(D_orig = d) = C·d^(−α), and the complementary CDF is given by P(D_orig > d) = C·d^(1−α) (approximating the discrete power-law distribution with a continuous one, as is common [8]). From Algorithm 2, D_sparse = D_orig^e. Then we have

P(D_sparse > d) = P(D_orig^e > d) = P(D_orig > d^(1/e)) = C·d^((1−α)/e) = C·d^(1 − (α + e − 1)/e)

Hence, D_sparse follows a power law with exponent (α + e − 1)/e.
Let the cut-off parameter for D_orig be d_cut (i.e. the power-law distribution does not hold below d_cut [8]); then the corresponding cut-off for D_sparse will be d_cut^e.
Proposition 2. The sparsification ratio (i.e. the number of edges in the sparsified graph |E_sparse| versus the original graph |E|), for the power-law part of the degree distribution, is at most (α − 2) / (α − e − 1).

Proof. Notice that the sparsification ratio is the same as the ratio of the expected degree on the sparse graph versus the expected degree on the original graph. We have, from the expressions for the means of power laws [8]:

E[D_orig] = ((α − 1) / (α − 2)) · d_cut
E[D_sparse] = (((α + e − 1)/e − 1) / ((α + e − 1)/e − 2)) · d_cut^e

Then the sparsification ratio is

E[D_sparse] / E[D_orig] = (((α + e − 1)/e − 1) / ((α + e − 1)/e − 2)) · ((α − 2) / (α − 1)) · d_cut^(e−1)
                        ≤ (((α + e − 1)/e − 1) / ((α + e − 1)/e − 2)) · ((α − 2) / (α − 1))
                        = (α − 2) / (α − e − 1)

since e < 1 and d_cut ≥ 1 imply d_cut^(e−1) ≤ 1.
For a fixed e and a graph with known α, this can be calculated in advance of the sparsification.
The MCL algorithm
Input: A, adjacency matrix
Initialize M to M_G, the canonical transition matrix: M := M_G := (A + I)·D^(-1)
Repeat until converged:
• Expand: M := M * M. Enhances flow to well-connected nodes (i.e. nodes within a community).
• Inflate: M := M.^r (r usually 2), then renormalize columns. Increases inequality in each column ("rich get richer, poor get poorer"), which reduces flow across communities.
• Prune: remove entries close to zero. Saves memory and enables faster convergence.
Output clusters.
Clustering interpretation: nodes flowing into the same sink node are assigned the same cluster label.
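A compact NumPy sketch of the loop described above (a minimal illustration of plain MCL, not the optimized MLR-MCL used in the experiments):

```python
import numpy as np

def mcl(A, r=2, prune_eps=1e-5, max_iter=100):
    """Basic MCL sketch: expand, inflate, prune until the flow matrix stops changing."""
    A = np.asarray(A, dtype=float)
    M = A + np.eye(len(A))                 # add self-loops: (A + I)
    M = M / M.sum(axis=0)                  # column-normalize: canonical transition matrix
    for _ in range(max_iter):
        prev = M.copy()
        M = M @ M                          # Expand: M := M * M
        M = M ** r                         # Inflate: element-wise power
        M[M < prune_eps] = 0.0             # Prune entries close to zero
        M = M / M.sum(axis=0)              # renormalize columns
        if np.allclose(M, prev, atol=1e-8):
            break
    # Nodes flowing into the same sink (attractor) get the same cluster label
    return np.argmax(M, axis=0)

# Two triangles joined by a single edge: expect nodes 0-2 in one cluster, 3-5 in another
A = [[0, 1, 1, 1, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [1, 1, 0, 0, 0, 0],
     [1, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 1],
     [0, 0, 0, 1, 1, 0]]
print(mcl(A))
```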