ICDE 2014
LinkSCAN*: Overlapping Community Detection
Using the Link-Space Transformation
Sungsu Lim †, Seungwoo Ryu ‡, Sejeong Kwon §,
Kyomin Jung ¶, and Jae-Gil Lee †
† Dept. of Knowledge Service Engineering, KAIST
‡ Samsung Advanced Institute of Technology
§ Graduate School of Cultural Technology, KAIST
¶ Dept. of Electrical and Computer Engineering, SNU
Contents
 Motivation
 Link-Space Transformation
 Proposed Algorithm: LinkSCAN*
 Experimental Evaluation
 Conclusions
April 1,2014
Community Detection
 Network communities
 Sets of nodes where the nodes in the same set are
similar (more internal links) and the nodes in different
sets are dissimilar (fewer external links)
 Communities, clusters, modules, groups, etc.
 Non-overlapping community detection
 Finding a good partition of nodes
Clusters do NOT overlap
Overlapping
Community Detection
 A person (node) can belong to multiple
communities, e.g., family, friends, colleagues, etc.
 Overlapping community detection allows a
node to be included in multiple groups
Existing Methods
 Node-based: A node overlaps if more than one of its belonging
coefficients is larger than some threshold
 Label Propagation (COPRA) [Gregory 2010, Subelj and Bajec 2011]
 Structure-based: A node overlaps if it participates in multiple
base structures with different memberships
 Clique Percolation (CPM) [Palla et al. 2005, Derenyi et al. 2005]
 Link Partition [Evans and Lambiotte 2009, Ahn et al. 2010]
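[Figure: three panels, one per method. Label propagation: node i has belonging coefficients f(i,c1)=0.35, f(i,c2)=0.05, f(i,c3)=0.4 with threshold 𝜏=0.3, where f(i,c) is the mean of f(j,c) over j ∈ nbr(i). Clique percolation: the base structure is cliques of size 𝑘 (𝑘=4). Link partition: the base structure is links.]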
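The node-based rule can be sketched in a few lines of Python. This is a minimal illustration, not COPRA itself: the function names `propagate` and `communities_of` are hypothetical, and the full algorithm also renormalizes coefficients after thresholding. The first example uses the values shown on this slide; the two-neighbour input is invented for illustration.

```python
def propagate(belonging, adj, i):
    """Belonging coefficients of node i as the mean of its neighbours' coefficients."""
    communities = {c for j in adj[i] for c in belonging[j]}
    return {
        c: sum(belonging[j].get(c, 0.0) for j in adj[i]) / len(adj[i])
        for c in communities
    }


def communities_of(coeffs, tau):
    """Communities whose belonging coefficient is larger than the threshold tau."""
    return {c for c, f in coeffs.items() if f > tau}


# Values from the slide: f(i,c1)=0.35, f(i,c2)=0.05, f(i,c3)=0.4, tau=0.3.
f_i = {"c1": 0.35, "c2": 0.05, "c3": 0.4}
kept = communities_of(f_i, tau=0.3)
is_overlapping = len(kept) > 1  # more than one coefficient exceeds tau

# Mean update for a node with two neighbours (hypothetical toy input).
adj = {"i": ["a", "b"]}
belonging = {"a": {"c1": 1.0}, "b": {"c1": 0.5, "c2": 0.5}}
f_new = propagate(belonging, adj, "i")
```

Here node i keeps c1 and c3 and is therefore reported as overlapping; the outcome depends entirely on where the threshold sits relative to the coefficients.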
Limitations of Existing Methods
 The existing methods do not perform well for
 1. networks with many highly overlapping nodes,
 2. networks with various base structures, and
 3. networks with many weak ties
[Figure: three failure cases. (1) COPRA fails: node i is overlapping, but f(i,c1)=0.2, f(i,c2)=0.15, f(i,c3)=0.25, f(i,c4)=0.2 are all below the threshold 𝜏=0.3. (2) CPM fails (𝑘≥3): node i is non-overlapping. (3) Link partition fails: node i is non-overlapping and is attached through a weak tie.]
Our Solution
 We propose a new framework called the link-space
transformation that transforms a given graph into
the link-space graph
 We develop an algorithm that performs a non-overlapping clustering on the link-space graph,
which in turn yields an overlapping clustering of the original graph
[Figure: Original Graph --(Link-Space Transformation)--> Link-Space Graph --(Non-overlapping Clustering)--> Link Communities --(Membership Translation)--> Overlapping Communities]
Overall Procedure
 We propose an overlapping clustering algorithm
using the link-space transformation
[Figure: Original Graph --(Link-Space Transformation)--> Link-Space Graph --(Non-overlapping Clustering)--> Link Communities --(Membership Translation)--> Overlapping Communities]
Link-Space Transformation
 Topological structure
 Each link of an original graph maps to a node of the link-space graph
 Two nodes of the link-space graph are adjacent if the corresponding
two links of the original graph are incident
 Weights
 Weights of links of the link-space graph are calculated from the
similarity of corresponding links of the original graph
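[Figure: an original graph (nodes 0 to 8, including nodes i, j, and k) and its link-space graph, whose nodes are the links of the original graph (e.g., ik and jk)]
$w(v_{ik}, v_{jk}) = \sigma(e_{ik}, e_{jk})$, where $v_{ik}$ is the link-space node corresponding to the link $e_{ik}$ of the original graph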
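The transformation can be sketched as follows. The slides do not define the similarity σ, so this sketch assumes the Jaccard coefficient of the closed neighbourhoods of the two non-shared endpoints (in the style of Ahn et al. 2010); the function name `link_space_graph` is hypothetical. The edge list is the six-node toy graph used later in the deck (links 03, 12, 13, 23, 34, 35, 45).

```python
from fractions import Fraction
from itertools import combinations


def link_space_graph(edges):
    """Map each link to a node; connect two link-nodes if the links share an endpoint."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Closed neighbourhood: a node together with its neighbours.
    closed = {v: nbrs | {v} for v, nbrs in adj.items()}

    weights = {}
    for k, nbrs in adj.items():
        # Links e_ik and e_jk are incident at the shared endpoint k.
        for i, j in combinations(sorted(nbrs), 2):
            e_ik = tuple(sorted((i, k)))
            e_jk = tuple(sorted((j, k)))
            # ASSUMPTION: sigma is the Jaccard coefficient of the closed
            # neighbourhoods of the non-shared endpoints i and j.
            sigma = Fraction(len(closed[i] & closed[j]), len(closed[i] | closed[j]))
            weights[frozenset((e_ik, e_jk))] = sigma
    return weights


edges = [(0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
w = link_space_graph(edges)
heavy = {pair: s for pair, s in w.items() if s >= Fraction(1, 3)}
```

Under this assumed σ, the toy graph yields two link pairs of weight 1, four of weight 1/2, and all remaining weights below 1/3, which agrees with the weights annotated in the deck's toy example.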
Clustering on Link-Space Graph
 Applying a non-overlapping clustering algorithm to
the link-space graph
 We use structural clustering, which can classify a node
as a hub or an outlier (neutral membership)
[Figure: the original graph (nodes 0 to 5; links 03, 12, 13, 23, 34, 35, 45) and the non-overlapping clustering on its link-space graph. The link-space links drawn have weights 1 or 1/2; all other weights are less than 1/3.]
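A simplified SCAN-style structural clustering on a weighted link-space graph might look like the sketch below. It is not the authors' implementation: the parameters `eps` and `mu`, their values, and the core rule (at least `mu` neighbours with weight at least `eps`) are assumptions, and the single sub-threshold weight is an illustrative value below 1/3.

```python
from collections import deque


def structural_clustering(nodes, weights, eps, mu):
    """Grow clusters from core nodes; nodes reached by no core stay neutral."""
    nbrs = {v: set() for v in nodes}
    for (a, b), w in weights.items():
        if w >= eps:  # keep only sufficiently similar pairs
            nbrs[a].add(b)
            nbrs[b].add(a)
    cores = {v for v in nodes if len(nbrs[v]) >= mu}

    label, next_id = {}, 0
    for seed in sorted(cores):
        if seed in label:
            continue
        label[seed] = next_id
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for u in nbrs[v]:
                if u not in label:
                    label[u] = next_id
                    if u in cores:  # only cores expand the cluster further
                        queue.append(u)
        next_id += 1
    neutral = [v for v in nodes if v not in label]  # hubs or outliers
    return label, neutral


# Link-space nodes are named after the links of the toy graph.
nodes = ["03", "12", "13", "23", "34", "35", "45"]
weights = {
    ("13", "23"): 1.0, ("12", "13"): 0.5, ("12", "23"): 0.5,
    ("34", "35"): 1.0, ("34", "45"): 0.5, ("35", "45"): 0.5,
    ("03", "13"): 0.25,  # illustrative weight below 1/3
}
label, neutral = structural_clustering(nodes, weights, eps=0.5, mu=2)
```

With these settings the link-nodes 12, 13, 23 form one cluster, 34, 35, 45 form another, and 03 is left with neutral membership.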
Membership Translation
 Memberships of nodes of the link-space graph map
to the memberships of links of the original graph
 Memberships of a node of the original graph are
derived from the memberships of its incident links
[Figure: the non-overlapping clustering on the link-space graph (link-nodes 03, 12, 13, 23, 34, 35, 45) and its membership translation back to the original graph (nodes 0 to 5)]
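The translation step can be sketched as below, taking the simplest reading of the rule: a node inherits every community of its incident links. The function name `translate_memberships` is hypothetical, and the link clustering used as input is an assumed result for the toy graph (link 03 is left unclustered, i.e. neutral).

```python
def translate_memberships(link_labels):
    """Give each node the set of communities of its clustered incident links."""
    node_memberships = {}
    for (u, v), community in link_labels.items():
        node_memberships.setdefault(u, set()).add(community)
        node_memberships.setdefault(v, set()).add(community)
    return node_memberships


# Assumed link communities for the toy graph; link (0, 3) is neutral and omitted.
link_labels = {
    (1, 2): "A", (1, 3): "A", (2, 3): "A",
    (3, 4): "B", (3, 5): "B", (4, 5): "B",
}
memberships = translate_memberships(link_labels)
overlapping = sorted(v for v, cs in memberships.items() if len(cs) > 1)
```

In this input only node 3 has incident links in both communities, so it is the only overlapping node, and node 0 receives no community because its single link is neutral.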
Advantages of Link-Space Graph
 Finding disjoint communities in the link-space graph yields
overlapping communities of the original graph, and the original
structure is preserved because the link similarity properly
reflects the structure of the original graph.
[Figure: the link-space graph makes overlapping communities easier to find while preserving the original structure]
LinkSCAN*
 We propose an efficient overlapping clustering
algorithm using the link-space transformation
[Figure: Original Graph --(Link-Space Transformation)--> Link-Space Graph --(Structural Clustering)--> Link Communities --(Membership Translation)--> Overlapping Communities]
Note: for a massive graph, the link-space graph may be dense
[Figure: Original Graph --(Link-Space Transformation)--> Link-Space Graph --(Link Sampling)--> Sampled Graph --(Structural Clustering)--> Link Communities --(Membership Translation)--> Overlapping Communities]
Link Sampling
 Sampling strategy: for each node $v$, we sample $n_v$
incident links of $v$, where $n_v = \min\{d_v,\ \alpha + \beta \ln d_v\}$
and $d_v$ is the degree of $v$
 Thm 1 guarantees that sampling errors are not
significant even when $n_v$ is small
 For real networks, the results on a sampled graph and on the link-space graph
are close (NMI > 0.9), while the sampling rate is small (~0.1)
 Thm 1 (Error bound)
 By the Chernoff bound, the estimation error in selecting
core nodes decreases exponentially as the $n_v$ values increase.
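The sampling rule can be sketched as follows for any graph given as an adjacency dictionary (the deck's pipeline applies it to the link-space graph). The function names, the ceiling used to make the sample size an integer, and the choice of uniform sampling without replacement are assumptions; the slides give only the formula for the sample size.

```python
import math
import random


def sample_size(degree, alpha, beta):
    """n_v = min(d_v, alpha + beta * ln d_v), rounded up to an integer."""
    if degree == 0:
        return 0
    # ASSUMPTION: the slides do not say how to round; we take the ceiling.
    return min(degree, math.ceil(alpha + beta * math.log(degree)))


def sample_links(adj, alpha, beta, seed=0):
    """For every node v, keep n_v of its incident links, chosen uniformly at random."""
    rng = random.Random(seed)
    sampled = set()
    for v, nbrs in adj.items():
        n_v = sample_size(len(nbrs), alpha, beta)
        for u in rng.sample(sorted(nbrs), n_v):
            sampled.add((min(u, v), max(u, v)))  # store each link once
    return sampled


# Hypothetical dense input: a complete graph on 50 nodes (1225 links).
nodes = range(50)
adj = {v: {u for u in nodes if u != v} for v in nodes}
sampled = sample_links(adj, alpha=2, beta=1)
```

Because the sample size grows only logarithmically with the degree, each of the 50 nodes keeps 6 of its 49 links here, so at most 300 of the 1225 links survive.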
Network Datasets
 Synthetic network: LFR benchmark networks
[Lancichinetti and Fortunato 2009]
 Real network: Social and information networks
[snap.stanford.edu/data/ and www.nd.edu/~networks/resources.htm]
Network        # nodes      # links      Aver. degree   Clust. Coeff.
DBLP           1,068,037    3,800,963    7.50           0.19
Amazon         334,863      925,872      5.53           0.21
Enron-email    36,692       183,831      10.02          0.08
Brightkite     58,228       214,078      7.35           0.11
Facebook       63,392       816,886      25.77          0.15
WWW            325,729      1,090,108    6.69           0.09
Performance Evaluation
 When ground-truth is known
 NMI for overlapping clustering [Lancichinetti et al. 2009]
 F-score (performance of identifying overlapping nodes)
 When ground-truth is unknown
 Quality (Mov): Modularity for overlapping clustering [Lazar et al. 2010]
 Coverage (CC): Clustering coverage [Ahn et al. 2010]
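As one way to make the F-score concrete, the sketch below scores how well a method identifies overlapping nodes: a node counts as overlapping if it belongs to more than one community, and precision and recall are computed over those node sets. The function name and the toy memberships are hypothetical, and the exact definition used in the evaluation may differ.

```python
def overlap_f_score(true_memberships, found_memberships):
    """F-score of the detected overlapping nodes against the ground truth."""
    true_ov = {v for v, cs in true_memberships.items() if len(cs) > 1}
    found_ov = {v for v, cs in found_memberships.items() if len(cs) > 1}
    hits = len(true_ov & found_ov)
    if hits == 0:
        return 0.0
    precision = hits / len(found_ov)
    recall = hits / len(true_ov)
    return 2 * precision * recall / (precision + recall)


# Hypothetical memberships: node -> set of community ids.
truth = {"a": {0, 1}, "b": {0, 1}, "c": {0}, "d": {1}}
found = {"a": {0, 1}, "b": {0}, "c": {0, 1}, "d": {1}}
score = overlap_f_score(truth, found)
```

In this toy case one of the two detected overlapping nodes is correct and one of the two true overlapping nodes is found, so precision and recall are both 0.5.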
Problem 1
 For networks with many highly overlapping nodes,
LinkSCAN* outperforms the existing methods.
Problem 2
 For networks with various base structures, our
method performs well compared to the existing
methods.
Problem 3
 For networks with many weak ties, the existing
methods fail on the toy networks shown, whereas
LinkSCAN* detects all the clusters correctly.
Real Networks
 For real network datasets, the normalized measure
of (Quality + Coverage) indicates that LinkSCAN* is
better than the existing methods.
Link Sampling
 Comparing the use of the full link-space
graph (LinkSCAN) with the use of sampled graphs
(LinkSCAN*) shows that LinkSCAN* improves
efficiency at the cost of only small errors
[Figure: LinkSCAN vs. LinkSCAN* on the Enron-email network (# nodes = 37K, # links = 184K; 𝛼 = 0.5𝑑 ~ 16𝑑, 𝛽 = 1)]
Scalability
 The running time of LinkSCAN* for a set of LFR
benchmark networks shows that LinkSCAN* has
near-linear scalability
[Figure: running time on LFR benchmark networks (# nodes = 1K to 1M, # links = 10K to 10M; 𝛼 = 2𝑑, 𝛽 = 1)]
Conclusions
 We propose the notion of the link-space
transformation and develop a new overlapping
clustering algorithm, LinkSCAN*, that satisfies
membership neutrality
 LinkSCAN* outperforms existing algorithms for
networks with many highly overlapping nodes and
those with various base structures
Acknowledgement
 Coauthors
 Funding Agencies
 This research was supported by the National Research
Foundation of Korea
Thank You!