Estimating Clique Composition and Size Distributions from Sampled Network Data
Minas Gjoka, Emily Smith, Carter T. Butts
University of California, Irvine
Outline
• Problem statement
• Estimation methodology
• Results with real-life graphs
Cliques
A complete subgraph that contains i vertices is an order-i clique.
A maximal clique is a clique that is not included in a larger clique.
[Figure: example cliques of order 1, 2, 3, 4, 5, …, i]
[Figure: an order-3 clique on vertices a, b, c and an order-4 clique on vertices a, b, c, d; the order-4 clique contains 4 non-maximal order-3 cliques]
Counting of Cliques
Ci is the count of order-i cliques (maximal or non-maximal)
[Figure: graph G with vertices 1-8, with example cliques of order 1, 2, 3 and 4 marked as C1, C2, C3 and C4]
Clique Distribution of G
C = (C1, C2, C3, C4) = (0, 1, 2, 1)
Goal 1: Estimate Ci (for all i) in graph G from sampled network data
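To make the definition of Ci concrete, here is a minimal brute-force sketch that counts all order-i cliques (maximal or not) in a tiny graph. The toy graph and the function name `count_cliques_by_order` are illustrative assumptions, not the graph G from the slides, and the exhaustive search is only practical for very small graphs.

```python
from itertools import combinations

def count_cliques_by_order(adj, max_order):
    """Return {i: C_i}, counting every order-i clique (maximal or not)."""
    nodes = sorted(adj)
    counts = {}
    for i in range(1, max_order + 1):
        counts[i] = sum(
            1
            for group in combinations(nodes, i)
            # A group is a clique when every pair of its vertices is adjacent.
            if all(v in adj[u] for u, v in combinations(group, 2))
        )
    return counts

# Hypothetical toy graph: a 4-clique on {1, 2, 3, 4} plus a pendant edge 4-5.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clique_counts = count_cliques_by_order(adj, max_order=4)
print(clique_counts)  # {1: 5, 2: 7, 3: 4, 4: 1}
```

The four order-3 cliques are exactly the non-maximal triangles inside the single order-4 clique.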
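Counting of Cliques
Vertex Attributes
Vertex attribute vector Xj, j=1..p, p<=N
Cu is the count of order-u cliques, where u = [u1 .. up] gives the number of clique vertices with each attribute value
[Figure: graph G with vertices 1-8 and p=3 attribute values; example cliques with composition u = [3 0 0], u = [2 1 0] and u = [2 0 1]]
Clique Composition Distribution of G
Goal 2: Estimate Cu (for all u) in graph G from sampled network data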
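As a sketch of what the clique composition distribution looks like in practice, the snippet below tallies Cu for order-3 cliques in a toy graph with p=3 attribute values. The graph, the attribute labels "x", "y", "z" and the function name are hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations

def clique_composition_counts(adj, attr, categories, order):
    """Return a Counter mapping composition vector u -> C_u for cliques of one order."""
    counts = Counter()
    for group in combinations(sorted(adj), order):
        if all(v in adj[u] for u, v in combinations(group, 2)):
            # u[k] is the number of clique vertices whose attribute is categories[k].
            u_vec = tuple(sum(attr[v] == c for v in group) for c in categories)
            counts[u_vec] += 1
    return counts

# Hypothetical toy graph: a 4-clique on {1, 2, 3, 4} plus a pendant edge 4-5.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Hypothetical vertex attributes with p = 3 possible values.
attr = {1: "x", 2: "x", 3: "x", 4: "y", 5: "z"}
composition_counts = clique_composition_counts(adj, attr, ("x", "y", "z"), order=3)
print(dict(composition_counts))  # {(3, 0, 0): 1, (2, 1, 0): 3}
```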
Motivation
• Counting of Cliques
– cliques describe local structure (clustering, cohesive subgroups)
– algorithmic implications of cliques in engineering context
– cliques used as input in network models
• Sampled network data
– unknown graphs with access limitations
– massive known graphs
Related Work
• Model-based methods
– Do not scale
– Do not help with counting
• Design-based methods
– Subgraph (or motif) counting tools that use sampling, e.g. MFinder, FANMOD, MODA
o No support for subgraphs of size larger than 10
o No support for vertex attributes
o Biased estimation
Estimation Methodology
1. Collect an egocentric network sample H1,..,Hn
a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
– uniform independence sampling
– weighted independence sampling
– link-trace sampling
– with replacement
– without replacement
Methodology
1. Collect an egocentric network sample H1,..,Hn
a) Collect a probability sample of “n” nodes from the graph: Vj, X[Vj], j=1..n
b) Fetch the egonet of each sampled node: G[Vj], j=1..n
2. Calculate the clique count Ci (or Cu) in each egonet Hj
– can use existing exact clique counting algorithms
– clique type is determined by counting algorithm
3. Apply estimation method that combines calculations
– Clique Degree Sums (CDS)
o labeling of neighbors not required, more space efficient
– Distinct Clique Counting (CC)
o higher accuracy
[Figure: graph G(V,E) with vertices 1-8 and the egonets of the n=2 sampled nodes, illustrated for C3]
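Steps 1 and 2 can be sketched as follows, under the assumption that the egonet of a node is the subgraph induced by the node and its neighbors. The toy graph and the helper names `egonet` and `clique_degree` are hypothetical, and the exhaustive clique check stands in for whatever exact clique counting algorithm is actually used.

```python
import random
from itertools import combinations

def egonet(adj, ego):
    """Induced subgraph on the ego and its neighbors, as an adjacency dict."""
    members = adj[ego] | {ego}
    return {v: adj[v] & members for v in members}

def clique_degree(adj, ego, order):
    """Number of cliques of the given order that contain `ego`, using only its egonet."""
    h = egonet(adj, ego)
    neighbors = sorted(h[ego])
    # Any clique through the ego is the ego plus (order - 1) mutually adjacent neighbors.
    return sum(
        1
        for group in combinations(neighbors, order - 1)
        if all(v in h[u] for u, v in combinations(group, 2))
    )

# Hypothetical toy graph: a 4-clique on {1, 2, 3, 4} plus a pendant edge 4-5.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Step 1a: uniform independence sample of n=2 nodes, without replacement.
rng = random.Random(0)
sampled = rng.sample(sorted(adj), k=2)

# Steps 1b and 2: fetch each egonet and count the order-3 cliques through the ego.
for ego in sampled:
    print(ego, clique_degree(adj, ego, order=3))
```

Because every clique containing a node lies entirely inside that node's egonet, the per-egonet counts need no access to the rest of the graph.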
Labeling of neighbors
Vj, X[Vj], G[Vj]
• Distinct Clique Counting (CC)
– labeled neighbors
• Clique Degree Sums (CDS)
– unlabeled neighbors
[Figure: graph G with vertices 1-9 and the egonets of the n=2 sampled nodes for C3, shown once with labeled neighbors and once with unlabeled neighbors]
Clique Degree Sums
unlabeled neighbors
The Order-i Clique Degree dij is the number of i-cliques that node j belongs to
[Figure: graph G(V,E) with vertices 1-8 and egonet H8; for C3, d38 = 2 (i=3, j=8)]
Clique Degree Sums
unlabeled neighbors
Di is the Order-i Clique Degree Sum: the sum, over all nodes j, of the number of i-cliques that node j belongs to
$D_i = \sum_{j \in V} d_{ij}$
[Figure: graph G(V,E) with vertices 1-8, illustrated for C3]
D3 = d31 + d32 + d33 + d34 + d35 + d36 + d37 + d38
D3 = 1 + 1 + 0 + 1 + 2 + 1 + 1 + 2
D3 = 9
D3 = 3C3
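The step from D3 = 9 to D3 = 3C3 follows from a double-counting argument: each order-i clique is counted once by each of its i vertices. A short derivation, where the symbol $\mathcal{K}_i$ for the set of order-i cliques is notation assumed here for illustration:

```latex
\begin{align*}
D_i &= \sum_{j \in V} d_{ij}
     = \sum_{j \in V} \sum_{K \in \mathcal{K}_i} \mathbf{1}[j \in K] \\
    &= \sum_{K \in \mathcal{K}_i} \sum_{j \in V} \mathbf{1}[j \in K]
     = \sum_{K \in \mathcal{K}_i} i
     = i \, C_i
\end{align*}
```

For the example above, D3 = 9 gives C3 = 9 / 3 = 3.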
Clique Degree Sums
unlabeled neighbors
Sum over all nodes of the number of i-cliques that node j belongs to:
$C_i = \frac{1}{i} \sum_{j \in V} d_{ij}$
Sum over sampled nodes S, weighted by the node j inclusion probability $\pi_j$:
$\hat{C}_i = \frac{1}{i} \sum_{j \in S} \frac{d_{ij}}{\pi_j}$
$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator
The same form applies to Cu, using the number of u-cliques that node j belongs to
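A minimal sketch of the Clique Degree Sums estimator, assuming uniform independence sampling without replacement so that every node has inclusion probability n/N. The clique degrees are the d3j values from the D3 example; the sample, the function name `cds_estimate` and the exhaustive check are illustrative assumptions.

```python
from itertools import combinations

def cds_estimate(clique_degrees, inclusion_probs, order):
    """Horvitz-Thompson estimate of C_i from the clique degrees of sampled nodes."""
    weighted_sum = sum(d / p for d, p in zip(clique_degrees, inclusion_probs))
    return weighted_sum / order

# Order-3 clique degrees d_3j for nodes 1..8, taken from the D3 example.
d3 = {1: 1, 2: 1, 3: 0, 4: 1, 5: 2, 6: 1, 7: 1, 8: 2}
N, n, order = len(d3), 4, 3
pi = n / N  # inclusion probability under uniform sampling without replacement

# One hypothetical sample of n=4 nodes.
sample = [1, 3, 5, 8]
one_estimate = cds_estimate([d3[j] for j in sample], [pi] * n, order)

# Design-unbiasedness check: average the estimate over every possible sample.
all_estimates = [
    cds_estimate([d3[j] for j in s], [pi] * n, order)
    for s in combinations(sorted(d3), n)
]
mean_estimate = sum(all_estimates) / len(all_estimates)
print(one_estimate, mean_estimate)
```

A single sample can miss the true value (here 10/3 instead of 3), but the average over all 70 possible samples equals C3 = 3 up to floating-point rounding.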
Clique Degree Sums
Estimator Variance
We can use Horvitz-Thompson theory to derive unbiased estimators of the variance of $\hat{C}_i$ and $\hat{C}_u$, expressed through the node inclusion probability and the joint node inclusion probability
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling
• Without replacement
• With replacement
[Equation not recovered: variance estimator for Uniform Independence Sampling without replacement, with sums over sampled nodes and all nodes, the node inclusion probability and the joint node inclusion probability]
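For uniform independence sampling without replacement, the standard Horvitz-Thompson variance estimator of a total reduces to the textbook form N²(1 - n/N)s²/n, where s² is the sample variance of the observed values. The sketch below applies that textbook result to the clique degree sum and divides by i²; it is an illustration under that assumption, not necessarily the exact expression from the slides, and the function name and sample are hypothetical.

```python
from statistics import variance

def cds_variance_estimate_uis_wor(clique_degrees, population_size, order):
    """Estimated variance of the CDS estimate under uniform sampling without replacement."""
    n = len(clique_degrees)
    big_n = population_size
    s2 = variance(clique_degrees)  # sample variance with n - 1 in the denominator
    # Textbook variance estimate for the estimated total (N/n) * sum(d).
    var_total = big_n**2 * (1 - n / big_n) * s2 / n
    # The clique count estimate is the estimated total divided by the clique order.
    return var_total / order**2

# Hypothetical sample: order-3 clique degrees of n=4 nodes drawn from N=8.
sampled_degrees = [1, 0, 2, 2]
var_hat = cds_variance_estimate_uis_wor(sampled_degrees, population_size=8, order=3)
print(var_hat)
```

The factor (1 - n/N) is the finite population correction: the estimated variance shrinks to zero as the sample approaches the full node set.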
Distinct Clique Counting
labeled neighbors
$\hat{C}_i = \sum_{k=1}^{c_i} \frac{1}{\pi_k}$
where $c_i$ is the number of distinct i-cliques in H1,..,Hn and $\pi_k$ is the inclusion probability of i-clique k
$\hat{C}_i$ is a design-unbiased Horvitz-Thompson estimator
• Uniform Independence Sampling
• Weighted Independence Sampling
• Link-trace Sampling
• With replacement
• Without replacement
Distinct Clique Counting
labeled neighbors
[Figure: graph G with N=8 vertices and order-3 cliques (C3) a, b, c; n=4 UIS with replacement; the observed order-3 cliques reduce to 2 distinct order-3 cliques]
$\pi_k = 1 - (1 - 3/8)^4$
$\hat{C}_i = 2 / (1 - (1 - 3/8)^4) \approx 2.36$
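The worked example can be reproduced with a short sketch. It reads the inclusion probability as the chance that at least one of a clique's 3 vertices is drawn in n=4 uniform draws with replacement from N=8 nodes, which matches the expression on the slide. The vertex labels of the observed cliques and the function names are hypothetical.

```python
def inclusion_prob_uis_with_replacement(clique_order, population_size, sample_size):
    """P(at least one clique vertex is drawn) in n uniform draws with replacement."""
    return 1 - (1 - clique_order / population_size) ** sample_size

def dcc_estimate(inclusion_probs):
    """Horvitz-Thompson estimate: one term 1/pi_k per distinct observed clique."""
    return sum(1 / p for p in inclusion_probs)

# Hypothetical observed order-3 cliques across the sampled egonets.
# Labeled neighbors make it possible to recognize the same clique seen twice.
observed = [frozenset({5, 6, 7}), frozenset({1, 2, 8}), frozenset({5, 6, 7})]
distinct = set(observed)

pi_k = inclusion_prob_uis_with_replacement(clique_order=3, population_size=8, sample_size=4)
estimate = dcc_estimate([pi_k] * len(distinct))
print(round(pi_k, 4), round(estimate, 2))  # 0.8474 2.36
```

Without neighbor labels the duplicate observation could not be removed, which is why Distinct Clique Counting needs labeled neighbors.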
Computational complexity
• Space complexity to count Ci or Cu
– O(1) for Clique Degree Sums Method
– O(ci) or O(cu) for Distinct Clique Counting Method
• Time complexity
– from $O(3^{N/3})$ to $O(n \cdot 3^{D/3})$, where N is the graph size, D is the maximum degree, and n is the sample size
– from $O(n \cdot 3^{D/3})$ to $O(3^{D/3})$ via parallel computations per egonet
Benefits of our methodology
• Full knowledge of graph not required
• Fast estimation for massive known graphs
• Estimation or exact computation easily
parallelizable for massive known graphs
• Estimation with or without neighbor labels
• Supports vertex attributes
• Supports a variety of sampling designs
Results
Simulation Results
Facebook New Orleans
Egonet sample size n=1,000
Uniform independence sampling, without replacement
1000 simulations
Error metric: Normalized Mean Absolute Error
[Figures: simulation results for Clique Degree Sums and Distinct Clique Counting]
Which estimation method to use?
Heuristic
Average Edge Count = (All edges between egos and neighbors) / (Unique edges between egos and neighbors)
[Figure: graph G with N=8 vertices and the egonets of the n=3 sampled egos a, b, c]
Average Edge Count = 9 / 6 = 1.5
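One way to compute the heuristic is sketched below, under the assumption that "edges between egos and neighbors" means the ego-to-neighbor edges of each sampled egonet, counted once per ego in the numerator and once overall in the denominator. The toy graph, the chosen egos and the function name are hypothetical and do not reproduce the 9/6 example.

```python
def average_edge_count(adj, egos):
    """Ratio of all ego-neighbor edges (with repeats) to unique ego-neighbor edges."""
    # An undirected edge is stored as a frozenset so (u, v) and (v, u) coincide.
    all_edges = [frozenset((ego, nb)) for ego in egos for nb in adj[ego]]
    return len(all_edges) / len(set(all_edges))

# Hypothetical toy graph: a 4-clique on {1, 2, 3, 4} plus a pendant edge 4-5.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# Egos 1, 4 and 5 are adjacent to each other, so edges 1-4 and 4-5 are seen twice.
ratio = average_edge_count(adj, egos=[1, 4, 5])
print(ratio)
```

A ratio close to 1 means the sampled egonets barely overlap; larger values mean the same edges are observed repeatedly.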
Estimation Results
Facebook ‘09
• Facebook ‘09 crawled dataset[1]
– 36,628 unique egonets
[1] M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou, “Walking in Facebook: A Case
Study of Unbiased Sampling of OSNs”, IEEE INFOCOM 2010.
Estimation Results
vertex attributes, Facebook ‘09
• Complemented dataset with gender attributes
– about 6 million users
• Unbiased estimation methods of clique distributions
– Clique Degree Sums
– Distinct Clique Counting
• Facebook cliques
• Future work
– support estimation of any subgraphs (beyond cliques)
References
[1] M. Gjoka, E. Smith, C. T. Butts, “Estimating Clique Composition and Size Distributions from Sampled Network Data”, IEEE NetSciCom '14.
[2] Facebook datasets: http://odysseas.calit2.uci.edu/research/osn.html
[3] Python code for Clique Estimators: http://tinyurl.com/clique-estimators
Thank you!