Sampling from Large Graphs

Motivation
• Our purpose is to analyze and model social
networks
– An online social network graph is composed of
millions of nodes and edges
– To analyze it we have to store the whole graph in the computer's memory
– Sometimes this is impossible
– Even when it is possible, computing even basic graph properties is extremely time-consuming
– Thus we need to extract a small sample of the graph
and analyze it
Problem
• Given a huge real graph, how can we
derive a representative sample?
– Which sampling method to use?
– How small can the sample size be?
– How do we measure success?
Problem
• What do we compare against?
– Scale down sampling:
• Given a graph G with n nodes, derive a sample
graph G’ with n’ nodes (n’ << n) that will be most
similar to G
– Back in time sampling:
• Let G_n’ denote graph G at the point in time when it had n’ nodes
• Find a sample S on n’ nodes that is most similar to G_n’ (graph G when it had the same size as S)
Evaluation Techniques
• Criteria for scale down sampling
– In-degree distribution
– Out-degree distribution
– Distribution of sizes of weakly connected components (WCC)
– Distribution of sizes of strongly connected components
– Hop plot: number of reachable pairs of nodes at distance h
– Hop plot on the largest WCC
– Distribution of the clustering coefficient
– Distribution of singular values of the graph adjacency matrix versus the rank
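As an illustration of the first two criteria, here is a minimal Python sketch that tabulates in-degree and out-degree distributions. It assumes the graph is given as a list of directed (source, target) pairs; the function name and the tiny edge list are hypothetical.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_dist, out_dist), each mapping a degree to the
    number of nodes that have that degree."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        nodes.update((u, v))
    # Counter returns 0 for nodes with no incoming (or outgoing) edges.
    in_dist = Counter(in_deg[v] for v in nodes)
    out_dist = Counter(out_deg[v] for v in nodes)
    return in_dist, out_dist

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2
in_dist, out_dist = degree_distributions([(0, 1), (0, 2), (1, 2)])
print(sorted(in_dist.items()), sorted(out_dist.items()))
```

The same tabulation would be run on the original graph and on the sample, and the two distributions then compared.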
Evaluation Techniques
• Criteria for back in time sampling
– Densification Power Law:
• Number of edges vs number of nodes over time
– The effective diameter of the graph over time
• Observed to shrink and then stabilize over time
– Normalized size of the largest WCC over time
– Average clustering coefficient over time
– Largest singular value of graph adjacency
matrix over time
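The densification power law is commonly written as e(t) ∝ n(t)^a, where n(t) and e(t) are the node and edge counts at time t. A rough way to estimate the exponent a, assuming we have snapshots of both counts, is a least-squares fit on a log-log scale. The function name and the snapshot values below are made up for illustration.

```python
import math

def densification_exponent(node_counts, edge_counts):
    """Least-squares slope of log(edges) against log(nodes)."""
    xs = [math.log(n) for n in node_counts]
    ys = [math.log(e) for e in edge_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical snapshots of a growing graph
nodes_over_time = [100, 1_000, 10_000]
edges_over_time = [400, 8_000, 160_000]
print(densification_exponent(nodes_over_time, edges_over_time))
```

A back-in-time sample would be judged by how closely its fitted exponent tracks that of the true graph.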
Statistical Tests
• Comparing graph patterns using the Kolmogorov-Smirnov D-statistic
– Measure the agreement between two distributions using D = max_x |F’(x) – F(x)|
– Here F and F’ are the two cumulative distribution functions
– Does not address the issue of scaling
– Just compares the shape of the distributions
• Comparing graph patterns using the visiting probability
– For each node u ∈ G, calculate the probability of visiting node w ∈ G
– Use the Frobenius norm to measure the difference in visiting probabilities
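The D-statistic above can be sketched in a few lines of Python. This version assumes both distributions are given as raw samples (for example, two degree sequences) and compares their empirical CDFs; the function name and the example sequences are illustrative only.

```python
from bisect import bisect_right

def ks_d_statistic(sample_a, sample_b):
    """Return D = max_x |F'(x) - F(x)| for two empirical samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # Empirical CDFs only change at observed values, so it is enough
    # to check the gap at every value seen in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical degree sequences of the original graph and a sample
original_degrees = [1, 1, 2, 2, 3, 5, 8]
sample_degrees = [1, 2, 2, 3]
print(ks_d_statistic(original_degrees, sample_degrees))
```

D is 0 when the two empirical distributions coincide and 1 when their supports do not overlap at all.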
Algorithms
• Sampling by random node selection
– Random node sampling (RN)
• Uniformly at random select a set of nodes
– Random PageRank node sampling (RPN)
• Set the probability of a node being selected into the sample proportional to its PageRank weight
– Random degree node sampling (RDN)
• Set the probability of a node being selected into the sample proportional to its degree
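A minimal Python sketch of uniform and degree-proportional node selection follows. It assumes an undirected graph stored as a dict mapping every node to a list of its neighbors, and it assumes the sample graph is the subgraph induced by the chosen nodes. All function names are hypothetical. PageRank-based selection would follow the same pattern with PageRank scores as the weights.

```python
import random

def random_node_sample(graph, k, rng):
    """Pick k distinct nodes uniformly at random."""
    return set(rng.sample(list(graph), k))

def random_degree_node_sample(graph, k, rng):
    """Pick up to k distinct nodes, each draw proportional to degree."""
    candidates = list(graph)
    weights = [len(graph[v]) for v in candidates]
    chosen = set()
    while len(chosen) < k and any(weights):
        i = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.add(candidates[i])
        weights[i] = 0  # never draw the same node twice
    return chosen

def induced_subgraph(graph, nodes):
    """Keep only the edges whose endpoints are both in the sample."""
    return {v: [w for w in graph[v] if w in nodes] for v in nodes}

# Hypothetical star graph with one isolated node
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0], 4: []}
rng = random.Random(42)
picked = random_degree_node_sample(star, 2, rng)
print(induced_subgraph(star, picked))
```

In this sketch a node of degree 0 has weight 0, so degree-proportional selection never picks it.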
Algorithms
• Sampling by random edge selection
– Random edge sampling (RE)
• Uniformly select edges at random
– Random node-edge sampling (RNE)
• Uniformly at random select a node, then uniformly at random select an edge incident to it
– Hybrid sampling (HYB)
• With probability p perform an RNE step, with probability 1-p perform an RE step
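The three edge-based schemes can be sketched with a single Python function, since p = 0 reduces hybrid sampling to pure random edge sampling and p = 1 to pure random node-edge sampling. This sketch assumes an undirected graph given as a list of (u, v) pairs and stops once a requested number of distinct edges has been collected; the function name and the stopping rule are assumptions made for illustration.

```python
import random

def hybrid_edge_sample(edges, num_edges, p, rng):
    """Collect distinct undirected edges, mixing the two step types."""
    unique = sorted({tuple(sorted(e)) for e in edges})
    adjacency = {}
    for u, v in unique:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    nodes = sorted(adjacency)
    target = min(num_edges, len(unique))
    sample = set()
    while len(sample) < target:
        if rng.random() < p:
            # Node-edge step: uniform node, then uniform incident edge.
            u = rng.choice(nodes)
            edge = (u, rng.choice(adjacency[u]))
        else:
            # Edge step: uniform over all edges.
            edge = rng.choice(unique)
        sample.add(tuple(sorted(edge)))
    return sample

# Hypothetical 4-cycle with one chord
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
print(hybrid_edge_sample(toy_edges, 3, 0.8, random.Random(7)))
```

The sample graph then consists of the selected edges and their endpoints.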
Algorithms
• Sampling by exploration
– Random node neighbor
• Select a node uniformly at random together with all its outgoing neighbors
– Random walk sampling (RW)
• Select a starting node uniformly at random and perform a random walk with restarts
• If we get stuck, randomly select another node to start from
– Random jump sampling (RJ)
• Same as random walk sampling, but with probability p we jump to a random node in the graph
– Forest fire sampling (FF)
• Choose a node u uniformly at random
• Generate a random number z and select z out-links of u that lead to nodes not yet visited
• Apply this step recursively to the node at the end of each selected link
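Random walk and forest fire sampling can be sketched in Python as follows. The sketch assumes the graph is a dict mapping every node to a list of its out-neighbors. Several details are assumptions rather than anything stated on the slides: "stuck" is taken to mean no new node was found in `patience` steps, z is drawn from a geometric distribution controlled by a hypothetical `forward_prob` parameter, and the fire is restarted at a fresh node whenever it dies out.

```python
import random

def random_walk_sample(graph, k, restart_prob, rng, patience=100):
    """Walk from a random start, returning to it with restart_prob."""
    nodes = list(graph)
    k = min(k, len(nodes))
    start = current = rng.choice(nodes)
    visited = {start}
    idle = 0  # steps since the walk last found a new node
    while len(visited) < k:
        if idle >= patience:
            start = current = rng.choice(nodes)  # stuck: new start node
            idle = 0
        elif not graph[current] or rng.random() < restart_prob:
            current = start  # restart (also used at dead ends)
        else:
            current = rng.choice(graph[current])
        if current in visited:
            idle += 1
        else:
            visited.add(current)
            idle = 0
    return visited

def forest_fire_sample(graph, k, forward_prob, rng):
    """Burn outward from random seeds until k nodes are visited."""
    if not 0 <= forward_prob < 1:
        raise ValueError("forward_prob must be in [0, 1)")
    nodes = list(graph)
    k = min(k, len(nodes))
    visited = set()
    while len(visited) < k:
        seed = rng.choice([v for v in nodes if v not in visited])
        visited.add(seed)
        frontier = [seed]
        while frontier and len(visited) < k:
            u = frontier.pop()
            z = 0  # geometric draw: successes before the first failure
            while rng.random() < forward_prob:
                z += 1
            unvisited = [w for w in dict.fromkeys(graph[u]) if w not in visited]
            rng.shuffle(unvisited)
            for w in unvisited[:z]:
                if len(visited) >= k:
                    break
                visited.add(w)
                frontier.append(w)
    return visited

# Hypothetical ring of 10 nodes
ring = {i: [(i + 1) % 10, (i - 1) % 10] for i in range(10)}
print(random_walk_sample(ring, 5, 0.15, random.Random(0)))
print(forest_fire_sample(ring, 5, 0.7, random.Random(0)))
```

Random jump sampling would differ from `random_walk_sample` only in the restart branch, which would move to a uniformly chosen node instead of the starting node.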
Evaluation
• Three groups of algorithms:
– RDN, RJ, RW: biased towards high degree nodes and densely
connected part of the graph
– FF, RPN, RN: not biased towards high degree nodes, match the
temporal densification of the true graph
– RE, RNE, HYB: For small sample size the resulting graph is very
sparsely connected
• Conclusion:
– For the scale down goal, methods based on random walks perform best
– For the back in time goal, the forest fire algorithm performs best
– No single perfect answer to graph sampling
– Experiments showed that a 15% sample is usually enough
Further thoughts
• Is trying to match all properties at once the wrong approach? Maybe we should try matching one at a time
• Test the sampling methods on graphs with weighted or labeled edges
• Current algorithms are extremely slow when the graph is read from a file
– Need to implement better versions of them in order to decrease the I/O cost
Bibliography
• Sampling from large graphs, J. Leskovec and C.
Faloutsos
• Unbiased sampling of Facebook, M. Gjoka, M.
Kurant, C. T. Butts and A. Markopoulou
• What is the real size of a sampled network? The case of the Internet, F. Viger, A. Barrat, L. Dall’Asta, C. Zhang and E. D. Kolaczyk
• Sampling large Internet topologies for simulation
purposes, V. Krishnamurthy, M. Faloutsos, M.
Chrobak, J. Cui, L. Lao and A. G. Percus