Sampling from Large Graphs

Motivation
• Our purpose is to analyze and model social networks
  – An online social network graph is composed of millions of nodes and edges
  – In order to analyze it, we have to store the whole graph in the computer's memory
  – Sometimes this is impossible
  – Even when it is possible, computing even basic graph properties is extremely time consuming
  – Thus we need to extract a small sample of the graph and analyze it instead

Problem
• Given a huge real graph, how can we derive a representative sample?
  – Which sampling method should we use?
  – How small can the sample size be?
  – How do we measure success?

Problem
• What do we compare against?
  – Scale-down sampling:
    • Given a graph G with n nodes, derive a sample graph G′ with n′ nodes (n′ << n) that is most similar to G
  – Back-in-time sampling:
    • Let Gn′ denote graph G at the point in time when it had n′ nodes
    • Find a sample S on n′ nodes that is most similar to Gn′ (i.e., to G when it had the same size as S)

Evaluation Techniques
• Criteria for scale-down sampling
  – In-degree distribution
  – Out-degree distribution
  – Distribution of sizes of weakly connected components (WCCs)
  – Distribution of sizes of strongly connected components (SCCs)
  – Hop plot: number of reachable pairs of nodes at distance h
  – Hop plot on the largest WCC
  – Distribution of the clustering coefficient
  – Distribution of the singular values of the graph adjacency matrix versus the rank

Evaluation Techniques
• Criteria for back-in-time sampling
  – Densification Power Law: number of edges vs. number of nodes over time
  – Effective diameter of the graph over time (observed to shrink and stabilize over time)
  – Normalized size of the largest WCC over time
  – Average clustering coefficient over time
  – Largest singular value of the graph adjacency matrix over time

Statistical Tests
• Comparing graph patterns using the Kolmogorov–Smirnov D-statistic
  – Measure the agreement between two distributions using D = max_x |F′(x) − F(x)|, where F and F′ are the two cumulative distribution functions
  – Does not address the issue of scaling; it only compares the shapes of the distributions
• Comparing graph patterns using the visiting probability
  – For each node u ∈ G, calculate the probability of visiting node w ∈ G
  – Use the Frobenius norm to calculate the difference in visiting probabilities

Algorithms
• Sampling by random node selection
  – Random Node (RN) sampling: select a set of nodes uniformly at random
  – Random PageRank Node (RPN) sampling: set the probability of a node being selected into the sample proportional to its PageRank weight
  – Random Degree Node (RDN) sampling: set the probability of a node being selected into the sample proportional to its degree

Algorithms
• Sampling by random edge selection
  – Random Edge (RE) sampling: select edges uniformly at random
  – Random Node–Edge (RNE) sampling: select a node uniformly at random, then uniformly at random select an edge incident to it
  – Hybrid (HYB) sampling: with probability p perform RNE sampling, with probability 1 − p perform RE sampling

Algorithms
• Sampling by exploration
  – Random Node Neighbor (RNN) sampling: select a node uniformly at random together with all of its out-going neighbors
  – Random Walk (RW) sampling: select a node uniformly at random and perform a random walk with restarts; if we get stuck, randomly select another node to start from
  – Random Jump (RJ) sampling: same as random walk sampling, but with probability p we jump to a new node
  – Forest Fire (FF) sampling:
    • Choose a node u uniformly at random
    • Generate a random number z and select z out-links of u that have not yet been visited
    • Apply this step recursively to all z selected links

Evaluation
• Three groups of algorithms:
  – RDN, RJ, RW: biased towards high-degree nodes and the densely connected part of the graph
  – FF, RPN, RN: not biased towards high-degree nodes; match the temporal densification of the true graph
  – RE, RNE, HYB: for small sample sizes, the resulting graph is very sparsely connected
• Conclusions:
  – For the scale-down goal, methods based on random walks perform best
  – For the back-in-time goal, the Forest Fire algorithm performs best
  – There is no single perfect answer to graph sampling
  – Experiments showed that a 15% sample is usually enough

Further Thoughts
• Is trying to match all properties at once the wrong approach? Maybe we should try matching one property at a time
• Test the sampling methods on graphs with weighted or labeled edges
• The current algorithms are extremely slow when the graph is read from a file; better implementations are needed to decrease the I/O cost

Bibliography
• Sampling from large graphs, J. Leskovec and C. Faloutsos
• Unbiased sampling of Facebook, M. Gjoka, M. Kurant, C. T. Butts and A. Markopoulou
• What is the real size of a sampled network? The case of the Internet, F. Viger, A. Barrat, L. Dall'Asta, C. Zhang and E. D. Kolaczyk
• Sampling large Internet topologies for simulation purposes, V. Krishnamurthy, M. Faloutsos, M. Chrobak, J. Cui, L. Lao and A. G. Percus
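As a concrete illustration of the exploration-based methods discussed earlier, the Forest Fire step ("choose a node u, draw a random number z, burn z unvisited out-links, recurse") can be sketched as below. This is a minimal sketch, not the authors' implementation: the graph is assumed to be an out-adjacency dict, and the draw of z is modeled here as a geometric variable governed by a hypothetical forward-burning probability `p_f` (the exact distribution of z varies between formulations).

```python
import random

def forest_fire_sample(adj, sample_size, p_f=0.7, seed=None):
    """Forest Fire sampling sketch.

    adj: dict mapping node -> list of out-neighbors (assumed representation).
    p_f: hypothetical forward-burning probability; z is drawn geometrically,
         so on average p_f / (1 - p_f) out-links are burned per node.
    Requires sample_size <= number of nodes in adj.
    """
    rng = random.Random(seed)
    nodes = list(adj)
    visited = set()
    while len(visited) < sample_size:
        # Pick a fresh "ambassador" node uniformly at random to (re)start the fire.
        start = rng.choice(nodes)
        if start in visited:
            continue
        visited.add(start)
        frontier = [start]
        while frontier and len(visited) < sample_size:
            u = frontier.pop()
            # Draw z ~ Geometric(p_f): count successes before the first failure.
            z = 0
            while rng.random() < p_f:
                z += 1
            # Burn up to z not-yet-visited out-links of u, then recurse via the frontier.
            candidates = [w for w in adj[u] if w not in visited]
            rng.shuffle(candidates)
            for w in candidates[:z]:
                visited.add(w)
                frontier.append(w)
    return visited
```

The sampled node set induces the sample graph; restarting from a new ambassador whenever the fire dies out mirrors the "if we get stuck, pick another node" rule of random walk sampling.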
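The Kolmogorov–Smirnov D-statistic used above to compare graph patterns (e.g., two degree distributions) is straightforward to compute from two empirical samples. A minimal sketch (the function name is my own): evaluate both empirical CDFs at every observed value and take the largest gap.

```python
import bisect

def ks_d_statistic(xs, ys):
    """D = max_x |F'(x) - F(x)| over the empirical CDFs of two samples."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value,
    # so it suffices to check each distinct sample point.
    for x in sorted(set(xs) | set(ys)):
        # Empirical CDF value: fraction of observations <= x.
        fx = bisect.bisect_right(xs, x) / len(xs)
        fy = bisect.bisect_right(ys, x) / len(ys)
        d = max(d, abs(fx - fy))
    return d
```

As the slides note, D compares only the shapes of the two distributions: identical samples give D = 0, fully separated samples give D = 1, and absolute scale plays no role.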