Kronecker Graphs: An Approach to Modeling Networks
Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, Zoubin Ghahramani
Presented by Eric Wang
4/21/2011

Introduction
• Modeling real-world graphs is important for two reasons:
– (1) A model can readily generate hypothetical graphs for extrapolation and hypothesis testing.
– (2) It gives users a useful framework for studying the network properties that generative models should obey to be realistic.
• In this paper, the authors propose a generative network model called the Kronecker graph that obeys all the static network patterns exhibited in real-world graphs.
• The three main goals of the paper are:
– (1) Naturally produce networks in which many properties of real networks emerge.
– (2) Fast and scalable parameter estimation.
– (3) Generate realistic-looking networks that match the statistical properties of real networks.

Introduction
• To address the issue of efficient large-scale parameter estimation, the authors introduce a maximum-likelihood-based algorithm called KRONFIT.
• The fitted model has several interesting applications:
– Network structure.
– Null-model.
– Simulations.
– Extrapolations.
– Sampling.
– Graph similarity.
– Graph visualizations.
– Anonymization.

Network Properties
• Degree distribution: The degree of a node is the number of connections it has. The distribution is heavy tailed and is given by $N_d \propto d^{-\gamma}$, where $N_d$ is the number of nodes with degree $d$ and $\gamma > 0$.
• Small diameter: A graph with diameter D is one in which every pair of nodes can be connected by a path of at most D edges. This value tends to be small for most large real-world graphs.
• Hop-plot: Plots the reachable pairs $g(h)$ against the number of hops $h$, where $g(h)$ is the fraction of connected pairs whose shortest connecting path has at most $h$ hops.

Network Properties
• Scree plot: A plot of the eigenvalues or singular values of the graph adjacency matrix versus their rank, on a logarithmic scale. This is found to approximately obey a power law.
• Densification power law: States that real networks tend to sprout many more edges than nodes, and thus grow denser. The relationship between the number of edges E(t) and the number of nodes N(t) at time t is $E(t) \propto N(t)^a$, where $a$ is typically larger than 1.
• Shrinking diameter: The effective diameter of real-world networks tends to shrink and then stabilize as the network grows.

The Challenge of Parameter Estimation
• The “standard” approach to estimating network models is the exponential random graph, or p*, model. It defines a log-linear model over all possible graphs G,
  $p(G \mid \theta) \propto \exp\left(\theta^{T} s(G)\right)$,
  where s(.) is a set of functions that define summary statistics for the structural features of the network.
• p* models are useful for modeling small networks and local features, but are prohibitively expensive when the number of nodes is large (>100).
• Another challenge is finding the correspondence between a synthetic node and its real-world counterpart.

Symbols and Notation

Kronecker Graph
• The Kronecker product C of two matrices A and B, of sizes n x m and n’ x m’, is given by
  $C = A \otimes B = \begin{pmatrix} a_{1,1}B & \cdots & a_{1,m}B \\ \vdots & \ddots & \vdots \\ a_{n,1}B & \cdots & a_{n,m}B \end{pmatrix}$
• The Kronecker product of two graphs is simply the Kronecker product of their corresponding adjacency matrices.
• A crucial observation: an edge $(X_{ij}, X_{kl})$ is in $G \otimes H$ if and only if $(X_i, X_k)$ is an edge of $G$ and $(X_j, X_l)$ is an edge of $H$, where $X_{ij}$ denotes the node of $G \otimes H$ built from node $X_i$ of $G$ and node $X_j$ of $H$.

Kronecker Graph
• Define the kth power of $K_1$ as
  $K_1^{[k]} = K_k = \underbrace{K_1 \otimes K_1 \otimes \cdots \otimes K_1}_{k \text{ times}} = K_{k-1} \otimes K_1$

Kronecker Graph
• Formally, a Kronecker graph of order k is defined by the adjacency matrix $K_1^{[k]}$, where $K_1$ is the Kronecker initiator adjacency matrix.
• Several more examples of Kronecker graphs

Analysis of Kronecker Graphs
• A major advantage of Kronecker graphs is the ability to prove analytical results regarding graph properties, including degree distributions, diameters, eigenvalues, eigenvectors, and time evolution.
• Degree distribution: Kronecker graphs have multinomial degree distributions, for both in- and out-degrees. A careful choice of the initiator graph makes the resulting multinomial behave like a power law.
• Multinomial eigenvalue and eigenvector distributions: The eigenvalues and eigenvectors of a Kronecker graph follow multinomial distributions.

Analysis of Kronecker Graphs
• Connectivity of Kronecker graphs: If at least one of G or H is a disconnected graph, then $G \otimes H$ is also disconnected. Further, if both G and H are connected but bipartite, then $G \otimes H$ is disconnected, and each of the two connected components is again bipartite.
• Densification power law: Kronecker graphs follow the densification power law (DPL), with densification exponent $a = \log(E_1) / \log(N_1)$, where $N_1$ and $E_1$ are the numbers of nodes and edges of the initiator $K_1$.
• Diameter: If $K_1$ has diameter D and a self-loop on every node, then for every k, the graph $K_k$ also has diameter D.
• Effective diameter: If $K_1$ has diameter D and a self-loop on every node, then for every q, the q-effective diameter of $K_k$ approaches D from below as k increases.

Stochastic Kronecker Graphs
• This particular construction of a stochastic Kronecker graph relaxes the assumption of a binary initiator matrix, and instead allows each entry to take a value in the interval [0,1].
• Later, the authors introduce a highly efficient hierarchical sampling scheme to generate an instance of a Kronecker graph.

Stochastic Kronecker Graphs
• The full probability matrix of a stochastic Kronecker graph is highly inefficient to store in memory, so it is useful to compute the probability $p_{uv}$ of an edge (u,v) occurring in the kth Kronecker graph in O(k) time:
  $p_{uv} = \prod_{i=0}^{k-1} P_1\left[ \left\lfloor \frac{u-1}{N_1^{i}} \right\rfloor (\bmod\ N_1) + 1,\ \left\lfloor \frac{v-1}{N_1^{i}} \right\rfloor (\bmod\ N_1) + 1 \right]$
  where $P_1$ is the $N_1 \times N_1$ stochastic initiator matrix.
• The recursive nature of stochastic Kronecker graphs also lends itself to a fast generative procedure. Naively generating a stochastic Kronecker graph K on N nodes takes $O(N^2)$ time, while the proposed method takes time linear in the number of edges of the graph.

Stochastic Kronecker Graphs
• Following Figure c, the authors recursively choose subregions of the graph following the initiator matrix (Figure a) until they reach a single cell (after k steps), and place an edge there.

Stochastic Kronecker Graphs
• Another question that has to be answered is the number of edges in the graph (to be generated).
The authors state that the number of edges in the kth stochastic Kronecker graph is normally distributed with mean $\left(\sum_{i,j} \theta_{ij}\right)^{k}$, where $\theta_{ij}$ are the entries of the initiator matrix.
• Collisions of edges are rare (1% of edges collide) and simply merit a re-insertion.
• Due to this slight error, the proposed generative method does not yield exact samples from the graph parameter distribution, but the authors state that the end result is indistinguishable from graphs generated using the exact naïve procedure.

Simulations of Kronecker Graphs
• Citation network (CIT-HEP-TH)

Simulations of Kronecker Graphs
• Autonomous systems (AS-ROUTEVIEWS)

Kronecker Graph Model Estimation
• In this paper, the authors choose to find an initiator matrix that yields a synthetic Kronecker graph K with the same statistical properties as a real graph G.
• This contrasts with existing parameter estimation schemes that directly optimize the match on a chosen set of statistical properties; that approach is problematic because it is difficult to specify a set of properties that accurately describes a graph.
• The authors choose a maximum likelihood approach, which presents three challenges:
– Model selection/overfitting.
– Node correspondence.
– Likelihood estimation.

Kronecker Graph Model Estimation
• Consider a graph G on $N = N_1^{k}$ nodes, and an $N_1 \times N_1$ stochastic Kronecker graph initiator matrix $\Theta$ that we aim to estimate.
• Using $\Theta$ we generate $P = \Theta^{[k]}$ and want to solve $\arg\max_{\Theta} P(G \mid \Theta)$, where $\theta_{ij}$ are the parameters of $\Theta$.

Kronecker Graph Model Estimation
• Since the mapping of nodes in G to those in K is unknown, all possible permutations must be considered. Let $\sigma$ denote a particular mapping of nodes onto the adjacency matrix $P$.
• Define the log likelihood as
  $l(\Theta) = \log P(G \mid \Theta) = \log \sum_{\sigma} P(G \mid \Theta, \sigma)\, P(\sigma)$
  where
  $P(G \mid \Theta, \sigma) = \prod_{(u,v) \in G} P[\sigma_u, \sigma_v] \prod_{(u,v) \notin G} \left(1 - P[\sigma_u, \sigma_v]\right)$
  since the presence of any given edge is Bernoulli distributed.

Kronecker Graph Model Estimation
• Now the question becomes: how can we find $\hat{\Theta}$, the best parameters of the initiator matrix?
• Naively, a grid search could be used, but it is highly inefficient. In fact, even a naïve gradient descent algorithm still requires $O(N!\,N^2)$ time.
• The reason for this inefficiency is that, without a clever way of obtaining a good node permutation $\sigma$, we must sum over all possible permutations, requiring N! time.
• The authors next introduce a Metropolis sampling approach that performs the task in linear time.

Kronecker Graph Model Estimation
• The permutation distribution is
  $P(\sigma \mid G, \Theta) = \frac{P(\sigma, G, \Theta)}{Z}$
  where Z is a computationally intractable normalizing constant.
• However, Z cancels out when the ratio of the likelihoods of two permutations $\sigma$ and $\sigma'$ is computed.

Kronecker Graph Model Estimation
• The authors define two different proposal distributions, SwapNodes and SwapEdgeEndpoints, to generate a permutation $\sigma'$ from the current permutation $\sigma$.
• Empirically, the authors find that they obtain the best performance by executing SwapNodes with probability $\omega$ and SwapEdgeEndpoints with probability $1 - \omega$.
• Using this approach, the sampling can be done in O(kN) time.

Kronecker Graph Model Estimation
• Like sampling the permutations, computing the likelihood of a given graph naively is quadratic in the number of nodes.
• The authors exploit the sparseness of real graphs by first calculating the likelihood of an empty graph (with no edges), and then correcting for the edges that actually appear in G.
• The log-likelihood of an empty graph is approximated by a second-order Taylor expansion, and is then corrected by subtracting the “no-edge” likelihood and adding the “edge” likelihood for each edge that appears in G.

Experiments on Real and Synthetic Data
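
As a small synthetic illustration of the stochastic Kronecker graph slides, the following is a minimal Python sketch, not the authors' implementation. It builds the kth Kronecker power of an initiator, computes a single edge probability in O(k) steps without forming the full matrix, and samples edges by recursively descending into initiator cells, redrawing on collisions. The function names (`kron`, `kron_power`, `edge_prob`, `sample_edges`), the 0-indexed node labels, the initiator values, and the choice to place exactly the mean number of edges (rather than drawing the count from a normal distribution) are all assumptions made for illustration.

```python
import random


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [a_ij * b_rc for a_ij in a_row for b_rc in b_row]
        for a_row in a
        for b_row in b
    ]


def kron_power(theta, k):
    """kth Kronecker power of the initiator: theta (x) theta (x) ... (k factors)."""
    p = theta
    for _ in range(k - 1):
        p = kron(p, theta)
    return p


def edge_prob(theta, k, u, v):
    """Probability of edge (u, v), 0-indexed, in O(k) steps.

    Each base-N1 digit of u and v selects one initiator entry; the edge
    probability is the product of the k selected entries.
    """
    n1 = len(theta)
    p = 1.0
    for _ in range(k):
        p *= theta[u % n1][v % n1]
        u //= n1
        v //= n1
    return p


def sample_edges(theta, k, n_edges, seed=0):
    """Place n_edges distinct edges by recursive descent into initiator cells."""
    n1 = len(theta)
    cells = [(i, j) for i in range(n1) for j in range(n1)]
    weights = [theta[i][j] for i, j in cells]
    # Guard against an endless loop: only cells with nonzero weight are reachable.
    if n_edges > sum(w > 0 for w in weights) ** k:
        raise ValueError("more edges requested than cells with nonzero probability")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u = v = 0
        # Choose one initiator cell per level, proportionally to its entry.
        for i, j in rng.choices(cells, weights=weights, k=k):
            u = u * n1 + i
            v = v * n1 + j
        edges.add((u, v))  # a collision leaves the set unchanged, so we redraw
    return edges


# Hypothetical 2x2 initiator; the values are illustrative, not from the paper.
theta = [[0.9, 0.5], [0.5, 0.1]]
k = 3
expected_edges = sum(map(sum, theta)) ** k  # mean edge count: (sum of entries)^k
edges = sample_edges(theta, k, round(expected_edges))
```

Because every Kronecker factor is the same initiator, the order in which the base-N1 digits are consumed does not change the product, which is why `edge_prob` can peel digits from the low end while `sample_edges` builds them from the high end.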