Finding Top-k Shortest Path Distance Changes in an Evolutionary Network
Manish Gupta (UIUC), Charu Aggarwal (IBM), Jiawei Han (UIUC)
SSTD 2011, 24th August 2011

Networks as evolutionary graphs
• Social networks: new users join, new friendships are created.
• Bibliographic networks: new authors publish more papers, more collaborations are done.
• Transportation/road networks: new roads are constructed.
• Ad hoc networks: army vehicles change positions very frequently, new messages are transmitted.

Analysis of evolutionary networks
• Community formation, using clustering techniques
• Metrics to study evolution: merge/split
• Information diffusion across evolutionary networks
• Link prediction tasks
• Queries over evolving networks

Queries over evolving networks
• Updating the shortest path distance between two nodes as the edge weights change. E.g., in computer networks, routers need to update their shortest path trees when a link goes down.
• Given a time-dependent network (edge weights are functions of time), how to compute SPD(u, v, t).
• Queries incorporating max flow constraints.

Transportation Planning Problem
• Given the current set of roads, we want to overlay a network of new roads.
• Civil engineers propose two plans, A and B, with different sets of new roads.
• Which plan is better?
• Plan A brings cities X and Y very close. X produces a lot of product P while Y has a rich demand for product P.
• Plan A actually brings lots of “economically important pairs” of cities close to each other. Select plan A over B.

Our problem
• Given an evolutionary network with two snapshots G1 and G2.
• Compute the top few node pairs with maximum shortest path distance change across the two snapshots.
• For example, across 2005 and 2011, the distance between which pair of cities in Illinois decreased the most, thanks to the new roads built in this time period?

Naïve Approach
• Compute shortest path distance between every pair of nodes for snapshot G1.
• Compute shortest path distance between every pair of nodes for snapshot G2.
• Compute distance change for every pair of nodes.
• Sort the distance change vector.
• Return node pairs corresponding to the top few distance change values.
• Highly inefficient solution!

Solution
• We experiment on three datasets: DBLP co-authorship graph, IMDB co-starring graph and Ontario province road network.
• Throw in more CPUs!
• Shortest path algorithms are easily parallelizable. Run single source shortest path computations across thousands of machines.
• On the Ontario road network dataset, it took around 400 CPU days!
OR
• Use our algorithm.
• Our methods are ~50-100X faster than the baseline.

Outline
• Smartly choose a seed set of a few source nodes to run the single source shortest path algorithm from: Incidence Algorithm.
• Improve the accuracy of the Incidence Algorithm by intelligently expanding the seed set using the Edge Importance Estimation Algorithm.
• Generalize the problem to a node ranking problem.
• Suggest node ranking strategies.
• Experimental results and analysis.

Incidence Algorithm
• Maximum distance change will happen for node pairs consisting of nodes on which new edges or edges with changed weights are incident.
• Let V’ be the set of nodes with new edges.
• Algorithm: Run a single source SPD algorithm from each node in V’ on both snapshots, compute the difference (change), sort and return the top k.

Is Incidence Algorithm accurate?
• For top 1, yes.
• But not for top k (k ≠ 1).
• Δd(a, d) could be greater than Δd(b′, c′).
• Multiple edges can combine together and cause much more distance change than a single edge does.
• Solution: To get better accuracy, expand the seed set.

How to expand the seed set (V’)?
• Consider the neighbors of all the nodes currently in V’ as potential candidates.
• Expand to a promising neighbor.
• In particular, expand to a neighbor node a if the edge that connects a to the current set V’ has relatively high importance, relative to other edges incident on node a.
• Terminate when the top k node pairs don’t change.

Edge importance number
• The importance number of an edge is the probability that the edge will lie on a randomly chosen shortest path tree in the graph.
• How to compute the edge importance number for edge e?
• First find all shortest path trees and then find how many of those trees contain edge e.
• Too expensive! As inefficient as the naïve solution itself!
• Hence we estimate the edge importance number using a randomized algorithm.

Edge Importance Estimation Algorithm
• Randomly sample a few nodes from the graph.
• Using each of these nodes S as source, obtain a shortest path tree T using an SPD algorithm (e.g. Dijkstra).
• For each tree T, perform distance labeling.
• Alternative tight edge: an alternative edge which could replace an existing edge from T to give T’.
• For each edge in T, obtain multiple T’ by replacing a tight edge with an alternative tight edge.
• Edge importance of an edge with respect to T is proportional to the number of descendants.
• Aggregate I(edge) across all the different SPTs.

Generalizing the problem
• Naïve solution: Use all nodes in both snapshots.
• Incidence algorithm: Use only nodes in V’.
• Generalized solution?
• Node ranking problem.
• Rank nodes such that running Dijkstra’s algorithm from just the top few nodes provides high accuracy for the “top-k node pairs with max distance change” problem.

How to rank nodes?
• Random: Randomly select nodes from the graph.
• RandomNWNE: Randomly select nodes from seed set V’ (nodes with new edges).
• Edge Weight Based Ranking (EWBR).
• Edge Weight Change Based Ranking (EWCBR).

How to rank nodes?
• Importance Number Based Ranking (INBR)
• Importance Number Change Based Ranking (INCBR)
• Ranking Using Edge Weight and Importance Numbers (RUEWIN)

How to rank nodes?
• Clustering Based Ranking (CBR)
• Clustering Based Ranking with Partitions (CBRP)
• Inter-cluster edges are more important than intra-cluster edges.

Clustering Based Ranking
• How to estimate the distance saved by an edge e joining nodes u and v in the new snapshot?
• Distance saved = SPD(u,v) in the old snapshot minus the weight of edge e.
• How to estimate SPD(u,v) in the old snapshot?
• SPD(u,v) in old snapshot ≈ SPD(u, Cu) + SPD(Cu, Cv) + SPD(Cv, v), where Cu and Cv are the centers of the clusters/partitions containing u and v respectively.
• CBR: Randomly select K nodes in the graph, run Dijkstra from each of the K nodes. Rank edges and hence nodes.
• CBRP: Similar to CBR, except that we first partition the graph using a graph partitioning algorithm (e.g. METIS) and then randomly choose a node within each partition.
• Over-estimates SPD(u,v) in the old snapshot for intra-cluster edges, but not a worry!

Experiments

Related work
• Shortest path algorithms: Dijkstra [11], Shimbel [20], Johnson [15], Floyd, Warshall [14,21]
• Router networks [8,22]
• Outlier detection [5,13,18]
• Time dependent shortest paths [25,26]
• Dynamic shortest paths computation [3,4,6,19]
• Between-ness measures [23,24]

References

Thanks!
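
Incidence Algorithm: a minimal sketch
The Incidence Algorithm slide can be illustrated with a small Python sketch. This is not the authors' implementation. It assumes undirected snapshots stored as nested dicts ({node: {neighbor: weight}}), non-negative weights, node labels that can be compared with each other (e.g. strings), and that edges are only added or re-weighted between snapshots. The names `undirected`, `dijkstra`, `changed_edge_nodes` and `incidence_top_k` are hypothetical helpers introduced here for illustration.

```python
import heapq


def undirected(edges):
    """Build an undirected graph {node: {neighbor: weight}} from (u, v, w) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph


def dijkstra(graph, source):
    """Single source shortest path distances from `source` to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def changed_edge_nodes(g1, g2):
    """Seed set V': endpoints of edges that are new or re-weighted in g2.

    Assumption of this sketch: edges deleted between snapshots are ignored.
    """
    seeds = set()
    for u, nbrs in g2.items():
        for v, w in nbrs.items():
            if g1.get(u, {}).get(v) != w:
                seeds.update((u, v))
    return seeds


def incidence_top_k(g1, g2, k):
    """Run Dijkstra from each seed on both snapshots and return the k largest changes."""
    changes = {}
    for s in changed_edge_nodes(g1, g2):
        d1 = dijkstra(g1, s)
        d2 = dijkstra(g2, s)
        for t, new_d in d2.items():
            if t == s or t not in d1:
                continue  # skip self pairs and pairs unreachable in the old snapshot
            pair = tuple(sorted((s, t)))
            changes[pair] = abs(d1[t] - new_d)
    return heapq.nlargest(k, changes.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Old snapshot: path a-b-c-d. New snapshot adds the edge a-d.
    old = undirected([("a", "b", 1), ("b", "c", 1), ("c", "d", 1)])
    new = undirected([("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("a", "d", 1)])
    print(incidence_top_k(old, new, 1))  # [(('a', 'd'), 2.0)]
```

In this toy example the only new edge is a-d, so V’ = {a, d} and two Dijkstra runs per snapshot are enough to find the top pair. As the slides note, the result is reliable for the top 1 pair, while larger k may require expanding the seed set.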