Finding Top-k Shortest Path
Distance Changes in an
Evolutionary Network
Manish Gupta (UIUC), Charu Aggarwal (IBM), Jiawei Han (UIUC)
SSTD 2011
24th August 2011
Networks as evolutionary graphs
• Social networks: new users join, new
friendships are created.
• Bibliographic networks: new authors publish
more papers, more collaborations are done.
• Transportation/road networks: new roads are
constructed.
• Ad hoc networks: army vehicles change
positions very frequently, and new messages
are transmitted.
Analysis of evolutionary networks
• Community formation, using clustering
techniques
• Metrics to study evolution – merge/split
• Information diffusion across evolutionary
networks
• Link prediction tasks
• Queries over evolving networks
Queries over Evolving networks
• Updating shortest path distance between two
nodes as the edge weights change. E.g., in
computer networks, routers need to update their
shortest path trees when a link goes down.
• Given a time-dependent network (edge weights
are functions of time), how to compute SPD(u, v,
t).
• Queries incorporating the max flow constraints.
Transportation Planning Problem
• Given the current set of roads, we want to
overlay a network of new roads.
• Civil engineers propose two plans: A and B with
different sets of new roads
• Which plan is better?
• Plan A brings cities X and Y very close. X produces
a lot of product P while Y has a rich demand for
product P.
• Plan A actually brings lots of “economically
important pairs” of cities close to each other.
Select plan A over B.
Our problem
• Given an evolutionary network with two
snapshots G1 and G2.
• Compute top few node pairs with maximum
shortest path distance change across the two
snapshots.
• For example, across 2005 and 2011, distance
between which pair of cities in Illinois
decreased the most, thanks to the new roads
built in this time period?
Naïve Approach
• Compute shortest path distance between every
pair of nodes for snapshot G1.
• Compute shortest path distance between every
pair of nodes for snapshot G2.
• Compute distance change for every pair of nodes.
• Sort the distance change vector
• Return node pairs corresponding to the top few
distance change values.
• Highly inefficient solution!
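The naïve steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each snapshot is an undirected graph stored as a nested dict `{node: {neighbor: weight}}` with non-negative weights, and it measures change as the absolute difference in distance. The names `dijkstra`, `naive_top_k` and `undirected` are made up for this example.

```python
import heapq
from itertools import combinations

def dijkstra(graph, source):
    """Single-source shortest path distances on a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def naive_top_k(g1, g2, k):
    """All-pairs distances on both snapshots, then sort the changes."""
    nodes = sorted(set(g1) & set(g2))
    d1 = {u: dijkstra(g1, u) for u in nodes}
    d2 = {u: dijkstra(g2, u) for u in nodes}
    changes = []
    for u, v in combinations(nodes, 2):
        if v in d1[u] and v in d2[u]:  # skip pairs unreachable in a snapshot
            changes.append((abs(d1[u][v] - d2[u][v]), (u, v)))
    changes.sort(key=lambda c: c[0], reverse=True)
    return changes[:k]

def undirected(edges):
    """Build a symmetric adjacency dict from (u, v, weight) triples."""
    g = {}
    for u, v, w in edges:
        g.setdefault(u, {})[v] = w
        g.setdefault(v, {})[u] = w
    return g

# Toy snapshots: a path a-b-c-d-e, then a new edge a-e in the second snapshot.
path = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "e", 1)]
G1 = undirected(path)
G2 = undirected(path + [("a", "e", 1)])
top = naive_top_k(G1, G2, 3)
```

The cost is one Dijkstra run per node per snapshot plus a sort over all node pairs, which is what makes this approach impractical on large graphs.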
Solution
• We experiment on three datasets: DBLP co-authorship graph,
IMDB co-starring graph and Ontario province road network.
• Throw in more CPUs!
• Shortest path algorithms are easily parallelizable. Distribute the
single-source shortest path runs across thousands of machines.
• On the Ontario road network dataset, it took around 400 CPU
days!
OR
• Use our algorithm
• Our methods are ~50-100X faster than baseline
Outline
• Smartly choose a seed set of a few source nodes from
which to run a single-source shortest path algorithm:
the Incidence Algorithm.
• Improve the accuracy of the Incidence Algorithm by
intelligently expanding the seed set using the edge
importance estimation algorithm.
• Generalize the problem to a node ranking
problem.
• Suggest node ranking strategies.
• Experimental results and analysis.
Incidence Algorithm
• Maximum distance change will happen for
node pairs consisting of nodes on which new
edges or edges with changed weights are
incident.
• Let V’ be the set of nodes with new edges.
• Algorithm: Run single source SPD algorithm
from each node in V’ on both snapshots,
compute difference (change), sort and return
top k.
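A minimal sketch of the Incidence Algorithm as described on this slide. It assumes snapshots are undirected nested dicts `{node: {neighbor: weight}}` with non-negative weights, and that change is the absolute distance difference. The helper names (`changed_nodes`, `incidence_top_k`, `undirected`) are illustrative, not taken from the paper.

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest path distances on a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def changed_nodes(g1, g2):
    """V': nodes incident on an edge that is new or has a changed weight in g2."""
    seeds = set()
    for u, nbrs in g2.items():
        for v, w in nbrs.items():
            if g1.get(u, {}).get(v) != w:
                seeds.update((u, v))
    return seeds

def incidence_top_k(g1, g2, k):
    """Run Dijkstra only from nodes in V' on both snapshots; return top-k changes."""
    best = {}
    for s in changed_nodes(g1, g2):
        d1, d2 = dijkstra(g1, s), dijkstra(g2, s)
        for t in d1.keys() & d2.keys():
            if t != s:
                pair = tuple(sorted((s, t)))  # unordered pair, stored once
                best[pair] = abs(d1[t] - d2[t])
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]

def undirected(edges):
    """Build a symmetric adjacency dict from (u, v, weight) triples."""
    g = {}
    for u, v, w in edges:
        g.setdefault(u, {})[v] = w
        g.setdefault(v, {})[u] = w
    return g

# Toy snapshots: a path a-b-c-d-e, then a new edge a-e in the second snapshot.
path = [("a", "b", 1), ("b", "c", 1), ("c", "d", 1), ("d", "e", 1)]
G1 = undirected(path)
G2 = undirected(path + [("a", "e", 1)])
seeds = changed_nodes(G1, G2)
result = incidence_top_k(G1, G2, 3)
```

Here only two of the five nodes act as sources, so the work is four Dijkstra runs instead of ten.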
Is Incidence Algorithm accurate?
• For top 1, yes.
• But not for top k (k > 1).
• Δd(a, d) could be greater
than Δd(b′, c′).
• Multiple edges can combine together and
cause much more distance changes compared
to that by just one edge.
• Solution: To get better accuracy, expand the
seed set.
How to expand the seed set (V’)?
• Consider the neighbors of all the nodes
currently in V’ as potential candidates.
• Expand to a promising neighbor.
• In particular, expand to a neighbor node a, if the
edge that connects a to the current set V’
has relatively high importance, relative to
other edges incident on node a.
• Terminate when the top k node pairs don’t
change.
[Figure: neighbor node a and the seed set V’]
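One way to read the expansion rule is sketched below. The slide does not say how "relatively high" is decided, so this example assumes a candidate a is added when the edges linking a to V’ carry at least a `threshold` share of the total importance of all edges incident on a. The threshold, the `importance` dict keyed by `frozenset((u, v))`, the `max_rounds` cap and the `top_k_fn` callback are all assumptions of the sketch.

```python
def expand_once(graph, seeds, importance, threshold=0.5):
    """Add neighbors of the seed set whose linking edges are relatively important."""
    seeds = set(seeds)
    new_seeds = set(seeds)
    candidates = {v for s in seeds for v in graph.get(s, {})} - seeds
    for a in candidates:
        nbrs = graph.get(a, {})
        total = sum(importance.get(frozenset((a, n)), 0.0) for n in nbrs)
        linking = sum(importance.get(frozenset((a, n)), 0.0)
                      for n in nbrs if n in seeds)
        if total > 0 and linking / total >= threshold:
            new_seeds.add(a)
    return new_seeds

def expand_until_stable(graph, seeds, importance, top_k_fn,
                        threshold=0.5, max_rounds=10):
    """Expand the seed set until the top-k result stops changing."""
    seeds = set(seeds)
    current = top_k_fn(seeds)
    for _ in range(max_rounds):
        bigger = expand_once(graph, seeds, importance, threshold)
        if bigger == seeds:
            break  # no promising neighbor left
        result = top_k_fn(bigger)
        seeds = bigger
        if result == current:
            break  # top-k node pairs did not change
        current = result
    return seeds, current

# Toy example: path a-b-c-d with hand-picked importance numbers.
graph = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
         "c": {"b": 1, "d": 1}, "d": {"c": 1}}
importance = {frozenset(("a", "b")): 0.6,
              frozenset(("b", "c")): 0.2,
              frozenset(("c", "d")): 0.5}
step1 = expand_once(graph, {"a"}, importance)
step2 = expand_once(graph, step1, importance)
# Stand-in for a real top-k computation, used only to exercise the loop.
final_seeds, _ = expand_until_stable(graph, {"a"}, importance,
                                     top_k_fn=lambda s: sorted(s))
```

In practice `top_k_fn` would rerun the single-source computations from the enlarged seed set on both snapshots.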
Edge importance number
• Importance number of an edge is the probability
that the edge will lie on a randomly chosen
shortest path tree in the graph.
• How to compute edge importance number for
edge e?
• First find all shortest path trees and then find
how many of such trees contain edge e.
• Too expensive! As inefficient as the naïve solution
itself!
• Hence we estimate the edge importance
number using a randomized algorithm.
Edge Importance Estimation Algorithm
• Randomly sample a few nodes from the graph.
• Using each of these nodes S as source, obtain a shortest path tree T
using an SPD algorithm (e.g. Dijkstra).
• For each tree T, perform distance labeling.
• Alternative tight edge: an alternative edge
which could replace an existing
edge of T to give T’.
• For each edge in T, obtain multiple T’
by replacing a tight edge using an
alternative tight edge.
• Edge importance of an edge wrt T
is proportional to the number of
descendants.
• Aggregate I(edge) across all different SPTs.
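A simplified sketch of the sampling idea follows. It builds one shortest path tree per sampled source and scores each tree edge by the number of nodes in the subtree below it, averaged over the samples. It deliberately omits the alternative tight edge step (the extra trees T’), and the normalization by reachable nodes is an assumption of this example rather than something the slide specifies.

```python
import heapq
import random

def shortest_path_tree(graph, source):
    """Dijkstra returning parent pointers and nodes in order of finalization."""
    dist = {source: 0.0}
    parent = {source: None}
    order, done = [], set()
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        order.append(u)
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent, order

def estimate_edge_importance(graph, num_samples, seed=0):
    """Average, over sampled trees, the share of nodes hanging below each edge."""
    rng = random.Random(seed)
    nodes = sorted(graph)
    sources = rng.sample(nodes, min(num_samples, len(nodes)))
    importance = {}
    for s in sources:
        parent, order = shortest_path_tree(graph, s)
        size = {v: 1 for v in order}
        for v in reversed(order):  # children are finalized after their parents
            if parent[v] is not None:
                size[parent[v]] += size[v]
        reach = max(len(order) - 1, 1)
        for v in order:
            p = parent[v]
            if p is not None:
                e = frozenset((p, v))
                importance[e] = importance.get(e, 0.0) + size[v] / reach / len(sources)
    return importance

# Toy example: on the path a-b-c-d the middle edge should score highest.
graph = {"a": {"b": 1}, "b": {"a": 1, "c": 1},
         "c": {"b": 1, "d": 1}, "d": {"c": 1}}
imp = estimate_edge_importance(graph, num_samples=4)
```

With `num_samples` equal to the number of nodes every node is used as a source; a real run would sample far fewer.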
Generalizing the problem
• Naïve solution: Use all nodes in both
snapshots.
• Incidence algorithm: Use only nodes in V’.
• Generalized solution?
• Node ranking problem.
• Rank nodes such that running Dijkstra’s
algorithm from just the top few nodes provides
high accuracy for the “top-k node pairs with
maximum distance change” problem.
How to rank nodes?
• Random: Randomly select nodes from the
graph.
• RandomNWNE: Randomly select nodes from
seed set V’ (nodes with new edges).
• Edge Weight Based Ranking (EWBR).
• Edge Weight Change Based Ranking (EWCBR).
How to rank nodes?
• Importance Number Based Ranking (INBR)
• Importance Number Change Based Ranking
(INCBR)
• Ranking Using Edge Weight and Importance
Numbers (RUEWIN)
How to rank nodes?
• Clustering Based Ranking (CBR)
• Clustering Based Ranking with Partitions (CBRP)
• Inter-cluster edges are more important than intra-cluster
edges.
Clustering Based Ranking
• How to estimate the distance saved by an edge e joining nodes u
and v in new snapshot?
• Distance saved = SPD(u,v) in old snapshot minus the weight of
edge e.
• How to estimate SPD(u,v) in old snapshot?
• SPD(u,v) in old snapshot ≈ SPD(u, Cu)+SPD(Cu, Cv)+SPD(Cv, v) where
Cu and Cv are centers of clusters/partitions containing u and v
respectively.
• CBR: Randomly select K nodes in the graph, run Dijkstra from each
of the K nodes. Rank edges and hence nodes.
• CBRP: Similar to CBR except that first partition graph using some
graph partitioning algorithm (e.g. METIS) and then randomly
choose a node within each partition.
• Over-estimates SPD(u,v) in old snapshot for intra-cluster edges, but
that is not a concern.
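The center-based estimate can be sketched as below. This is an illustration under stated assumptions: the old snapshot is an undirected nested dict `{node: {neighbor: weight}}`, each node belongs to the cluster of its nearest center, and the saving is taken as estimated old distance minus new edge weight, so a larger value means a bigger shortcut. Centers are passed in explicitly; CBR would choose them at random and CBRP would choose one per partition. All function names are made up for the example.

```python
import heapq

INF = float("inf")

def dijkstra(graph, source):
    """Single-source shortest path distances on a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def rank_new_edges(old_graph, new_edges, centers):
    """Rank new edges by estimated distance saved, using only K Dijkstra runs."""
    dist_from = {c: dijkstra(old_graph, c) for c in centers}

    def nearest(u):
        return min(centers, key=lambda c: dist_from[c].get(u, INF))

    ranked = []
    for u, v, w in new_edges:
        cu, cv = nearest(u), nearest(v)
        # SPD(u, v) ~ SPD(u, Cu) + SPD(Cu, Cv) + SPD(Cv, v) in the old snapshot
        est = (dist_from[cu].get(u, INF) + dist_from[cu].get(cv, INF)
               + dist_from[cv].get(v, INF))
        ranked.append((est - w, (u, v)))
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked

def rank_nodes(ranked_edges):
    """Order nodes by the best-ranked new edge they are incident on."""
    seen = []
    for _, (u, v) in ranked_edges:
        for n in (u, v):
            if n not in seen:
                seen.append(n)
    return seen

# Toy example: path a-b-c-d-e-f with centers b and e, and two new edges.
old = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1, "d": 1},
       "d": {"c": 1, "e": 1}, "e": {"d": 1, "f": 1}, "f": {"e": 1}}
edges = rank_new_edges(old, [("a", "c", 1), ("a", "f", 1)], centers=["b", "e"])
nodes = rank_nodes(edges)
```

The inter-cluster edge a-f ranks above the intra-cluster edge a-c, in line with the slide's point that inter-cluster edges matter more.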
Experiments
Related work
• Shortest path algorithms: Dijkstra [11], Shimbel
[20], Johnson [15], Floyd-Warshall [14,21]
• Router networks [8,22]
• Outlier detection [5,13,18]
• Time dependent shortest paths [25,26]
• Dynamic shortest paths computation [3,4,6,19]
• Betweenness measures [23,24]
References
Thanks!