Using Structure Indices for
Efficient Approximation of
Network Properties
Matthew J. Rattigan, Marc Maier, and David Jensen
University of Massachusetts Amherst
Data Mining
November 27, 2006
Deborah Stoffer
The Problem

Recent research works with very large networks


Millions of nodes
Calculating network statistics on very large
networks can be difficult


Shortest paths
Betweenness centrality


The proportion of all shortest paths in the network that run
through a given node
Closeness centrality

The average distance from the given node to every other node
in the network
The Problem

The most efficient known algorithms for
calculating betweenness centrality and closeness
centrality are O(ne + n^2 log n)



n – number of nodes
e – number of edges
Calculations for path finding can have even
higher complexity

Require bidirectional breadth-first search
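
As a point of reference for the baseline named above, here is a minimal sketch of bidirectional breadth-first search on an unweighted, undirected graph. The adjacency-dict representation, the function name, and the toy graph are assumptions made for illustration; they do not come from the slides.

```python
from collections import deque

def bidirectional_bfs(adj, s, t):
    """Return (shortest path length, nodes explored); length is None if no path."""
    if s == t:
        return 0, 1
    dist_s, dist_t = {s: 0}, {t: 0}
    frontier_s, frontier_t = deque([s]), deque([t])
    explored = 0
    while frontier_s and frontier_t:
        # Expand whichever side currently has the smaller frontier.
        if len(frontier_s) <= len(frontier_t):
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        best = None
        for _ in range(len(frontier)):  # process one full BFS level
            u = frontier.popleft()
            explored += 1
            for v in adj[u]:
                if v in other:  # the two searches meet on edge (u, v)
                    length = dist[u] + 1 + other[v]
                    best = length if best is None else min(best, length)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        if best is not None:
            return best, explored
    return None, explored

# Toy undirected graph (hypothetical data): a path 0-1-2-3-4 plus an isolated node 5.
toy_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: []}
length, explored = bidirectional_bfs(toy_graph, 0, 4)
```

The explored-node count returned here is the kind of quantity the later exploration ratio compares against.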
The Problem

Example - Rexa citation graph



Papers in computer science and related fields
Largest connected component contains 165,000
nodes (papers) and 321,000 edges (citations)
Finding a path of length 15 requires the exploration of
65,000 nodes
The Problem
Network Structure Index (NSI)



Similar to the type of index commonly used to speed
queries in modern database systems
Can be constructed once for a given graph and then used
to speed the calculations of many measures on the graph
Two components of an NSI

Set of annotations on every node in the network that provide
information about relative or absolute location


For G(V,E) the annotations define A: V → S, where S is an
arbitrarily complex “annotation space”
A distance function that uses the annotations to define graph
distance between pairs of nodes by mapping pairs of node
annotations to a positive real number

D: S × S → R
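
The two components can be pictured as a small container holding the annotation map A and the distance function D. The class name, field names, and toy annotations below are hypothetical and serve only to illustrate the definition.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Hashable

@dataclass
class NetworkStructureIndex:
    annotations: Dict[Hashable, Any]          # A: V -> S
    distance_fn: Callable[[Any, Any], float]  # D: S x S -> R

    def distance(self, u, v) -> float:
        """Estimate the graph distance between u and v from annotations alone."""
        return self.distance_fn(self.annotations[u], self.annotations[v])

# Toy example (hypothetical): nodes on a line, each annotated with its position.
positions = {"a": 0, "b": 1, "c": 2, "d": 3}
nsi = NetworkStructureIndex(positions, lambda x, y: abs(x - y))
```

Once the annotations exist, `nsi.distance("a", "d")` involves no graph traversal at all, which is the point of building the index once and reusing it.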
Types of Network Structure Indices
All Pairs Shortest Path (APSP)
Degree
Landmark
Global Network Positioning (GNP)
Zone
Distance to Zone (DTZ)

All Pairs Shortest Path NSI

Node annotations


Consist of an n x n matrix (n = |V|) containing the
optimal path distances between all pairs of nodes
Distance function

A simple lookup in the matrix
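
A minimal sketch of the APSP index for an unweighted graph, assuming an adjacency-dict representation. It runs one breadth-first search per node and stores the result as a dict of dicts rather than a dense matrix; the function names are hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_apsp_index(adj):
    """Annotation step: one row of shortest-path distances per node."""
    return {u: bfs_distances(adj, u) for u in adj}

def apsp_distance(index, s, t):
    """Distance function: a simple lookup (infinite if t is unreachable)."""
    return index[s].get(t, float("inf"))

# Toy graph (hypothetical): triangle 0-1-2 with a tail 2-3, plus isolated node 4.
toy_graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2], 4: []}
apsp = build_apsp_index(toy_graph)
```

The lookup is exact, but the index needs on the order of n^2 entries, which is what makes it impractical for graphs with millions of nodes.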
Degree NSI

Node annotations


Annotate each node with its undirected degree within
the graph
Distance function between source node s and
target node t

D_Degree(s, t) = 2n – degree(s) – degree(t)
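
A short worked example of the Degree NSI, assuming an undirected adjacency dict with no duplicate edges. The function names and the star graph are hypothetical.

```python
def build_degree_index(adj):
    """Annotation step: each node is labeled with its undirected degree."""
    return {u: len(neighbors) for u, neighbors in adj.items()}

def degree_distance(index, s, t):
    """Distance function from the slide: 2n - degree(s) - degree(t)."""
    n = len(index)
    return 2 * n - index[s] - index[t]

# Toy star graph (hypothetical): hub 0 connected to leaves 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
degree_index = build_degree_index(star)
```

With n = 4, the hub-to-leaf value is 8 - 3 - 1 = 4 and the leaf-to-leaf value is 8 - 1 - 1 = 6, so a search guided by this function is drawn toward high-degree nodes.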
Landmark NSI
Randomly designate a small number of nodes in
the network to serve as navigational beacons
Node annotations




Annotate nodes in the graph by flooding out from
each landmark and recording the graph distance to
each node in the network
Gives a vector of graph distances for each node
Distance function
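
The formula for the Landmark distance function did not survive in this copy of the slides. The sketch below implements the flooding annotation as described and, as a clearly labeled assumption, uses the triangle-inequality upper bound (the smallest sum of the two nodes' distances to a shared landmark) as the distance function. All names are hypothetical.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Flood out from source, recording hop distance to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_landmark_index(adj, num_landmarks, seed=0):
    """Annotate every node with its vector of distances to random landmarks."""
    rng = random.Random(seed)
    landmarks = rng.sample(list(adj), num_landmarks)
    floods = [bfs_distances(adj, l) for l in landmarks]
    index = {u: [f.get(u, float("inf")) for f in floods] for u in adj}
    return landmarks, index

def landmark_distance(index, s, t):
    # ASSUMPTION: upper bound min over landmarks l of d(s, l) + d(l, t).
    # The slide's own distance function is not recoverable from this copy.
    return min(ds + dt for ds, dt in zip(index[s], index[t]))

# Toy path graph 0-1-2-3-4 (hypothetical data).
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
all_landmarks, full_index = build_landmark_index(path_graph, num_landmarks=5)
few_landmarks, small_index = build_landmark_index(path_graph, num_landmarks=2)
```

Under this assumed distance function the estimate never falls below the true distance, and it becomes exact when every node is a landmark.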

Global Network Positioning NSI

Node annotation


Annotation uses a nonlinear optimization algorithm
to create a multidimensional coordinate system that
encodes the location of each node within the network
Distance function is the Manhattan distance
between node pairs

Zone NSI

Node annotations


Each node is annotated with a d-dimensional vector of
zone labels
Distance function

Zone NSI Algorithm

For d dimensions



Randomly select k seed nodes, assign them zone
labels 1 through k, and place them in the labeled set
Place all other nodes in the unlabeled set
While the unlabeled set is not empty



Randomly select a node l from the labeled set
Randomly select a node u from the unlabeled set that is a
neighbor to l
Assign u to the same zone as l and move it to the labeled set
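
The zone-growing steps above can be sketched as follows. Two details are assumptions rather than part of the slides: labeled nodes with no unlabeled neighbors are dropped from the candidate pool so the loop terminates, and the distance function (whose formula is missing from this copy) is taken to be the number of dimensions in which two nodes' zone labels differ. All names are hypothetical.

```python
import random

def grow_zones(adj, k, rng):
    """One dimension: grow k zones outward from k random seed nodes."""
    seeds = rng.sample(list(adj), k)
    zone = {seed: label for label, seed in enumerate(seeds, start=1)}
    active = list(seeds)  # labeled nodes that may still have unlabeled neighbors
    while active:
        i = rng.randrange(len(active))
        l = active[i]
        unlabeled = [v for v in adj[l] if v not in zone]
        if not unlabeled:
            # l cannot grow its zone any further: swap-and-pop removal.
            active[i] = active[-1]
            active.pop()
            continue
        u = rng.choice(unlabeled)
        zone[u] = zone[l]  # u joins the same zone as l
        active.append(u)
    return zone

def build_zone_index(adj, d, k, seed=0):
    """Annotate each node with a d-dimensional vector of zone labels."""
    rng = random.Random(seed)
    dims = [grow_zones(adj, k, rng) for _ in range(d)]
    # Nodes unreachable from every seed get None in that dimension.
    return {u: tuple(dim.get(u) for dim in dims) for u in adj}

def zone_distance(index, s, t):
    # ASSUMPTION: count the dimensions where the two zone labels differ.
    return sum(a != b for a, b in zip(index[s], index[t]))

# Toy ring of 8 nodes (hypothetical data).
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
zone_index = build_zone_index(ring, d=3, k=2, seed=1)
```

Because each dimension uses a fresh random set of seeds, nodes that are close in the graph tend to share a zone in most dimensions, while distant nodes tend to differ in many.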
Distance to Zone (DTZ) NSI
Hybrid between Landmark and Zone NSIs
Node annotations



Divide the graph into zones and for each node u and
zone Z calculate the distance from u to the closest
node in Z
Distance function
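
A minimal single-dimension sketch of the DTZ annotation, using one multi-source breadth-first search per zone. The zone assignment is taken as given, and the distance function is an assumption (the slide's formula is missing from this copy): the distance from s to t's zone plus the distance from t to s's zone. All names are hypothetical.

```python
from collections import deque

def distance_to_members(adj, members):
    """Multi-source BFS: hop distance from every node to its nearest member."""
    dist = {u: 0 for u in members}
    queue = deque(members)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def build_dtz_index(adj, zone_of):
    """Annotate each node u with {zone Z: distance from u to the closest node in Z}."""
    zones = set(zone_of.values())
    per_zone = {
        z: distance_to_members(adj, [u for u in adj if zone_of[u] == z])
        for z in zones
    }
    return {u: {z: per_zone[z].get(u, float("inf")) for z in zones} for u in adj}

def dtz_distance(index, zone_of, s, t):
    # ASSUMPTION: d(s, zone(t)) + d(t, zone(s)); zero when s and t share a zone.
    return index[s][zone_of[t]] + index[t][zone_of[s]]

# Toy path graph 0-1-2-3-4-5 split into two zones (hypothetical data).
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
zone_of = {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
dtz_index = build_dtz_index(path_graph, zone_of)
```

In this toy case the end nodes 0 and 5 get an annotation distance of 3 + 3 = 6 against a true distance of 5. The value is a relative score for guiding search rather than an exact distance, which is why a later slide fits a regression to map annotation distances onto actual ones.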

Complexity of Different NSIs
Search Performance

Optimality of the lengths of paths found

Path ratio






p_f is the length of the found paths
p_o is the length of the optimal paths
r is the number of randomly selected pairs of nodes in
the graph
P = 1.0 indicates an NSI that finds optimal paths
P >> 1.0 indicates a poor performing NSI
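
The path-ratio formula itself is missing from this copy of the slides. The sketch below assumes it is the total found length divided by the total optimal length over the r sampled pairs, which is consistent with P = 1.0 for optimal paths. The function name and the sample numbers are hypothetical.

```python
def path_ratio(found_lengths, optimal_lengths):
    """ASSUMED form: sum of found path lengths / sum of optimal path lengths."""
    if len(found_lengths) != len(optimal_lengths):
        raise ValueError("need one found and one optimal length per sampled pair")
    return sum(found_lengths) / sum(optimal_lengths)

# Made-up lengths for r = 3 sampled pairs, for illustration only.
ratio_optimal = path_ratio([3, 4, 5], [3, 4, 5])
ratio_detour = path_ratio([6, 4, 8], [3, 4, 5])
```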
Search Performance

Performance gain

Exploration ratio






e_f is the number of nodes explored by best-first search
e_b is the number of nodes that are explored using a
bidirectional breadth-first search
r is the number of randomly selected pairs of nodes in the graph
E values close to zero indicate good search performance
E values greater than 1.0 indicate poor search
performance
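
A minimal sketch of best-first search guided by an NSI-style distance estimate, together with the exploration ratio. The ratio's formula is missing from this copy, so the sum-over-sum form is an assumption. The `estimate` callable, the toy graph, and the node positions are hypothetical.

```python
import heapq

def best_first_search(adj, s, t, estimate):
    """Always expand the frontier node with the smallest estimated distance to t.
    Returns (path, nodes_explored); path is None if t is unreachable."""
    parent = {s: None}
    heap = [(estimate(s, t), 0, s)]
    tie = 1  # insertion counter so the heap never has to compare nodes
    explored = 0
    while heap:
        _, _, u = heapq.heappop(heap)
        explored += 1
        if u == t:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1], explored
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                heapq.heappush(heap, (estimate(v, t), tie, v))
                tie += 1
    return None, explored

def exploration_ratio(explored_best_first, explored_bidirectional):
    """ASSUMED form: total e_f over total e_b across the sampled pairs."""
    return sum(explored_best_first) / sum(explored_bidirectional)

# Toy graph (hypothetical): path 0-1-2-3-4 with a dead-end branch 0-5.
graph = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [0]}
position = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: -1}  # stand-in for NSI annotations
path, explored = best_first_search(
    graph, 0, 4, lambda u, v: abs(position[u] - position[v])
)
```

Here the estimate steers the search away from the dead-end node 5, so only the five nodes on the path are explored.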
Search Performance

NSIs evaluated on synthetic graphs



Random
Rewired lattices
Forest Fire
Constant Time Distance Estimation
An NSI can sometimes be used to directly estimate the
graph distance between any two nodes
The DTZ annotation distance can be used to estimate
actual graph distances




Annotate the graph as described for the DTZ NSI
Randomly sample p pairs of nodes in the graph and
perform breadth-first search to obtain their exact graph
distance
Use linear regression to obtain an equation for
estimated distance
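
The regression step can be sketched with ordinary least squares on a single predictor. The sketch assumes the sampled pairs are already available as two parallel lists (annotation distance and exact breadth-first distance); the numbers are invented for illustration and are not results from the paper.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def estimate_distance(annotation_distance, slope, intercept):
    """Constant-time estimate: no graph search, just the fitted equation."""
    return slope * annotation_distance + intercept

# Invented sample of p = 4 pairs: annotation distance vs. exact graph distance.
annotation_distances = [0, 2, 4, 6]
exact_distances = [1, 2, 3, 4]
slope, intercept = fit_line(annotation_distances, exact_distances)
```

After fitting, each distance estimate costs one annotation lookup per node and one multiply-add, independent of graph size.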
Constant Time Distance Estimation

Simple distance can be used to produce a wide
variety of attributes on nodes, which can be used
by data mining algorithms that analyze graphs

Label nodes with their distance to a particular node in a
graph


How close is each actor to Kevin Bacon?
Label nodes with the minimum or maximum distance
to one of a set of designated nodes

How close is each actor to an Academy Award winner?
Closeness Centrality

Measures the proximity of a given node in a
network to every other node

Important to social network dynamics
Accurate estimates of closeness centrality are often
infeasible to calculate for large data sets
Using an NSI for path finding makes it possible to
estimate closeness centrality efficiently
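
One way to picture the estimate: average the distances from a node to a random sample of other nodes, where each distance comes from an NSI-based estimator or NSI-guided search. Sampling targets is an assumption of this sketch, as are the `distance` callable and the toy data.

```python
import random

def estimate_closeness(node, nodes, distance, sample_size, seed=0):
    """Average distance from node to a random sample of the other nodes."""
    rng = random.Random(seed)
    others = [v for v in nodes if v != node]
    sample = rng.sample(others, min(sample_size, len(others)))
    return sum(distance(node, v) for v in sample) / len(sample)

# Toy data (hypothetical): five nodes on a line, distance = position difference.
line_nodes = [0, 1, 2, 3, 4]
line_distance = lambda u, v: abs(u - v)
closeness_middle = estimate_closeness(2, line_nodes, line_distance, sample_size=4)
closeness_end = estimate_closeness(0, line_nodes, line_distance, sample_size=4)
```

With the sample covering all four other nodes, the middle node averages 1.5 and the end node 2.5. A lower average distance means a more central node.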

Closeness Centrality

A measure of centrality can be used to produce
attributes on nodes that may be useful to
knowledge discovery algorithms

Determine the closeness of every node to a collection
of key nodes

Closeness to all winners of Academy Awards for best actor in
the past 10 years
Constrain closeness calculations to members of
clusters

Closeness rank of an actor within their movie industry
Weight closeness based on the attributes of the
outlying nodes

Closeness to winners of Academy Awards weighted by how
recently the award was won
Betweenness Centrality

Measures the number of shortest paths on which a
given node lies

Important to social network dynamics
Accurate estimates of betweenness centrality are
often infeasible to calculate for large data sets

Betweenness Centrality
Can estimate betweenness using the paths
identified through NSI navigation
Randomly sample pairs of nodes and discover the
shortest path between them
Count the number of times each node in the graph
appears on one of these paths to obtain a
betweenness ranking
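
The sampling procedure above can be sketched as follows. Plain breadth-first search stands in for NSI-guided navigation so the example stays self-contained. Counting only interior nodes (not the endpoints) is an assumption of this sketch, and all names and the toy graph are hypothetical.

```python
import random
from collections import Counter, deque

def bfs_path(adj, s, t):
    """Stand-in path finder: one shortest path from s to t, or None."""
    parent = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    if t not in parent:
        return None
    path, u = [], t
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]

def estimate_betweenness(adj, num_pairs, find_path=bfs_path, seed=0):
    """Rank nodes by how often they appear on paths between sampled pairs."""
    rng = random.Random(seed)
    nodes = list(adj)
    counts = Counter()
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)
        path = find_path(adj, s, t)
        if path:
            counts.update(path[1:-1])  # ASSUMPTION: interior nodes only
    return counts.most_common()

# Toy graph (hypothetical): two triangles {0,1,2} and {4,5,6} joined via node 3.
barbell = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4],
           4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
ranking = estimate_betweenness(barbell, num_pairs=200)
```

In this toy graph only nodes 2, 3 and 4 can lie in the interior of a shortest path, so they are the only nodes that receive counts. Any other path finder with the same signature can be passed as `find_path`.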

Betweenness Centrality

A high betweenness score can indicate a bridge
between two communities


An actor who has appeared in movies from
different movie industries
Betweenness centrality can be used to create
features on nodes that are useful for data mining

Calculate betweenness centrality for particular groups
of nodes

Actors that sit between winners of Academy Awards for best
picture and the IMDb’s “Bottom 100”, the worst 100 movies as
voted by users of the Internet Movie Database
Conclusions
The Zone and DTZ NSIs allow efficient and
accurate estimation of path lengths between
arbitrary nodes in a network
Efficient calculation of network statistics allows a
broader range of potential approaches to knowledge
discovery
Not all potential NSIs have been exhaustively
researched
NSIs could have other applications



Finding connection subgraphs
Approximating neighborhood functions
Questions?