Betweenness Metric
Surya Challa (chillinmoon)
Nov 1, 2010
Betweenness is a metric applied to a vertex within a weighted graph. For the purposes of this
problem, we define the betweenness of vertex T as the number of shortest paths between two vertices
in the graph that include vertex T but do not start or end at vertex T. Vertices that occur on
many shortest paths between other vertices have higher betweenness scores than those that do not.
Betweenness Centrality
‘Betweenness’ is closely related to ‘Betweenness Centrality’, one of several measures of the
‘Centrality’ of a vertex, which determine the relative importance of a vertex within the
graph. We will look at the relation between ‘Betweenness’ and ‘Betweenness Centrality’ in the
implementation section.
Formal Definition:
Given a graph G(V, E), we define S[s][t](v) as the number of shortest paths from vertex s (start) to
vertex t (end) that include vertex v, where s, t, and v are vertices in V. The betweenness score of a given
vertex v in V is the sum of S[s][t](v) over all s and t in V such that s != v and t != v.
Problem Description:
We will develop a threaded program that reads a weighted graph and computes the
betweenness score of each vertex within the graph. The command line parameters are the
name of the text file holding the directed graph and a parameter that indicates how much output the
application should produce.
Since all shortest paths in the graph are to be taken into account (even the path from a vertex to itself),
the betweenness scores of the start and end vertices of a path are not changed when considering
that path. Thus, if the shortest path between two vertices is a direct edge, there are no vertices between
those two and no betweenness score changes because of that path. On the other hand, if
several paths between two vertices share the same shortest total weight, the betweenness score of each
vertex found along such a path (other than the start and end vertices) is incremented once for each
shortest path in which the vertex appears. For example, if the paths "2 1 0" and "2 1 4 0" both had the
shortest length from vertex "2" to vertex "0", the score for vertex "1" would increment by 2 and the
score for vertex "4" would increment by 1.
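The counting rule in the example above can be sketched in C++ as follows. This is only an illustration of the definition, not the algorithm used later: it assumes the shortest paths have already been enumerated as vertex sequences, and `scoreInteriorVertices` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: every path is a vertex sequence that is already
// known to be a shortest path. Each interior vertex gains 1 per path.
std::vector<std::uint64_t> scoreInteriorVertices(
    const std::vector<std::vector<int>>& shortestPaths, int vertexCount) {
    assert(vertexCount >= 0);
    std::vector<std::uint64_t> score(static_cast<std::size_t>(vertexCount), 0);
    for (const std::vector<int>& path : shortestPaths) {
        // Skip the first (start) and the last (end) vertex of the path.
        for (std::size_t i = 1; i + 1 < path.size(); ++i) {
            assert(path[i] >= 0 && path[i] < vertexCount);
            ++score[static_cast<std::size_t>(path[i])];
        }
    }
    return score;
}
```

Called with the two paths "2 1 0" and "2 1 4 0" on a five-vertex graph, the helper adds 2 to vertex 1 and 1 to vertex 4, and leaves the start vertex 2 and the end vertex 0 unchanged.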
Input Description:
The input to the program will come from a text file named on the application command line. The file will
contain multiple lines. The first line will contain a single integer specifying the number of vertices within
the graph (N). Each remaining line will describe one edge of the graph using three integers (i, j, w). The first
two integers are the (zero-based) indices of the start (i) and end (j) vertices of the edge, and the
third integer is the weight (w) associated with that edge. The graph will not contain an edge
from a vertex to itself.
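A minimal C++ sketch of a reader for this format is shown below. The names `Edge`, `Graph`, `readGraph` and `readGraphFromString` are hypothetical, the adjacency-list layout is an assumption, and the guarantees stated by the format (indices in range, no self-loops, positive weights) are only checked with `assert`.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Edge { int to; int weight; };             // hypothetical edge record
using Graph = std::vector<std::vector<Edge>>;    // g[i] holds the edges leaving i

// Reads N from the first line, then one "i j w" triple per edge.
Graph readGraph(std::istream& in) {
    int n = 0;
    in >> n;
    assert(!in.fail() && n >= 0);
    Graph g(static_cast<std::size_t>(n));
    int i = 0, j = 0, w = 0;
    while (in >> i >> j >> w) {
        assert(i >= 0 && i < n && j >= 0 && j < n);  // zero-based indices
        assert(i != j && w > 0);                     // no self-loops, positive weights
        g[static_cast<std::size_t>(i)].push_back({j, w});
    }
    return g;
}

// Convenience wrapper for in-memory text, handy for small experiments.
Graph readGraphFromString(const std::string& text) {
    std::istringstream in(text);
    return readGraph(in);
}
```

For the real program, the same function could be fed a `std::ifstream` opened on the file named on the command line.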
The second command line parameter will be a single integer (K) that specifies the number of vertices
with highest betweenness score and number of vertices with the lowest betweenness score to be
output.
Output Description:
The output generated by the application is a list of the K vertices with the highest
betweenness scores and the K vertices with the lowest betweenness scores, along with the computed
scores. Output will be printed to stdout. Both lists should be printed in sorted order based
on the betweenness score of the vertices. If more than one vertex shares the same betweenness score,
an arbitrary choice may be made both for the printing order of such vertices and for whether a tied
vertex is included in the output. For example, if the requirement is to print the five lowest scoring vertices
and seven vertices have a score of 0, any five of the seven may be printed.
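One way to pick the K highest or K lowest scores is a partial sort, sketched below in C++. `Scored` and `selectK` are hypothetical names. Returning the lowest scores in ascending order is an assumption, since the text only asks for "sorted order". Ties are broken arbitrarily, which the rules above allow.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Scored = std::pair<int, std::uint64_t>;   // (vertex, betweenness score)

// Returns k (vertex, score) pairs: highest scores first when `highest` is
// true, lowest scores first otherwise. Ties are resolved arbitrarily.
std::vector<Scored> selectK(const std::vector<std::uint64_t>& score,
                            std::size_t k, bool highest) {
    std::vector<Scored> items;
    items.reserve(score.size());
    for (std::size_t v = 0; v < score.size(); ++v)
        items.emplace_back(static_cast<int>(v), score[v]);

    k = std::min(k, items.size());
    const auto middle = items.begin() + static_cast<std::ptrdiff_t>(k);
    // Only the first k positions need to be in order.
    std::partial_sort(items.begin(), middle, items.end(),
                      [highest](const Scored& a, const Scored& b) {
                          return highest ? a.second > b.second
                                         : a.second < b.second;
                      });
    items.erase(middle, items.end());
    assert(items.size() == k);
    return items;
}
```

A partial sort avoids ordering all N vertices when K is small compared to N.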
Command line example: Btween.exe indata01.txt 2
Input file example, indata01.txt:
5
0 1 12
1 0 16
0 2 7
2 1 5
3 0 13
0 4 15
1 3 11
3 1 10
4 0 12
1 4 4
2 3 3
4 2 9
3 4 8
4 3 15
Output file example:
Top 2 betweenness vertices
Vertex  score
2       8
4       6
Low 2 betweenness vertices
Vertex  score
3       2
0       0
Timing: The entire execution time of the application will be used for scoring. For the most accurate timing
results, the code should measure its own execution time and print it to stdout; otherwise an
external stopwatch will be used to measure the execution time.
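The self-timing described above could look like the following C++ sketch. It assumes `std::chrono::steady_clock` is an acceptable clock, and `runTimedMs` is a hypothetical helper name. The contest did not prescribe a particular timing API.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
long long runTimedMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    const long long elapsed =
        std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    assert(elapsed >= 0);   // steady_clock never runs backwards
    return elapsed;
}
```

The program would wrap everything from reading the file to printing the results in the callable passed to `runTimedMs`, then print the returned value to stdout.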
Additional Clarifications:
Additional clarifications were provided by the judges at the contest forum
(http://software.intel.com/en-us/forums/p2-m4-betweenness-metric/). Below is a summary:
1. The graph will not contain any parallel edges.
2. The graph is connected, i.e., every vertex can be reached from every other vertex.
3. A uint64_t will be sufficient to hold the betweenness scores, and a signed 32-bit integer will be
sufficient for the number of vertices. Edge weights will all be positive and will fit in a signed 32-bit integer.
4. For the purposes of this problem, at least one edge must be travelled from one vertex to
another to qualify as a shortest path. It is assumed that the weights along the diagonal (from a
node to itself) are infinity. Thus, the path 0-0 would not be a shortest path.
Serial Algorithm:
Measures of centrality are used to determine the relative importance of a vertex within a graph,
such as how important a person is within a social network, how important a room is within a building,
or how well-used a road is within an urban network. We will derive the ‘Betweenness’ algorithm from that
of ‘Betweenness Centrality’.
Betweenness Centrality:
From ‘http://en.wikipedia.org/wiki/Centrality’:
For a graph G := (V, E) with n vertices, the betweenness C_B(v) of vertex v is computed as
follows:
1. For each pair of vertices (s,t), compute all shortest paths between them.
2. For each pair of vertices (s,t), determine the fraction of shortest paths that pass through the
vertex in question (here, vertex v).
3. Sum this fraction over all pairs of vertices (s,t).
Or, more succinctly:
$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where $\sigma_{st}$ is the number of shortest paths from s to t, and $\sigma_{st}(v)$ is the number of
shortest paths from s to t that pass through a vertex v.
Ulrik Brandes’s algorithm [2] is generally regarded as the fastest known algorithm for computing betweenness
centrality. On weighted graphs it runs in O(nm + n^2 log n) time, where n and m are the number of vertices
and edges in the graph, respectively.
We have adapted Brandes’s algorithm to count the total number of shortest paths that start at a given
‘source vertex’ and pass through each vertex. We call Dijkstra’s shortest path algorithm [1] once per
source vertex, so that the shortest paths originating from every vertex in the graph are counted.
1. For each vertex s in the graph:
   a. Use the modified Brandes algorithm to determine the number of shortest paths from s that
      terminate at each vertex in the graph.
   b. Compute the total number of shortest paths passing through every vertex using the formula below
      (derived using the same approach as in [2]).
$$\sigma_s(v) = \sum_{w \,:\, v \in P_s(w)} \sigma_{sv} \left(1 + \frac{\sigma_s(w)}{\sigma_{sw}}\right)$$
where:
$\sigma_{sv}$: number of shortest paths from s to v
$\sigma_{st}(v)$: number of shortest paths from s to $t \in V$ that have v on them
$\sigma_s(v) = \sum_{t \in V,\; s \neq v \neq t} \sigma_{st}(v)$
$P_s(v) = \{\, u \in V : (u, v) \in E,\ d_G(s, v) = d_G(s, u) + W(u, v) \,\}$
$d_G(s, v)$: shortest distance from s to v
$W(u, v)$: weight of the edge from u to v
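The per-source step can be sketched in C++ as below. This is a minimal sketch under stated assumptions, not the author's code: `Edge`, `Graph`, `accumulateFromSource` and `betweennessSerial` are hypothetical names, the queue is `std::priority_queue` with stale entries skipped on pop, and edge weights are assumed positive, as the clarifications state. The division is exact because every shortest path counted through w extends one of the shortest paths from s to w.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

struct Edge { int to; int weight; };             // hypothetical edge record
using Graph = std::vector<std::vector<Edge>>;    // directed adjacency list

// Adds sigma_s(v), the number of shortest paths that start at s and pass
// through v, to score[v] for every v != s. Assumes positive edge weights.
void accumulateFromSource(const Graph& g, int s, std::vector<std::uint64_t>& score) {
    const std::size_t n = g.size();
    assert(s >= 0 && static_cast<std::size_t>(s) < n && score.size() == n);

    const std::int64_t kInf = std::numeric_limits<std::int64_t>::max();
    std::vector<std::int64_t> dist(n, kInf);     // d_G(s, v)
    std::vector<std::uint64_t> sigma(n, 0);      // sigma_sv
    std::vector<std::vector<int>> pred(n);       // P_s(v)
    std::vector<char> settled(n, 0);
    std::vector<int> order;                      // vertices in settle order

    using Item = std::pair<std::int64_t, int>;   // (distance, vertex)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    dist[s] = 0;
    sigma[s] = 1;
    heap.emplace(0, s);

    while (!heap.empty()) {
        const int u = heap.top().second;
        heap.pop();
        if (settled[u]) continue;                // stale queue entry
        settled[u] = 1;
        order.push_back(u);
        for (const Edge& e : g[u]) {
            const std::int64_t cand = dist[u] + e.weight;
            if (cand < dist[e.to]) {             // strictly shorter path found
                dist[e.to] = cand;
                sigma[e.to] = sigma[u];
                pred[e.to].clear();
                pred[e.to].push_back(u);
                heap.emplace(cand, e.to);
            } else if (cand == dist[e.to]) {     // another path of equal length
                sigma[e.to] += sigma[u];
                pred[e.to].push_back(u);
            }
        }
    }

    // Apply the accumulation formula, farthest vertices first.
    std::vector<std::uint64_t> through(n, 0);    // sigma_s(v)
    for (auto it = order.rbegin(); it != order.rend(); ++it) {
        const int w = *it;
        for (int v : pred[w])
            through[v] += sigma[v] * (1 + through[w] / sigma[w]);
        if (w != s) score[w] += through[w];
    }
}

// Serial driver: one call per source vertex.
std::vector<std::uint64_t> betweennessSerial(const Graph& g) {
    std::vector<std::uint64_t> score(g.size(), 0);
    for (std::size_t s = 0; s < g.size(); ++s)
        accumulateFromSource(g, static_cast<int>(s), score);
    return score;
}
```

As an example with assumed weights, take the edges 2→1 (1), 1→0 (2), 1→4 (1) and 4→0 (1), so that "2 1 0" and "2 1 4 0" tie. From source 2 the sketch then adds 3 to vertex 1 (two paths to vertex 0 plus the path "2 1 4") and 1 to vertex 4.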
Parallel Algorithm:
1. In the parallel algorithm, each call to Dijkstra’s algorithm (one per source vertex) runs on an
independent thread.
2. A reduce step (as in parallel-reduce) sums the betweenness arrays computed for the individual
source vertices into the final betweenness array.
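The two steps above can be sketched with plain `std::thread` as follows. This is an assumption-laden illustration: the source does not say which threading library was used, sources are dealt out to a fixed number of threads in strides rather than one thread per source, and `Scores`, `PerSource` and `parallelBetweenness` are hypothetical names. The per-source work is passed in as a callback so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

using Scores = std::vector<std::uint64_t>;
// Callback that adds the contribution of source s into a thread-local array.
using PerSource = std::function<void(int, Scores&)>;

Scores parallelBetweenness(int n, unsigned threadCount, const PerSource& perSource) {
    assert(n >= 0 && threadCount > 0);
    const std::size_t size = static_cast<std::size_t>(n);

    // One private array per thread, so no locking is needed while counting.
    std::vector<Scores> partial(threadCount, Scores(size, 0));
    std::vector<std::thread> workers;
    workers.reserve(threadCount);
    for (unsigned t = 0; t < threadCount; ++t) {
        workers.emplace_back([&partial, &perSource, n, threadCount, t] {
            // Thread t handles sources t, t + threadCount, t + 2 * threadCount, ...
            for (int s = static_cast<int>(t); s < n; s += static_cast<int>(threadCount))
                perSource(s, partial[t]);
        });
    }
    for (std::thread& w : workers) w.join();

    // Reduce step: sum the per-thread arrays into the final result.
    Scores total(size, 0);
    for (const Scores& p : partial)
        for (std::size_t v = 0; v < size; ++v) total[v] += p[v];
    return total;
}
```

Because each thread writes only to its own array, the reduce step at the end is the only point where the per-thread results are combined.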
Timings
Nodes    Edges    Serial version (ms)    Parallel version (ms)
2000     20       2118                   305
3500     3500     166,480                14,751
Conclusion:
The solution is constrained more by memory bandwidth than by processing speed. In my
experiments, increasing the number of threads negatively affected performance. A more efficient
implementation of the priority queue used in Dijkstra’s algorithm could reduce the running time
significantly. In theory, a Fibonacci heap should give one of the best performances, but in my
experiments it was slower than the STL’s priority_queue. This may be due to poor memory
locality of the Fibonacci heap’s internal data structures. The STL’s priority_queue even outperformed
my binary heap, most probably because my binary heap implementation was not well optimized. Other
heaps of interest, whose effect on the betweenness computation remains to be studied, are the relaxed
heap and the van Emde Boas tree.
References
[1] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, Vol. 1, pp. 269-271, 1959.
[2] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, Vol. 25, pp. 163-199, 2001.