Betweenness Metric
Surya Challa (chillinmoon)
Nov 1, 2010

Betweenness is a metric applied to a vertex within a weighted graph. For the purposes of this problem, we define the betweenness of a vertex T as the number of shortest paths between two vertices in the graph that include T but do not start or end at T. Vertices that occur on many shortest paths between other vertices have higher betweenness scores than those that do not.

Betweenness Centrality

‘Betweenness’ is closely related to ‘Betweenness Centrality’, one of several measures of the ‘Centrality’ of a vertex, which determine the relative importance of a vertex within a graph. The relation between the two is discussed in the ‘Betweenness Centrality’ implementation section.

Formal Definition:

Given a graph G(V, E), define S[s][t](v) as the number of shortest paths from vertex s (start) to vertex t (end) that include vertex v, where s, t, and v are vertices in V. The betweenness score of a vertex v in V is the sum of S[s][t](v) over all s and t in V such that s != v and t != v.

Problem Description:

We will develop a threaded program that reads a weighted graph and computes the betweenness score of each vertex within the graph. The command line parameters are the name of the text file holding the directed graph and a parameter indicating how much to output from the application. All shortest paths in the graph are taken into account, even the path from a vertex to itself, and the betweenness scores of the start and end vertices of a path are not changed when considering that path. Thus, if the shortest path between two vertices is a direct edge, there are no vertices between those two and no change in betweenness scores for any vertex due to that path.
On the other hand, if multiple paths share the same shortest total weight, the betweenness score of each vertex along those paths (other than the start and end vertices) is incremented once for every shortest path in which the vertex appears. For example, if the paths "2 1 0" and "2 1 4 0" are both shortest from vertex "2" to vertex "0", the score for vertex "1" increases by 2 and the score for vertex "4" increases by 1.

Input Description:

The input to the program comes from a text file named on the application command line. The file contains multiple lines. The first line contains a single integer specifying the number of vertices within the graph (N). Each remaining line describes an edge of the graph using three integers (i, j, w). The first two integers are the (zero-based) indices of the start (i) and end (j) vertices of the edge, and the third integer is the weight (w) associated with that edge. The graph will not contain an edge from a vertex to itself. The second command line parameter is a single integer (K) that specifies both the number of vertices with the highest betweenness score and the number of vertices with the lowest betweenness score to be output.

Output Description:

The output generated by the application is a list of the K vertices with the highest betweenness score and the K vertices with the lowest betweenness score, each with its computed betweenness score. Output is printed to stdout. Both lists should be printed in sorted order based on the betweenness score of the vertices. If several vertices share the same betweenness score, the order in which they are printed, and which of them are included in the output, may be chosen arbitrarily. For example, if the requirement is to print the five lowest scoring vertices and seven vertices have a score of 0, any five of the seven may be printed.
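The input and output conventions described above can be sketched as follows. This is a minimal illustration in Python, not the contest implementation: the names `parse_graph`, `top_and_low`, and `format_report` are hypothetical, the parser assumes whitespace-separated integers, and printing both lists in descending score order is an assumption (the text only asks for "sorted order").

```python
def parse_graph(text):
    """Parse 'N' followed by (i, j, w) triples into an adjacency list.

    adj[i] holds (j, w) pairs for every directed edge i -> j of weight w.
    """
    tokens = text.split()
    n = int(tokens[0])
    adj = [[] for _ in range(n)]
    rest = tokens[1:]
    for k in range(0, len(rest) - 2, 3):
        i, j, w = int(rest[k]), int(rest[k + 1]), int(rest[k + 2])
        adj[i].append((j, w))
    return adj


def top_and_low(scores, k):
    """Return the k highest and k lowest (vertex, score) pairs.

    Both lists are in descending score order (an assumption); ties are
    broken arbitrarily, which the problem statement allows.
    """
    k = max(0, min(k, len(scores)))
    ranked = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)
    top = [(v, scores[v]) for v in ranked[:k]]
    low = [(v, scores[v]) for v in ranked[len(ranked) - k:]]
    return top, low


def format_report(scores, k):
    """Build the report text in the layout of the output description."""
    top, low = top_and_low(scores, k)
    lines = ["Top %d betweenness vertices" % k, "Vertex score"]
    lines += ["%d %d" % pair for pair in top]
    lines += ["Low %d betweenness vertices" % k, "Vertex score"]
    lines += ["%d %d" % pair for pair in low]
    return "\n".join(lines)


# Hypothetical scores for a five-vertex graph, with K = 2.
print(format_report([5, 1, 9, 0, 2], 2))
```

Reading all tokens at once, rather than line by line, keeps the parser tolerant of stray blank lines; a stricter reader could validate that each line holds exactly three integers.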
Command line example:

Btween.exe indata01.txt 2

Input file example, indata01.txt:

5
0 1 12
1 0 16
0 2 7
2 1 5
3 0 13
0 4 15
1 3 11
3 1 10
4 0 12
1 4 4
2 3 3
4 2 9
3 4 8
4 3 15

Output file example:

Top 2 betweenness vertices
Vertex score
2 8
4 6
Low 2 betweenness vertices
Vertex score
3 2
0 0

Timing:

The entire execution time of the application is used for scoring. For the most accurate timing results, the code should measure its own execution time and print it to stdout; otherwise an external stopwatch will be used to measure the execution time.

Additional Clarifications:

Additional clarifications were provided by the judges at the contest forum (http://software.intel.com/en-us/forums/p2-m4-betweenness-metric/). Below is a summary:

1. The graph will not contain any parallel edges.
2. The graph is connected, i.e., every vertex can be reached from every other vertex.
3. A uint64_t is sufficient to hold betweenness scores, a signed 32-bit integer is sufficient for the number of vertices, and edge weights are all positive and fit in a signed 32-bit integer.
4. For the purposes of this problem, at least one edge must be travelled from one vertex to another to qualify as a shortest path. The weights along the diagonal (from a node to itself) are assumed to be infinity. Thus, the path 0-0 would not be a shortest path.

Serial Algorithm:

Measures of centrality are used to determine the relative importance of a vertex within a graph, such as how important a person is within a social network, how important a room is within a building, or how well-used a road is within an urban network. We derive the ‘Betweenness’ algorithm from that of ‘Betweenness Centrality’.

Betweenness Centrality:

From http://en.wikipedia.org/wiki/Centrality: for a graph G := (V, E) with n vertices, the betweenness $C_B(v)$ of a vertex v is computed as follows:

1. For each pair of vertices (s, t), compute all shortest paths between them.
2. For each pair of vertices (s, t), determine the fraction of shortest paths that pass through the vertex in question (here, vertex v).
3. Sum this fraction over all pairs of vertices (s, t).

Or, more succinctly:

$$C_B(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of shortest paths from s to t, and $\sigma_{st}(v)$ is the number of shortest paths from s to t that pass through the vertex v.

Ulrik Brandes's algorithm [2] is considered to be the fastest algorithm for computing betweenness centrality, running in $O(nm + n^2 \log n)$ time, where n and m are the number of vertices and edges in the graph, respectively. We have adapted Brandes's algorithm to count the total number of shortest paths going through a vertex from a given source vertex. We repeatedly call Dijkstra's shortest path algorithm [1], once per source vertex, to identify all shortest paths originating from every vertex in the graph.

1. For each vertex in the graph:
   a. Use the modified Brandes algorithm to determine the number of shortest paths terminating at each vertex in the graph.
   b. Compute the total number of shortest paths passing through every vertex using the formula below (derived using the same approach as in [2]):

$$\sigma_s(v) = \sum_{w \,:\, v \in P_s(w)} \sigma_{sv} \left( 1 + \frac{\sigma_s(w)}{\sigma_{sw}} \right)$$

where

$\sigma_{sv}$ is the number of shortest paths from s to v,
$\sigma_{st}(v)$ is the number of shortest paths from s to $t \in V$ that have v on them,
$\sigma_s(v) = \sum_{t \in V,\ s \neq v \neq t} \sigma_{st}(v)$,
$P_s(v) = \{ u \in V : \{u, v\} \in E,\ d_G(s, v) = d_G(s, u) + W(u, v) \}$,
$d_G(s, v)$ is the shortest distance from s to v,
$W(u, v)$ is the weight of the edge from u to v.

Parallel Algorithm:

1. In the parallel algorithm, each call to Dijkstra's algorithm runs on an independent thread.
2. A reduce step (as in parallel-reduce) sums the betweenness arrays computed from each source vertex into the final betweenness array.

Timings

Nodes                   2000    3500
Edges                   20      3500
Serial version (ms)     2118    166,480
Parallel version (ms)   305     14,751

Conclusion:

The solution is constrained by memory bandwidth rather than by processing speed. In my experiments, increasing the number of threads negatively affected performance.
An efficient implementation of the priority queue used in Dijkstra's algorithm could reduce the running time significantly. In theory, a Fibonacci heap should give one of the best performances, but in my experiments it performed worse than the STL priority_queue. This may be due to the loss of locality in the memory of its internal data structure. The STL priority_queue performed even better than my binary heap, but that is most probably due to a poorly optimized binary heap implementation. Other heaps of interest, whose effect on betweenness computation remains to be studied, are the relaxed heap and the van Emde Boas tree.

References

[1] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, Vol. 1, pp. 269-271, 1959.
[2] U. Brandes, “A faster algorithm for betweenness centrality,” Journal of Mathematical Sociology, Vol. 25, pp. 163-177, 2001.