Approximation Algorithms for Betweenness Centrality
Elisabetta Bergamini
Karlsruhe Institute of Technology (KIT) · Institute of Theoretical Informatics · Parallel Computing Group
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
www.kit.edu

Introduction | Betweenness centrality
Betweenness centrality (BC) measures the participation of nodes in the shortest paths of the network.
Nodes with high betweenness lie on many shortest paths between pairs of nodes.
Given $G = (V, E)$ and $v \in V$:
$$c_B(v) = \sum_{\substack{s,t \in V \\ s \neq v \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}}$$
where:
  $\sigma_{st}$ = number of shortest paths between $s$ and $t$
  $\sigma_{st}(v)$ = number of shortest paths between $s$ and $t$ that go through $v$
[Figure: example network; image source: geoidin.wordpress.com]

Recall | Brandes's algorithm
For each node $s$:
First step: run an SSSP (Dijkstra or BFS) from $s$. While visiting nodes, also keep track of the number of shortest paths and of the predecessors.
Second step: sort the nodes by decreasing distance from $s$. For each node $v$, compute the dependency
$$\delta_s(v) = \sum_{w \in \mathrm{succ}(v)} \frac{\sigma_{sv}}{\sigma_{sw}} \left(1 + \delta_s(w)\right)$$
and add $\delta_s(v)$ to $c_B(v)$.

Approximation algorithms
Complexity of Brandes's algorithm: $O(nm)$ for unweighted graphs and $O(n(n \log n + m))$ for weighted graphs.
This is too expensive for graphs with millions or billions of edges, hence approximation.
What do we want from an approximation algorithm?
It should give us an unbiased estimator for the betweenness of each node $v$: $E(\tilde{c}_B(v)) = c_B(v)$.
Ideally, it should also give us some guarantee on the quality of the approximation:
  Absolute error guarantee: $c_B(v) - \varepsilon \le \tilde{c}_B(v) \le c_B(v) + \varepsilon$
  Relative error guarantee: $c_B(v)/\rho \le \tilde{c}_B(v) \le c_B(v) \cdot \rho$

A simple approximation [Brandes and Pich, 2006]
Choose a set $S = \{s_1, \dots, s_k\} \subseteq V$ of $k$ source nodes.
Each $s_i$ is chosen uniformly at random in $V$, i.e. $P(s_i = v) = 1/n$ for all $v \in V$.
For each $s_i$ and for each node $v$, compute $\delta_{s_i}(v)$ as in Brandes's algorithm.
The estimate is
$$\tilde{c}_B(v) = \frac{n}{k} \sum_{s_i \in S} \delta_{s_i}(v)$$
[Figure: example graph with node $v$ and sampled sources $s_1, \dots, s_4$]
Exact value in the example: $c_B(v) = 12 \cdot 7 + 2 \cdot 18 + 5 \cdot 14 = 190$
Estimate in the example: $\tilde{c}_B(v) = (3 \cdot 7 + 18) \cdot \frac{20}{4} = 195$

A simple approximation [Brandes and Pich, 2006]: unbiased estimator
For $k = 1$ (only one source node):
$$E(\tilde{c}_B(v)) = \sum_{s \in V} \frac{1}{n} \cdot n \cdot \delta_s(v) = c_B(v)$$
With $k$ sources, we take the average of the values $n \cdot \delta_s(v)$ over $s \in S$.
The expectation of the average is the average of the expectations, so $\tilde{c}_B(v)$ is an unbiased estimator.

A simple approximation [Brandes and Pich, 2006]: limitation
Unfortunately, the approach has a major limitation: it overestimates the neighbors of degree-1 nodes.
Consider a degree-1 node $w$ with neighbor $v$.
If $w$ is sampled, the betweenness of node $v$ will be overestimated.
[Figure: degree-1 node $w$ attached to $v$]
[Figure: the example graph with sampled sources $s_1, \dots, s_5$]
Exact value in the example: $c_B(v) = 12 \cdot 7 + 2 \cdot 18 + 5 \cdot 14 = 190$
Estimate in the example: $\tilde{c}_B(v) = (3 \cdot 7 + 18 + 18) \cdot \frac{20}{5} = 228$

A new approach (GSS) [Geisberger et al., 2008]
A Generalized Framework for Betweenness Approximation
Length function $l : E \to \mathbb{R}$. For a path $P = \langle e_1, \dots, e_k \rangle$, let $l(P) := \sum_{i=1}^{k} l(e_i)$.
Scaling function $f : [0, 1] \to [0, 1]$.
Let $P$ be of the form $P = \langle s, \dots, v, \dots, t \rangle$ and let $P_{sv} = \langle s, \dots, v \rangle$.
We define the scaled contribution $\delta_P(v)$ as
$$\delta_P(v) := \frac{f(l(P_{sv})/l(P))}{\sigma_{st}}$$
Given a shortest path $P = \langle s, \dots, v, \dots, t \rangle$, we call $P'$ the transposed path, i.e. $P' = \langle t, \dots, v, \dots, s \rangle$, which is a shortest path in the transposed graph $G'$.
The scaled contribution for $v$ in $P'$ is
$$\delta_{P'}(v) := \frac{1 - f(l(P_{sv})/l(P))}{\sigma_{st}}$$
[Figure: path from $s$ through $v$ to $t$ and its transposed path $P'$]

A new approach (GSS) [Geisberger et al., 2008]: sampling
For each of the $k$ samples:
  Sample a node $x \in V$ uniformly at random.
  With probability 1/2 run a forward SSSP from $x$, otherwise a backward SSSP from $x$ (an SSSP on $G'$).
Forward search:
$$\delta_x^{(f)}(v) := \sum_{y \in V} \sum \{\delta_P(v) : P \in SP_{xy}(v)\}$$
Backward search:
$$\delta_x^{(b)}(v) := \sum_{y \in V} \sum \{\delta_{P'}(v) : P \in SP_{yx}(v)\}$$
Estimate:
$$\tilde{c}_B(v) = \frac{2n}{k} \sum_{x \in S} \delta_x(v), \qquad \delta_x(v) = \begin{cases} \delta_x^{(f)}(v) & \text{if forward} \\ \delta_x^{(b)}(v) & \text{if backward} \end{cases}$$

A new approach (GSS) [Geisberger et al., 2008]: example with constant $f(x) = 1/2$
$$\delta_P(v) := \frac{f(l(P_{sv})/l(P))}{\sigma_{st}} = \frac{1}{2} \cdot \frac{1}{\sigma_{st}}$$
Forward contribution of a sampled node $x$:
$$\delta_x(v) := \sum_{y \in V} \sum \{\delta_P(v) : P \in SP_{xy}(v)\} = \frac{1}{2} \sum_{y \in V} \frac{\sigma_{xy}(v)}{\sigma_{xy}}$$
This is the algorithm by Brandes and Pich!
(only difference: forward and backward searches)

A new approach (GSS) [Geisberger et al., 2008]: example with $f(x) = x$
$$\delta_P(v) := \frac{l(P_{sv})/l(P)}{\sigma_{st}} = \frac{d(s, v)/d(s, t)}{\sigma_{st}}$$
It can be proven that in this case:
$$\delta_x(v) := \sum_{w \in \mathrm{succ}_x(v)} \frac{d(x, v)}{d(x, w)} \cdot \frac{\sigma_{xv}}{\sigma_{xw}} \cdot \left(1 + \delta_x(w)\right)$$
Same procedure as Brandes and Pich, but with scaled contributions.
Nodes close to the sampled node get less weight.
[Figure: the example graph with sampled sources $s_1, \dots, s_5$]
Exact value in the example: $c_B(v) = 12 \cdot 7 + 2 \cdot 18 + 5 \cdot 14 = 190$
$\delta_{s_1}(v) = 3 \cdot (3/4) + 3 \cdot (3/5) + 3/6 = 4.55$
$\delta_{s_2}(v) = 3 \cdot (1/2) + 3 \cdot (1/3) + 1/4 = 2.75$
$\tilde{c}_B(v) = (4.55 + 2.75 + \dots) \cdot \frac{2 \cdot 20}{5} = 193.47$

A new approach (GSS) [Geisberger et al., 2008]: unbiased estimator
For $k = 1$ (only one source node), each node and each search direction is chosen with probability $1/(2n)$:
$$E(\tilde{c}_B(v)) = E(2n \cdot \delta_x(v)) = 2n \cdot \frac{\sum_{s \in V} \delta_s^{(f)}(v) + \sum_{t \in V} \delta_t^{(b)}(v)}{2n}$$
$$= \sum_{s,t \in V} \sum \left\{ \frac{f(l(P_{sv})/l(P)) + 1 - f(l(P_{sv})/l(P))}{\sigma_{st}} : P \in SP_{st}(v) \right\}$$
$$= \sum_{s,t \in V} \frac{|SP_{st}(v)|}{\sigma_{st}} = \sum_{s,t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}} = c_B(v)$$

RK algorithm [Riondato and Kornaropoulos, 2014]
A set of $r$ shortest paths between vertex pairs $(s_i, t_i)$, $i = 1, \dots, r$, is sampled.
$\tilde{c}_B(v)$: fraction of sampled paths that go through $v$.
[Figure: three sampled paths $s_1 \to t_1$, $s_2 \to t_2$, $s_3 \to t_3$; each adds $+\frac{1}{3}$ to the nodes it goes through]
Each shortest path $p_{st}$ between $s$ and $t$ must be sampled with probability
$$P = \frac{1}{n(n - 1)} \cdot \frac{1}{\sigma_{st}}$$

RK algorithm | Paths sampling
Sample a vertex pair $(s, t)$ uniformly at random (there are $n(n - 1)$ pairs).
Run an extended SSSP from $s$: distances, number of shortest paths and list of predecessors.
Starting from $t$, select a predecessor $z$ with probability $\sigma_{sz}/\sigma_{st}$.
Repeat this until we reach $s$.
Every shortest path between $s$ and $t$ has the same probability of being sampled.
[Figure: node $t$ with predecessors $z_1, z_2, z_3$, where $P(z_1) = \frac{2}{4}$, $P(z_2) = \frac{1}{4}$, $P(z_3) = \frac{1}{4}$]

RK algorithm [Riondato and Kornaropoulos, 2014]: unbiased estimator
Given a path $P$, let $\delta_P(v) = 1$ if $v \in P$ and $\delta_P(v) = 0$ otherwise.
For $k = 1$ (only one sampled path):
$$E(\tilde{c}_B(v)) = E(n(n - 1) \cdot \delta_P(v)) = n(n - 1) \sum_{\substack{s,t \in V \\ s \neq t}} \sum_{P \in SP_{st}} \frac{1}{n(n - 1)} \cdot \frac{1}{\sigma_{st}} \cdot \delta_P(v)$$
$$= \sum_{\substack{s,t \in V \\ s \neq t}} \frac{\sigma_{st}(v)}{\sigma_{st}} = c_B(v)$$

RK algorithm [Riondato and Kornaropoulos, 2014]: absolute error guarantee
Given two arbitrary numbers $\varepsilon$ and $\delta$, it is possible to prove that, if the number of sampled paths is at least
$$r = \frac{c}{\varepsilon^2} \left( \lfloor \log_2(VD - 2) \rfloor + 1 + \ln \frac{1}{\delta} \right)$$
then $c_B(v) - \varepsilon \le \tilde{c}_B(v) \le c_B(v) + \varepsilon$ with probability at least $1 - \delta$.
$VD$ = vertex diameter: the number of nodes in the shortest path with the maximum number of nodes.
Unweighted graphs: this is the same as the shortest path with maximum weight. Weighted graphs: the two are unrelated.

To summarize...
The BP algorithm is simple, but its estimates tend to be too high for nodes close to a sampled source.
GSS solves this problem by giving "less importance" to nodes that are close to the source.
RK samples single paths instead of source nodes. This allows us to prove a theoretical guarantee.
Each sample of RK is easier to compute (the SSSP from $s$ can stop once $t$ is reached, and no dependencies are computed).
However, in practice GSS works better, because it uses more information from each SSSP.
Moral of the story: what can be proved to work well is not always the best in practice :-)
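
The source-sampling estimator of Brandes and Pich summarized above can be sketched in Python as follows. This is only a minimal illustration for unweighted graphs, not the authors' implementation: the function names `dependencies`, `estimate_betweenness` and `sample_sources`, and the representation of the graph as a dictionary of adjacency lists, are assumptions made for this example.

```python
import random
from collections import deque


def dependencies(adj, s):
    """One iteration of Brandes's algorithm from source s (unweighted graph).

    adj is a hypothetical adjacency-list dict {node: [neighbors]}.
    Returns delta_s(v) for every node v reachable from s.
    """
    dist = {s: 0}       # BFS distance from s
    sigma = {s: 1}      # number of shortest paths from s
    preds = {s: []}     # predecessors on shortest paths
    order = []          # nodes in order of non-decreasing distance
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in dist:              # w is seen for the first time
                dist[w] = dist[v] + 1
                sigma[w] = 0
                preds[w] = []
                queue.append(w)
            if dist[w] == dist[v] + 1:     # v precedes w on a shortest path
                sigma[w] += sigma[v]
                preds[w].append(v)
    # Second step: visit nodes by decreasing distance and accumulate.
    delta = {v: 0.0 for v in order}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0      # the source itself is excluded (s != v)
    return delta


def estimate_betweenness(adj, sources):
    """Estimate (n / k) * sum of delta_s(v) over the k given sources."""
    n, k = len(adj), len(sources)
    total = {v: 0.0 for v in adj}
    for s in sources:
        for v, d in dependencies(adj, s).items():
            total[v] += d
    return {v: n / k * t for v, t in total.items()}


def sample_sources(adj, k, rng):
    """Draw k sources uniformly at random (with replacement) from V."""
    nodes = list(adj)
    return [rng.choice(nodes) for _ in range(k)]


# Usage on the path 0 - 1 - 2 - 3 - 4 with k = 3 random sources.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
estimate = estimate_betweenness(path, sample_sources(path, 3, random.Random(0)))
```

Passing every node exactly once as a source (`estimate_betweenness(path, list(path))`) makes the factor n/k equal to 1 and reproduces the exact values, which for this path are 6, 8 and 6 for the inner nodes 1, 2 and 3, since every ordered pair (s, t) is counted. The GSS variant with f(x) = x would, under the same assumptions, only multiply each accumulated term by d(x, v)/d(x, w) and use the factor 2n/k.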