Black Box Clustering and Parallel H-LU Factorisation

Ronald Kriemann
Max Planck Institute for Mathematics in the Sciences, Leipzig
Winterschool on H-Matrices 2009
HLIBpro

Overview
1 Motivation
2 Graph Partitioning
3 Admissibility
4 Nested Dissection
5 Parallelisation

Motivation

Model Problem
Consider
$$-\Delta u = 0 \quad \text{in } \Omega = [0,1]^2.$$
Using a uniform grid with step width $h$ and standard piecewise linear finite elements with nodal points $x_i$, $i \in I$, one obtains the stiffness matrix $A$.
[Figure: stiffness matrix $A$ of the model problem]

Matrix Graph
Define the matrix graph $G(A) = (V_A, E_A)$ of $A \in \mathbb{R}^{I \times I}$ as
$$V_A := I, \qquad E_A := \{ (i,j) \in I \times I : i \neq j \wedge a_{ij} \neq 0 \},$$
i.e. edges in the graph are defined by the sparsity pattern of the stiffness matrix.

Remark
Non-zero entries $a_{ij}$ only exist in $A$ if $i$ and $j$ are neighboured.
[Figure: matrix graph of the model problem]

Define the distance $d_G(i,j)$ between nodes $i, j \in I$ as the length of the shortest path in $G(A)$. Then, for $i, j \in I$ we have
$$\| x_i - x_j \|_2 \le d_G(i,j)\, h,$$
i.e. distance in $\mathbb{R}^2$ is mapped to distance in $G(A)$.
Example (nodes $i$, $j$, $k$ in the figure):
$$\| x_i - x_j \|_2 = \sqrt{13}\, h, \quad d_G(i,j) = 5, \qquad \| x_i - x_k \|_2 = \sqrt{5}\, h, \quad d_G(i,k) = 3.$$

Clustering via Graph Distance
Since nodes in $G(A)$ with small distance are geometrically neighboured, one can use graph distance to cluster indices.
[Figure: index set $I$ split into sub clusters $I_0$ and $I_1$]
Recursively partition sub graphs for cluster tree construction.

Graph Partitioning

Requirements
Let $A \in \mathbb{R}^{I \times I}$ be a sparse matrix and $G = G(A) = (V_A, E_A)$ the corresponding matrix graph.
Furthermore, let
$$\mathrm{diam}(G) := \max_{i,j \in V_A} d_G(i,j), \qquad \mathrm{diam}_G(V) := \max_{i,j \in V} d_G(i,j), \quad V \subseteq V_A,$$
denote the diameter of the graph and of a sub graph, respectively.

For cluster tree construction, one needs a graph partitioning algorithm with the following properties:
• compact sub graphs (small diameter),
• small edge-cut (small number of edges connecting sub graphs).

Remark
No edges between sub graphs corresponds to decoupled clusters and therefore to a block diagonal matrix.

Partitioning via Breadth First Search
Algorithm:
1 determine two nodes $i, j \in V_A$ with (almost) maximal distance,
2 perform simultaneous BFS from $i$ and $j$ to construct sub clusters:
  • per step, add unvisited neighbours of nodes in sub clusters,
3 recurse in sub graphs.

General Graph Partitioning for Clustering
BFS based graph partitioning yields compact sub graphs, but not necessarily a minimal edge-cut. The edge-cut can be improved using the Fiduccia-Mattheyses algorithm (see Literature).
[Figure: example partition, #edge-cut: 8 before and 6 after the improvement]

In graph theory, the graph partitioning problem is defined as: Given a graph $G = (V, E)$, a partitioning $P = \{V_1, V_2\}$ of $V$, with $V_1 \cap V_2 = \emptyset$ and $V = V_1 \cup V_2$, is sought such that $\#V_1 \sim \#V_2$ and
$$I_G(V_1, V_2) := \#\{ (i,j) \in E : i \in V_1 \wedge j \in V_2 \} = \min.$$

Unfortunately, the graph partitioning problem is NP-hard. But good approximation algorithms exist and are implemented in open source software libraries, e.g.:
• METIS, Scotch (multi-level graph partitioning),
• CHACO (multi-level and spectral graph partitioning).
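The BFS based bisection and the edge-cut $I_G(V_1, V_2)$ can be illustrated with a small Python sketch. It is not the implementation behind the slides: the graph is assumed to be connected and stored as an adjacency dictionary, the names `bfs_distances`, `far_apart_nodes`, `bfs_bisect`, `edge_cut` and `grid_graph` are hypothetical, and the two start nodes are found with a double BFS sweep, which is only one possible heuristic for "(almost) maximal distance".

```python
from collections import deque


def bfs_distances(adj, start):
    """Graph distance d_G(start, v) for every node v reachable from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def far_apart_nodes(adj):
    """Heuristic (assumption): two BFS sweeps give nodes with almost maximal distance."""
    start = next(iter(adj))
    d0 = bfs_distances(adj, start)
    i = max(d0, key=d0.get)          # farthest node from an arbitrary start
    d1 = bfs_distances(adj, i)
    j = max(d1, key=d1.get)          # farthest node from i
    return i, j


def bfs_bisect(adj):
    """Simultaneous BFS from two far apart nodes; returns the node sets (V1, V2).

    Assumes a connected graph with at least two nodes.
    """
    i, j = far_apart_nodes(adj)
    owner = {i: 0, j: 1}
    fronts = [[i], [j]]
    while fronts[0] or fronts[1]:
        for side in (0, 1):
            new_front = []
            for u in fronts[side]:
                for v in adj[u]:
                    if v not in owner:   # per step, add unvisited neighbours
                        owner[v] = side
                        new_front.append(v)
            fronts[side] = new_front
    V1 = {v for v, s in owner.items() if s == 0}
    V2 = {v for v, s in owner.items() if s == 1}
    return V1, V2


def edge_cut(adj, V1, V2):
    """I_G(V1, V2): number of edges with one end in V1 and the other in V2."""
    return sum(1 for u in V1 for v in adj[u] if v in V2)


def grid_graph(m):
    """Test graph (assumption): 4-neighbour grid graph on m x m nodes."""
    adj = {(r, c): [] for r in range(m) for c in range(m)}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (0, 1)):
            nb = (r + dr, c + dc)
            if nb in adj:
                adj[(r, c)].append(nb)
                adj[nb].append((r, c))
    return adj
```

Calling `bfs_bisect(grid_graph(4))` splits the 16 grid nodes into two connected parts, and `edge_cut` evaluates $I_G(V_1, V_2)$ for the result. Recursing on the two induced sub graphs would give the cluster tree.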
General black box clustering algorithm:

function blackbox_cluster( G = (V, E) )
  if #V ≤ n_min then
    return cluster t := V;
  else
    {G_1, G_2} := partition( G );
    t_1 := blackbox_cluster( G_1 );
    t_2 := blackbox_cluster( G_2 );
    return cluster t := V with S(t) := {t_1, t_2};
  end if
end

Here, partition implements the general graph partitioning algorithm, e.g. from METIS.

[Figure: partitions of an example graph computed by four methods: BFS ($\#I_G = 21$), BFS+FM ($\#I_G = 11$), METIS ($\#I_G = 12$), Scotch ($\#I_G = 12$)]

Admissibility

Prerequisites
Standard admissibility is defined by
$$\min(\mathrm{diam}(\Omega_t), \mathrm{diam}(\Omega_s)) \le \eta\, \mathrm{dist}(\Omega_t, \Omega_s)$$
with support $\Omega_i$ for each cluster $i$. Hence, it uses geometrical data, which is unavailable in the black box setting.

Distance in Graphs
For $V_1, V_2 \subset V$, the distance between $V_1$ and $V_2$ is defined as
$$\mathrm{dist}_G(V_1, V_2) := \min_{i \in V_1,\, j \in V_2} \mathrm{dist}_G(i,j)$$
with $\mathrm{dist}_G(i,j)$ := length of the shortest path between $i$ and $j$ in $G$.

Weak Admissibility
The simplest admissibility condition for a block cluster $(t, s)$ is defined by
$$\mathrm{adm}_{\mathrm{weak}}(t,s) := \begin{cases} \text{true}, & \text{if } \mathrm{dist}_G(t,s) > 1, \\ \text{false}, & \text{otherwise}, \end{cases}$$
i.e. $(t, s)$ is admissible if no edge connects $t$ and $s$ in $G$.
Example (clusters $t_1$, $t_2$, $t_3$ in the figure):
$\mathrm{adm}_{\mathrm{weak}}(t_1, t_2) = \text{true}$, $\mathrm{adm}_{\mathrm{weak}}(t_1, t_3) = \text{false}$.

Weak admissibility is cheap to test and produces effective partitions for H-arithmetic (see experiments).
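The weak admissibility test needs only the adjacency information of $G$. The following Python sketch shows one way to write it; the adjacency-dictionary representation, the assumption that $G$ is undirected (symmetric adjacency lists) and the function name `adm_weak` are choices made for this illustration only.

```python
def adm_weak(adj, t, s):
    """Weak admissibility: True iff dist_G(t, s) > 1.

    adj : dict mapping each node to an iterable of its neighbours
          (assumed symmetric, i.e. an undirected graph)
    t, s: iterables of nodes forming the two clusters
    """
    t, s = set(t), set(s)
    if t & s:
        return False                      # shared node: dist_G(t, s) = 0
    # scan the edges leaving the smaller cluster only
    small, large = (t, s) if len(t) <= len(s) else (s, t)
    return all(v not in large for u in small for v in adj[u])
```

For a path graph 0-1-2-3-4-5 with clusters {0, 1}, {2, 3} and {4, 5}, the pair ({0, 1}, {4, 5}) is weakly admissible, while ({0, 1}, {2, 3}) is not because of the edge between 1 and 2. The cost of the test is proportional to the number of edges leaving the smaller cluster.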
Standard Admissibility
The standard admissibility is defined by
$$\mathrm{adm}_{\mathrm{std}}(t,s) := \begin{cases} \text{true}, & \text{if } \min(\mathrm{diam}_G(t), \mathrm{diam}_G(s)) \le \eta\, \mathrm{dist}_G(t,s), \\ \text{false}, & \text{otherwise}, \end{cases}$$
i.e. the equivalent of the geometrical admissibility.

Since computing the diameter of and the distance between clusters in $G$ costs $O(n^2)$, the admissibility is tested as:
• choose a node $i \in t$ and a node $j \in t$ with maximal distance $\mathrm{dist}_G(i,j)$ from $i$,
• then $\mathrm{diam}_G(t) \le 2\, \mathrm{dist}_G(i,j) =: \widetilde{\mathrm{diam}}$,
• construct the surrounding $t'$ of $t$ in $G$, i.e. all nodes within distance $\frac{1}{\eta} \widetilde{\mathrm{diam}}$ of $t$,
• if $t' \cap s = \emptyset$, then $\mathrm{adm}_{\mathrm{std}}(t,s) = \text{true}$.
[Figure: clusters $t_1$, $t_2$, $t_3$]

Numerical Examples
H-LU factorisation of Model Problem:

              Geometric                                Black Box
N          Time (sec)  Mem (MB)  δ                 Time (sec)  Mem (MB)  δ
$253^2$       3.8         76     $2\cdot10^{-4}$       6.6         86     $1\cdot10^{-4}$
$358^2$      10.0        169     $1\cdot10^{-4}$      15.7        187     $6\cdot10^{-5}$
$511^2$      24.1        374     $7\cdot10^{-5}$      41.7        441     $3\cdot10^{-5}$
$729^2$      61.1        840     $4\cdot10^{-5}$     116.1       1020     $1\cdot10^{-5}$
$1023^2$    144.9       1780     $2\cdot10^{-5}$     250.8       2110     $8\cdot10^{-6}$
$40^3$       79.1        285     $1\cdot10^{-3}$     106.5        292     $1\cdot10^{-3}$
$51^3$      194.5        634     $1\cdot10^{-3}$     326.1        763     $7\cdot10^{-4}$
$64^3$      520.3       1400     $1\cdot10^{-3}$     896.4       1760     $4\cdot10^{-4}$
$81^3$     1440.0       3560     $5\cdot10^{-4}$    2444.8       4330     $2\cdot10^{-4}$
$102^3$    3875.5       8070     $4\cdot10^{-4}$    6575.7       9940     $2\cdot10^{-4}$

The accuracy of the H-arithmetic is defined by $\delta$, which is chosen such that
$$\| I - (L_H U_H)^{-1} A \|_2 \le 10^{-4}.$$

Nested Dissection

Vertex Separator
In nested dissection the two constructed sub graphs of a partition have to be separated via a vertex separator.
[Figure: matrix graph with vertex separator and the corresponding matrix]
Especially suited are graph partitioning algorithms yielding a minimal edge-cut, thereby maximising the size of the zero off-diagonal matrix blocks.

Constructing the Vertex Separator
Let $V_1, V_2 \subset V$, $V_1 \cap V_2 = \emptyset$, be a partition of $G = (V, E)$ and let
$$\bar{E} = \{ (i,j) \in E : i \in V_1, j \in V_2 \}$$
be the edge-cut of $V_1, V_2$.
A vertex separator $\bar{V}$ for $V_1, V_2$ can be obtained by computing a vertex cover of $\bar{E}$, i.e. a set of nodes incident to all edges in $\bar{E}$.
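To make the idea of a vertex cover of the edge-cut concrete, here is a hedged Python sketch of one greedy construction. The rule of always taking the end point that lies in the currently larger node set is an assumption aimed at keeping the two remaining sets balanced; the function name `vertex_separator` and the adjacency-dictionary input are likewise hypothetical.

```python
def vertex_separator(adj, V1, V2):
    """Greedy vertex cover of the edge-cut between V1 and V2.

    Returns (V1', V2', S): the reduced node sets and the separator S,
    such that no edge of the graph connects V1' and V2'.
    """
    parts = [set(V1), set(V2)]
    # edge-cut: all edges (i, j) with i in V1 and j in V2
    cut = {(i, j) for i in parts[0] for j in adj[i] if j in parts[1]}
    sep = set()
    while cut:
        i, j = next(iter(cut))                 # any remaining cut edge
        # take the end point that lies in the currently larger set
        if len(parts[0]) >= len(parts[1]):
            v, side = i, 0
        else:
            v, side = j, 1
        parts[side].discard(v)
        sep.add(v)
        # all cut edges incident to v are now covered
        cut = {e for e in cut if v not in e}
    return parts[0], parts[1], sep
```

For the path graph 0-1-2-3 with $V_1 = \{0, 1\}$ and $V_2 = \{2, 3\}$ the only cut edge is (1, 2); the sketch moves node 1 into the separator and leaves {0} and {2, 3} decoupled. A greedy cover of this kind is in general not a minimum vertex cover.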
Algorithm (with $\bar{E}$ the edge-cut between $V_1$ and $V_2$ and $\bar{V}$ the vertex separator, initially empty):
Loop while $\bar{E} \neq \emptyset$:
• choose $(i, j) \in \bar{E}$;
• choose $v \in \{i, j\}$ such that $v \in V'$ with $\#V' = \max_k \#V_k$, i.e. the end point in the larger of $V_1$, $V_2$;
• $\bar{V} := \bar{V} \cup \{v\}$; $V' := V' \setminus \{v\}$;
• $\bar{E} := \bar{E} \setminus \{ (i', j') \in \bar{E} : v \in \{i', j'\} \}$, i.e. remove all cut edges incident to $v$.

Subdividing the Vertex Separator
In contrast to classical nested dissection, H-matrices also use a cluster tree for the indices in the vertex separator. Hence, further subdivision is necessary.
Problem: restricting $G$ to the nodes in $\bar{V}$ might remove important edges.
[Figure: example of the restricted graph $G|_{\bar{V}}$]
Therefore, graph partitioning for the vertex separator is performed in the sub graph induced by $V_1$, $V_2$ and $\bar{V}$.

Modify the BFS based algorithm for the vertex separator:
• choose start nodes for BFS in $\bar{V}$,
• perform a BFS step only for the smaller node set to achieve balance,
• stop the BFS iteration when all nodes in $\bar{V}$ have been visited.
For further subdivision, only consider visited nodes to reduce complexity.

Remark
Still open: efficient construction of a minimal surrounding graph for the subdivision of the vertex separator. Unfortunately, no graph partitioning packages, e.g. METIS, Scotch, etc., are applicable to vertex separator partitioning.

Numerical Experiments
H-LU factorisation of Model Problem using nested dissection:

              Geometric                                Black Box
N          Time (sec)  Mem (MB)  δ                 Time (sec)  Mem (MB)  δ
$253^2$       0.9         51     $1\cdot10^{-3}$       1.3         47     $3\cdot10^{-5}$
$358^2$       1.9         86     $4\cdot10^{-4}$       2.9         94     $2\cdot10^{-5}$
$511^2$       4.5        212     $2\cdot10^{-4}$       6.5        198     $9\cdot10^{-6}$
$729^2$       9.6        371     $1\cdot10^{-4}$      15.0        402     $5\cdot10^{-6}$
$1023^2$     20.2        878     $6\cdot10^{-5}$      31.6        819     $2\cdot10^{-6}$
$40^3$       12.6         99     $1\cdot10^{-2}$      32.7        135     $3\cdot10^{-4}$
$51^3$       46.9        300     $3\cdot10^{-3}$      97.6        323     $2\cdot10^{-4}$
$64^3$      117.4        592     $2\cdot10^{-3}$     289.1        719     $1\cdot10^{-4}$
$81^3$      269.8       1410     $1\cdot10^{-3}$     804.3       1570     $8\cdot10^{-5}$
$102^3$     752.3       3020     $1\cdot10^{-3}$    1907.3       3370     $6\cdot10^{-5}$

Again, the H-accuracy $\delta$ is chosen such that
$$\| I - (L_H U_H)^{-1} A \|_2 \le 10^{-4}.$$

Comparison of algebraic H-LU factorisation with direct solvers for
$$-\Delta u + \lambda u = f \quad \text{in } \Omega = [0,1]^2.$$
[Figure: time for setup in sec. vs. no. of unknowns (up to $2.0\cdot10^6$) for H-Matrix, Pardiso, MUMPS, UMFPACK, SuperLU, Spooles]

Comparison of algebraic H-LU factorisation with direct solvers for
$$-\Delta u + \lambda u = f \quad \text{in } \Omega = [0,1]^3.$$
[Figure: time for setup in sec. vs. no. of unknowns (up to $2.0\cdot10^6$) for H-Matrix, Pardiso, MUMPS, UMFPACK, SuperLU, Spooles]

Parallelisation

Direct Domain Decomposition
Graph $G$ is partitioned into $p$ sub graphs decoupled by a single vertex separator (shown for $p = 4$):
$$A = \begin{pmatrix} A_{00} & & & & A_{04} \\ & A_{11} & & & A_{14} \\ & & A_{22} & & A_{24} \\ & & & A_{33} & A_{34} \\ A_{40} & A_{41} & A_{42} & A_{43} & A_{44} \end{pmatrix}$$

Parallel H-LU factorisation on processor $i$:
1 factorise $A_{ii} = L_{ii} U_{ii}$, (seq. LU Fac.)
2 solve $A_{ip} = L_{ii} U_{ip}$ and $A_{pi} = L_{pi} U_{ii}$, (seq. Algo.)
3 compute and exchange $L_{pi} U_{ip}$, (log p steps)
4 update $A_{pp} = A_{pp} - \sum_i L_{pi} U_{ip}$, (seq. Matrix Mult.)
5 factorise $A_{pp} = L_{pp} U_{pp}$. (seq. LU Fac.)

For the complexity of the parallel H-LU factorisation in the model problem, we assume
• equal load of order $n/p$ per sub graph,
• size $n_V$ of the vertex separator is of optimal order $p^{1/d} n^{(d-1)/d}$.
Then one obtains:
$$O\!\left( \frac{n \log^2 n}{p} + p^{1/d}\, n^{(d-1)/d} \log^2 n \, \log p \right)$$
The speedup is limited by the size of the vertex separator, which increases with $p$.
[Figure: speedup vs. no. of processors (up to 30) for $n = 2047^2$ and $n = 102^3$]

Nested Dissection
Graph $G$ is hierarchically partitioned with local vertex separators:
$$A = \begin{pmatrix} A_{00} & & A_{02} \\ & A_{11} & A_{12} \\ A_{20} & A_{21} & A_{22} \end{pmatrix}$$

Parallel H-LU factorisation is based on the algorithm for direct domain decomposition with $p = 2$:
1 choose $i \in \{0, 1\}$ such that $A_{ii}$ is on the local processor;
2 factorise $A_{ii} = L_{ii} U_{ii}$, (Recursion)
3 solve $A_{i2} = L_{ii} U_{i2}$ and $A_{2i} = L_{2i} U_{ii}$, (parallel Matrix Mult.)
4 compute and exchange $L_{2i} U_{i2}$,
5 update $A_{22} = A_{22} - \sum_i L_{2i} U_{i2}$, (seq. Matrix Mult.)
6 factorise $A_{22} = L_{22} U_{22}$. (seq. LU Fac.)

Data distribution onto $P := \{1, \ldots, p\}$ processors follows the hierarchical decomposition during nested dissection:
• on level 0, all processors handle the matrix,
• on level 1, $P$ is split into two halves according to the graph bisection,
• recursively divide the processor set.
[Figure: processor tree for $p = 4$: {1, ..., 4} is split into {1, 2} and {3, 4}, which are split into {1}, {2}, {3}, {4}]

For processor $i$:
• only handle those matrices whose assigned processor set contains $i$,
• exchange data only with other processors in that set.

For the complexity of the parallel H-LU factorisation in the model problem, we again assume
• equal load of order $n/p$ per sub graph,
• minimal order w.r.t. dimension $d$ of the local vertex separators.
Then one obtains:
$$O\!\left( \frac{n \log^2 n}{p} + n^{(d-1)/d} \log^2 n \, \log p \right)$$
The speedup is now limited by the size $O(n^{(d-1)/d})$ of the first vertex separator and much less dependent on $p$.
[Figure: speedup vs. no. of processors (up to 30) for $n = 2895^2$ and $n = 128^3$]

Literature
L. Grasedyck, R. Kriemann and S. Le Borne, Domain Decomposition Based H-LU Preconditioning, to appear in "Numerische Mathematik".
L. Grasedyck, R. Kriemann and S. Le Borne, Parallel Black Box H-LU Preconditioning for Elliptic Boundary Value Problems, "Computing and Visualization in Science", 11(4-6), pp. 273–291, 2008.
G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, "SIAM Journal on Scientific Computing", 20(1), pp. 359–392, 1999.
C.M. Fiduccia and R.M. Mattheyses, A linear-time heuristic for improving network partitions, in "Proceedings of the 19th Design Automation Conference", pp. 175–181, IEEE, 1982.