Parallel Black Box H-LU Preconditioning Ronald Kriemann MPI MIS Leipzig Workshop “Complex Systems” METU 2009-05-14/15 H Lib pro Overview 1 H-Matrices 2 Algebraic Clustering 3 Algebraic Admissibility 4 Nested Dissection 5 Numerical Experiments Parallel Black Box H-LU Preconditioning 2/26 H-Matrices Parallel Black Box H-LU Preconditioning 3/26 H-Matrices Model Problem Uniformly elliptic 2nd order PDE div α(x) ∇u(x) = f (x), x∈Ω with Dirichlet/Neumann boundary conditions. Galerkin discretisation Aij = h∇ϕi , α∇ϕj iL2 (Ω) Ax = b, with basis functions ϕi : Ω → R, i ∈ I = {1, . . . , N } Goal: Solve the system fast and robust using LU factorisation of A as preconditioner. Problem: LU factors are usually prohibitively dense. Solution: Compute approximate LU factorisation using H-matrices with (almost) linear complexity. Parallel Black Box H-LU Preconditioning 4/26 H-Matrices What are H-Matrices? R A matrix M ∈ n×m of rank ≤ k can be represented as M = UV T , U∈ Rn×k , V ∈ Rm×k VT U I R(k)-matrix format Parallel Black Box H-LU Preconditioning 5/26 H-Matrices What are H-Matrices? R A matrix M ∈ n×m of rank ≤ k can be represented as M = UV T , U∈ Rn×k , V ∈ Rm×k VT U I R(k)-matrix format For a block-wise low-rank matrix M ∈ • each block is R(k)-matrix Rn×m • for small blocks: fullmatrix format I H-matrix format with hierarchically block organisation. Needed: reordering (clustering) of index sets to allow low-rank representation. Parallel Black Box H-LU Preconditioning 5/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: Matrix Construct block cluster tree Parallel Black Box H-LU Preconditioning 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: cluster Matrix Construct block cluster tree Parallel Black Box H-LU Preconditioning 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: cluster Matrix Construct block cluster tree Parallel Black Box H-LU Preconditioning 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: Matrix Construct block cluster tree Parallel Black Box H-LU Preconditioning 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: Ωt Ωs Matrix Construct block cluster tree with admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s), Parallel Black Box H-LU Preconditioning η>0 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: Ωt Ωs Matrix Construct block cluster tree with admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s), Parallel Black Box H-LU Preconditioning η>0 6/26 H-Matrices Clustering Domain Construct cluster tree using geometrical data: Matrix Construct block cluster tree with admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s), Parallel Black Box H-LU Preconditioning η>0 6/26 H-Matrices Matrix Structure • O (n) blocks Parallel Black Box H-LU Preconditioning 7/26 H-Matrices Matrix Structure • O (n) blocks • Small red blocks: full matrices Parallel Black Box H-LU Preconditioning 7/26 H-Matrices Matrix Structure • O (n) blocks • Small red blocks: full matrices • All other blocks: R(k)-matrices Parallel Black Box H-LU Preconditioning 7/26 H-Matrices 16 7 7 7 7 6 7 16 8 9 7 7 7 8 16 9 7 9 10 16 7 7 6 10 7 9 11 11 10 7 8 10 10 9 9 16 8 10 8 8 16 9 8 10 10 9 7 9 8 9 8 9 8 8 12 12 12 9 7 8 7 10 6 9 10 10 16 10 7 9 11 7 10 9 11 10 10 13 10 16 10 10 10 10 10 16 10 8 10 9 10 10 16 8 9 9 10 8 9 10 9 16 8 10 8 13 7 6 10 7 7 16 10 10 10 16 7 7 6 7 10 11 12 8 10 9 7 7 10 8 9 8 16 7 7 7 16 10 7 10 16 12 11 10 10 10 10 9 10 13 10 9 7 10 9 16 10 10 10 9 8 10 10 10 16 9 11 10 9 10 10 8 16 10 10 10 10 16 10 8 9 10 9 8 8 7 16 8 11 10 11 10 11 11 11 15 11 10 8 11 10 8 16 10 10 9 11 10 16 9 8 10 10 10 9 16 10 8 10 16 8 9 8 8 9 16 10 8 9 10 10 16 8 10 10 10 12 13 12 8 9 10 10 16 10 10 8 10 8 10 15 13 16 8 11 8 7 10 9 8 10 16 9 16 10 8 16 9 9 10 9 16 7 7 7 8 10 7 10 10 10 16 11 10 10 10 11 Parallel Black Box H-LU Preconditioning 8 11 exponential decay of singular values 7 13 16 9 7 6 8 7 10 16 9 8 10 9 6 7 9 12 8 10 9 10 11 6 13 16 10 10 9 9 7 11 9 12 11 10 16 8 10 8 16 9 7 10 9 10 10 7 7 8 7 11 16 10 10 10 10 8 10 16 10 10 10 8 9 10 10 10 16 10 8 9 10 10 10 16 8 12 16 13 10 8 10 10 9 10 13 11 8 8 9 16 7 9 10 10 10 10 11 10 9 7 8 10 10 7 7 12 12 10 8 9 9 10 8 7 11 12 12 10 9 12 12 7 13 12 9 10 9 10 11 12 11 11 10 10 9 8 13 11 10 10 8 12 16 11 10 9 11 10 11 12 13 11 11 10 8 10 10 12 10 10 16 10 10 10 11 • block-wise: 13 12 10 10 8 16 10 10 10 12 13 15 8 10 10 12 13 16 13 12 10 10 9 10 11 16 13 12 8 12 12 10 12 12 10 10 10 12 12 8 11 7 11 10 9 12 13 9 9 11 10 8 7 8 8 8 10 9 9 16 7 9 9 9 10 10 11 15 11 10 12 7 8 16 8 10 11 8 8 10 9 13 16 8 10 9 8 12 16 7 11 16 10 10 9 11 9 7 8 13 10 8 10 10 9 9 11 13 8 16 10 12 13 11 7 10 10 9 10 8 7 10 9 8 10 10 8 11 11 10 16 9 12 11 10 12 10 10 16 10 11 11 6 8 10 16 10 16 10 8 10 11 Matrix Structure 9 10 7 9 10 10 10 8 8 9 11 8 10 8 10 16 9 16 10 8 10 16 10 13 9 9 8 10 10 8 7 7 10 10 12 11 11 13 10 8 16 10 10 9 10 12 12 11 10 8 10 10 9 8 7 11 10 9 16 10 10 9 11 9 11 10 16 9 13 11 6 11 13 12 10 9 9 10 9 9 10 7 9 8 10 11 9 8 9 7 10 10 9 16 8 10 8 8 9 16 10 11 9 7 6 8 7 16 10 10 7 10 10 16 8 7 7 8 10 8 16 7 6 7 7 7 7 16 7/26 H-Matrices Arithmetic Due to hierarchical block structure, standard recursive block algorithms can be used, e.g. for multiplication: B11 B12 A11 · B11 + A12 · B21 A11 · B12 + A12 · B22 A11 A12 · = B21 B22 A21 · B11 + A22 · B21 A21 · B12 + A22 · B22 A21 A22 Parallel Black Box H-LU Preconditioning 8/26 H-Matrices Arithmetic Due to hierarchical block structure, standard recursive block algorithms can be used, e.g. for multiplication: B11 B12 A11 · B11 + A12 · B21 A11 · B12 + A12 · B22 A11 A12 · = B21 B22 A21 · B11 + A22 · B21 A21 · B12 + A22 · B22 A21 A22 But, addition of low-rank matrices increases the rank and finally produces full-rank matrices. To limit complexity, a truncated addition is performed using SVD: A1 B1T + A2 B2T =: CDT → U SV T → C 0 D0T with (predefined) rank(C 0 D0T ) < rank(CDT ) . Parallel Black Box H-LU Preconditioning 8/26 H-Matrices Arithmetic Due to hierarchical block structure, standard recursive block algorithms can be used, e.g. for multiplication: B11 B12 A11 · B11 + A12 · B21 A11 · B12 + A12 · B22 A11 A12 · = B21 B22 A21 · B11 + A22 · B21 A21 · B12 + A22 · B22 A21 A22 But, addition of low-rank matrices increases the rank and finally produces full-rank matrices. To limit complexity, a truncated addition is performed using SVD: A1 B1T + A2 B2T =: CDT → U SV T → C 0 D0T with (predefined) rank(C 0 D0T ) < rank(CDT ) . Complexity truncation storage matrix × vector addition O (n) O (n log n) O (n log n) O (n log n) Parallel Black Box H-LU Preconditioning multiplication inversion triangular solve LU decomposition O O O O n log2 n n log2 n n log2 n n log2 n 8/26 H-Matrices Summary To solve Ax = b using H-LU factorisation: 1 construct cluster tree using geometrical data, 2 construct block cluster tree using admissibility condition (based on geometrical data), 3 build H-matrix representation of A, 4 perform H-LU factorisation (with approximation due to truncated addition), 5 solve Ax = b preconditioned with H-LU approximated A−1 . Parallel Black Box H-LU Preconditioning 9/26 H-Matrices Summary To solve Ax = b using H-LU factorisation: 1 construct cluster tree using geometrical data, 2 construct block cluster tree using admissibility condition (based on geometrical data), 3 build H-matrix representation of A, 4 perform H-LU factorisation (with approximation due to truncated addition), 5 solve Ax = b preconditioned with H-LU approximated A−1 . But what to do if no geometry information is available? Parallel Black Box H-LU Preconditioning 9/26 Algebraic Clustering Parallel Black Box H-LU Preconditioning 10/26 Algebraic Clustering Motivation Consider −∆u = 0 in Ω = [0, 1]2 Using a uniform grid with step width h and standard piecewise linear finite elements with nodal points xi , i ∈ I, one obtains the stiffness matrix A ∈ I×I as R Parallel Black Box H-LU Preconditioning 11/26 Algebraic Clustering Motivation Consider −∆u = 0 in Ω = [0, 1]2 Using a uniform grid with step width h and standard piecewise linear finite elements with nodal points xi , i ∈ I, one obtains the stiffness matrix A ∈ I×I as R Define the matrix graph G(A) = (VA , EA ) of A as VA := I, EA := {(i, j) : i 6= j ∧ aij 6= 0}, i.e. graph corresponds to sparsity pattern of stiffness matrix. Parallel Black Box H-LU Preconditioning 11/26 Algebraic Clustering Motivation Consider −∆u = 0 in Ω = [0, 1]2 Using a uniform grid with step width h and standard piecewise linear finite elements with nodal points xi , i ∈ I, one obtains the stiffness matrix A ∈ I×I as R Define the matrix graph G(A) = (VA , EA ) of A as VA := I, EA := {(i, j) : i 6= j ∧ aij 6= 0}, i.e. graph corresponds to sparsity pattern of stiffness matrix. Parallel Black Box H-LU Preconditioning 11/26 Algebraic Clustering Motivation Define distance distG (i, j) between nodes i, j ∈ I as length of shortest path in G(A). Then, for i, j ∈ I we have: kxi − xj k2 ≤ distG (i, j)h, i.e. distance in k R2 is mapped to distance in G(A): j kxi − xj k2 = kxi − xk k2 = √ √ 13h, distG (i, j) = 5 5h, distG (i, k) = 3 i Parallel Black Box H-LU Preconditioning 12/26 Algebraic Clustering Motivation Define distance distG (i, j) between nodes i, j ∈ I as length of shortest path in G(A). Then, for i, j ∈ I we have: kxi − xj k2 ≤ distG (i, j)h, i.e. distance in k R2 is mapped to distance in G(A): j kxi − xj k2 = kxi − xk k2 = √ √ 13h, distG (i, j) = 5 5h, distG (i, k) = 3 i In model problem: since nodes in G(A) with small distance are also geometrically neighboured, one can use graph distance to cluster indices. Parallel Black Box H-LU Preconditioning 12/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via Breadth First Search Algorithm: 1 determine two nodes i, j ∈ VA with (almost) maximal distance, 2 perform simultaneous BFS from i and j to construct sub clusters: • per step, add unvisited neighbours of nodes in sub clusters 3 recurse in sub graphs Parallel Black Box H-LU Preconditioning 13/26 Algebraic Clustering Clustering via General Graph Partitioning In graph theory, the graph partitioning problem is defined as: Given a graph G = (V, E) a partitioning P = {V1 , V2 }, with V1 ∩ V2 = ∅ and V = V1 ∪ V2 , of V is sought, such that #V1 ∼ #V2 and #{(i, j) ∈ E : i ∈ V1 ∧ j ∈ V2 } = min . (edge-cut) A small edge-cut corresponds to a low-rank coupling of matrix blocks. Parallel Black Box H-LU Preconditioning 14/26 Algebraic Clustering Clustering via General Graph Partitioning In graph theory, the graph partitioning problem is defined as: Given a graph G = (V, E) a partitioning P = {V1 , V2 }, with V1 ∩ V2 = ∅ and V = V1 ∪ V2 , of V is sought, such that #V1 ∼ #V2 and #{(i, j) ∈ E : i ∈ V1 ∧ j ∈ V2 } = min . (edge-cut) A small edge-cut corresponds to a low-rank coupling of matrix blocks. Although the graph partitioning problem is NP-hard good approximation algorithms exist, e.g. multilevel or spectral methods. Furthermore, they are available in open source packages, e.g. METIS, Chaco or Scotch. Parallel Black Box H-LU Preconditioning 14/26 Algebraic Admissibility Parallel Black Box H-LU Preconditioning 15/26 Algebraic Admissibility Definition To apply the standard admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s) for a block cluster (t, s) ∈ V × V , one needs to define distance and diameter of clusters in a graph. Parallel Black Box H-LU Preconditioning 16/26 Algebraic Admissibility Definition To apply the standard admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s) for a block cluster (t, s) ∈ V × V , one needs to define distance and diameter of clusters in a graph. • For V1 , V2 ⊂ V , the distance between V1 and V2 is defined as distG (V1 , V2 ) := min i∈V1 ,j∈V2 distG (i, j). • The diameter of a sub graph induced by V 0 ⊆ V is defined as diamG (V 0 ) := max0 distG (i, j). i,j∈V Parallel Black Box H-LU Preconditioning 16/26 Algebraic Admissibility Definition To apply the standard admissibility condition min(diam(t), diam(s)) ≤ η dist(t, s) for a block cluster (t, s) ∈ V × V , one needs to define distance and diameter of clusters in a graph. • For V1 , V2 ⊂ V , the distance between V1 and V2 is defined as distG (V1 , V2 ) := min i∈V1 ,j∈V2 distG (i, j). • The diameter of a sub graph induced by V 0 ⊆ V is defined as diamG (V 0 ) := max0 distG (i, j). i,j∈V Problem: diameter and distance in G costs O n2 . Parallel Black Box H-LU Preconditioning 16/26 Algebraic Admissibility Testing Solution: approximate cluster diameter and construct cluster surrounding ensuring admissibility. For testing admissibility of block cluster (t, s) ∈ V × V • choose i ∈ t and compute j ∈ t with distG (i, j) = max, s0 s ] • diamG (t) ≤ 2 distG (i, j) =: diam, • build surrounding t̃ around t with layers, 1] η diam t • if t̃ ∩ s = ∅ then (t, s) is admissible. Parallel Black Box H-LU Preconditioning 17/26 Algebraic Admissibility Testing Solution: approximate cluster diameter and construct cluster surrounding ensuring admissibility. For testing admissibility of block cluster (t, s) ∈ V × V • choose i ∈ t and compute j ∈ t with distG (i, j) = max, s0 s ] • diamG (t) ≤ 2 distG (i, j) =: diam, • build surrounding t̃ around t with layers, 1] η diam t • if t̃ ∩ s = ∅ then (t, s) is admissible. Parallel Black Box H-LU Preconditioning 17/26 Algebraic Admissibility Testing Solution: approximate cluster diameter and construct cluster surrounding ensuring admissibility. For testing admissibility of block cluster (t, s) ∈ V × V • choose i ∈ t and compute j ∈ t with distG (i, j) = max, s0 s ] • diamG (t) ≤ 2 distG (i, j) =: diam, • build surrounding t̃ around t with layers, 1] η diam t • if t̃ ∩ s = ∅ then (t, s) is admissible. Parallel Black Box H-LU Preconditioning 17/26 Algebraic Admissibility Testing Solution: approximate cluster diameter and construct cluster surrounding ensuring admissibility. For testing admissibility of block cluster (t, s) ∈ V × V • choose i ∈ t and compute j ∈ t with distG (i, j) = max, s0 s ] • diamG (t) ≤ 2 distG (i, j) =: diam, • build surrounding t̃ around t with layers, 1] η diam t • if t̃ ∩ s = ∅ then (t, s) is admissible. Parallel Black Box H-LU Preconditioning 17/26 Algebraic Admissibility Testing Solution: approximate cluster diameter and construct cluster surrounding ensuring admissibility. For testing admissibility of block cluster (t, s) ∈ V × V • choose i ∈ t and compute j ∈ t with distG (i, j) = max, s0 s ] • diamG (t) ≤ 2 distG (i, j) =: diam, • build surrounding t̃ around t with layers, 1] η diam t • if t̃ ∩ s = ∅ then (t, s) is admissible. With usual FEM sparsity patterns, this procedure has complexity O (#t) . Parallel Black Box H-LU Preconditioning 17/26 Nested Dissection Parallel Black Box H-LU Preconditioning 18/26 Nested Dissection Vertex Separator In nested dissection the two constructed sub graphs of a partition have to be separated by a (minimal) vertex separator. Matrix graph: Matrix: Parallel Black Box H-LU Preconditioning 19/26 Nested Dissection Vertex Separator In nested dissection the two constructed sub graphs of a partition have to be separated by a (minimal) vertex separator. Matrix graph: Matrix: Parallel Black Box H-LU Preconditioning 19/26 Nested Dissection Vertex Separator In nested dissection the two constructed sub graphs of a partition have to be separated by a (minimal) vertex separator. Matrix graph: Matrix: Advantages of Nested Dissection • zero blocks do not fill up during H-LU factorisation, • blocks can be computed in parallel. Parallel Black Box H-LU Preconditioning 19/26 Nested Dissection Cluster Tree for the Vertex Separator A vertex separator can be obtained by computing a vertex cover of the edge-cut between both node sets in a partition. But for H-matrices the vertex separator has to be further partitioned to form a cluster tree. Parallel Black Box H-LU Preconditioning 20/26 Nested Dissection Cluster Tree for the Vertex Separator A vertex separator can be obtained by computing a vertex cover of the edge-cut between both node sets in a partition. But for H-matrices the vertex separator has to be further partitioned to form a cluster tree. Problem: restricting G to nodes in vertex separator V might remove important edges, e.g. G|V Parallel Black Box H-LU Preconditioning 20/26 Nested Dissection Cluster Tree for the Vertex Separator A vertex separator can be obtained by computing a vertex cover of the edge-cut between both node sets in a partition. But for H-matrices the vertex separator has to be further partitioned to form a cluster tree. Problem: restricting G to nodes in vertex separator V might remove important edges, e.g. Solution: modify previous BFS based algorithm to perform partitioning in a surrounding of the vertex separator. Parallel Black Box H-LU Preconditioning 20/26 Numerical Experiments Parallel Black Box H-LU Preconditioning 21/26 Numerical Experiments Comparison with Geom. Clustering Solving model problem: N 2532 3582 5112 7292 10232 403 513 643 813 1023 Geometric Time (s) Mem (MB) 0.9 51 1.9 86 4.5 212 9.6 371 20.2 878 12.6 99 46.9 300 117.4 592 269.8 1410 752.3 3020 Algebraic Time (s) Mem (MB) 1.3 47 2.9 94 6.5 198 15.0 402 31.6 819 32.7 135 97.6 323 289.1 719 804.3 1570 1907.3 3370 Accuracy of H-arithmetic chosen such that kI − (LH UH )−1 Ak2 ≤ 10−4 Parallel Black Box H-LU Preconditioning 22/26 Numerical Experiments Comparison with Direct Solvers Solving in Ω = [0, 1]2 −∆u + λu = f 700 Time for setup in sec. 600 500 H-Matrix Pardiso MUMPS UMFPACK SuperLU Spooles 400 300 200 100 0 5.0 . 105 1.0 . 106 1.5 . 106 2.0 . 106 No. of Unknowns Parallel Black Box H-LU Preconditioning 23/26 Numerical Experiments Comparison with Direct Solvers Solving in Ω = [0, 1]3 −∆u + λu = f H-Matrix Pardiso MUMPS UMFPACK SuperLU Spooles 18000 Time for setup in sec. 16000 14000 12000 10000 8000 6000 4000 2000 0 5.0 . 105 1.0 . 106 1.5 . 106 2.0 . 106 No. of Unknowns Parallel Black Box H-LU Preconditioning 24/26 Numerical Experiments Parallel Performance Parallel speedup for algebraic H-LU factorisation in 30 R2 and R3. 2 n = 2895 n = 1283 Speedup 25 20 15 10 5 5 10 15 20 25 30 No. of Processors Parallel Black Box H-LU Preconditioning 25/26 Literature L. Grasedyck, R. Kriemann and S. Le Borne, Domain Decomposition Based H-LU Preconditioning, to appear in “Numerische Mathematik”. L. Grasedyck, R. Kriemann and S. Le Borne, Parallel Black Box H-LU Preconditioning for Elliptic Boundary Value Problems, “Computing and Visualization in Science”, 11(4-6), pp. 273–291, 2008. L. Grasedyck, W. Hackbusch and R. Kriemann, Performance of H-LU Preconditioning for Sparse Matrices, to appear in “Computational Methods in Applied Mathematics”. pro H-Lib http://www.hlibpro.org H Lib pro Parallel Black Box H-LU Preconditioning 26/26