• Day 1: Overview
• Day 2: Direct methods
• Day 3: Iterative methods
• The conjugate gradient algorithm
• Parallel conjugate gradient and graph partitioning
• Preconditioning methods and graph coloring
• Domain decomposition and multigrid
• Krylov subspace methods for other problems
• Complexity of iterative and direct methods
SuperLU-dist: iterative refinement to improve solution
Iterate:
• r = b – A*x
• backerr = max_i ( |r_i| / (|A|*|x| + |b|)_i )
• if backerr < ε or backerr > lasterr/2 then stop iterating
• solve L*U*dx = r
• x = x + dx
• lasterr = backerr
• repeat
Usually 0 – 3 steps are enough
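The loop above translates almost line for line into a few sparse-solver calls. A minimal Python/SciPy sketch follows; the function name and the use of SciPy's splu in place of the SuperLU-dist factors are my own choices, not part of the slides.

    import numpy as np
    import scipy.sparse.linalg as spla

    def iterative_refinement(A, b, eps=1e-15, maxsteps=5):
        # A: scipy sparse matrix, b: dense right-hand side.
        lu = spla.splu(A.tocsc())          # stand-in for the (possibly low-precision) L, U factors
        x = lu.solve(b)
        absA = abs(A)                      # |A|, used componentwise in the backward-error test
        lasterr = np.inf
        for _ in range(maxsteps):
            r = b - A @ x                              # r = b - A*x
            backerr = np.max(np.abs(r) / (absA @ np.abs(x) + np.abs(b)))
            if backerr < eps or backerr > lasterr / 2: # converged, or no longer improving
                break
            dx = lu.solve(r)                           # solve L*U*dx = r
            x = x + dx
            lasterr = backerr
        return x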
Convergence analysis of iterative refinement
Let C = I – A(LU)^-1   [ so A = (I – C)·(LU) ]

x_1 = (LU)^-1 b
r_1 = b – A·x_1 = (I – A(LU)^-1) b = C b
dx_1 = (LU)^-1 r_1 = (LU)^-1 C b
x_2 = x_1 + dx_1 = (LU)^-1 (I + C) b
r_2 = b – A·x_2 = (I – (I – C)·(I + C)) b = C^2 b
. . .
In general, r_k = b – A·x_k = C^k b
Thus r_k → 0 if |largest eigenvalue of C| < 1.
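A tiny dense check of the recurrence r_k = C^k b, assuming the inexact factorization is modeled by rounding U to single precision (any perturbation of the factors would do):

    import numpy as np
    import scipy.linalg as la

    rng = np.random.default_rng(0)
    n = 6
    A = rng.standard_normal((n, n)) + n * np.eye(n)       # well-conditioned test matrix
    b = rng.standard_normal(n)

    P, L, U = la.lu(A)                                    # A = P L U exactly
    LU = P @ L @ U.astype(np.float32).astype(np.float64)  # perturbed product, so LU is close to A but not equal
    C = np.eye(n) - A @ np.linalg.inv(LU)

    x = np.linalg.solve(LU, b)                            # x_1
    for k in range(1, 4):
        r = b - A @ x                                     # r_k
        assert np.allclose(r, np.linalg.matrix_power(C, k) @ b)
        x = x + np.linalg.solve(LU, r)                    # x_{k+1} = x_k + dx_k

    print(max(abs(np.linalg.eigvals(C))))                 # spectral radius < 1, so refinement converges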
The Landscape of Sparse Ax=b Solvers

                         Nonsymmetric        Symmetric positive definite
  Direct (A = LU)        Pivoting LU         Cholesky
  Iterative (y' = Ay)    GMRES, QMR, …       Conjugate gradient

  (Direct methods are more robust; iterative methods use less storage.
   The symmetric positive definite column is more robust; the nonsymmetric column is more general.)
Conjugate gradient iteration

  x_0 = 0,  r_0 = b,  p_0 = r_0
  for k = 1, 2, 3, . . .
      α_k = (r_{k-1}^T r_{k-1}) / (p_{k-1}^T A p_{k-1})    step length
      x_k = x_{k-1} + α_k p_{k-1}                          approx solution
      r_k = r_{k-1} – α_k A p_{k-1}                        residual
      β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              improvement
      p_k = r_k + β_k p_{k-1}                              search direction
• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage
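A direct transcription into Python/NumPy (a sketch; in practice use scipy.sparse.linalg.cg). Note that x, r, p, and Ap are exactly the four n-vectors of working storage mentioned above.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, maxiter=None):
        n = len(b)
        x = np.zeros(n)                      # x_0 = 0
        r = b.astype(float)                  # r_0 = b
        p = r.copy()                         # p_0 = r_0
        rr = r @ r
        for _ in range(maxiter or n):
            Ap = A @ p                       # the one matrix-vector product per iteration
            alpha = rr / (p @ Ap)            # step length
            x += alpha * p                   # approximate solution
            r -= alpha * Ap                  # residual
            rr_new = r @ r                   # one of the two dot products (the other is p @ Ap)
            if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p        # improvement factor beta_k, new search direction
            rr = rr_new
        return x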
Conjugate gradient: Krylov subspaces
• Eigenvalues:  Au = λu,  { λ_1, λ_2, . . ., λ_n }
• Cayley-Hamilton theorem:  (A – λ_1 I)·(A – λ_2 I) · · · (A – λ_n I) = 0
  Therefore Σ_{0 ≤ i ≤ n} c_i A^i = 0 for some coefficients c_i, so
  A^-1 = Σ_{1 ≤ i ≤ n} (–c_i / c_0) A^{i-1}
• Krylov subspace:  Therefore if Ax = b, then x = A^-1 b and
  x ∈ span(b, Ab, A^2 b, . . ., A^{n-1} b) = K_n(A, b)
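A tiny numeric check of this argument: the coefficients below come from np.poly (highest power first, so c[n] plays the role of c_0 above), and the loop accumulates the polynomial in A applied to b.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 5
    A = rng.standard_normal((n, n)) + n * np.eye(n)
    b = rng.standard_normal(n)

    c = np.poly(A)                # characteristic polynomial coefficients, c[0] = 1, c[n] = constant term
    x = np.zeros(n)
    v = b.copy()                  # v runs through b, Ab, A^2 b, ...
    for i in range(n):
        x += (-c[n - 1 - i] / c[n]) * v
        v = A @ v
    assert np.allclose(A @ x, b)  # so x = A^-1 b lies in span(b, Ab, ..., A^(n-1) b)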
• Krylov subspace:  K_i(A, b) = span(b, Ab, A^2 b, . . ., A^{i-1} b)
• Conjugate gradient algorithm:  for i = 1, 2, 3, . . .
  find x_i ∈ K_i(A, b) such that r_i = (Ax_i – b) ⊥ K_i(A, b)
• Notice r_i ∈ K_{i+1}(A, b), so r_i ⊥ r_j for all j < i
• Similarly, the “directions” are A-orthogonal:  (x_i – x_{i-1})^T · A · (x_j – x_{j-1}) = 0
• The magic: short recurrences. . .
  A is symmetric => can get next residual and direction from the previous one, without saving them all.
• In exact arithmetic, CG converges in n steps (completely unrealistic!!)
• Accuracy after k steps of CG is related to:
• consider polynomials of degree k that are equal to 1 at 0.
• how small can such a polynomial be at all the eigenvalues of A?
• Thus, eigenvalues close together are good.
• Condition number:  κ(A) = ||A||_2 ||A^-1||_2 = λ_max(A) / λ_min(A)
• Residual is reduced by a constant factor by O(κ^1/2(A)) iterations of CG.
[Figures: CG convergence on grid5(15) and on bcsstk08; n steps of CG on bcsstk08]
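For the model problem this is easy to check numerically. A sketch, assuming grid5(15) refers to the 5-point Poisson matrix on a 15-by-15 grid:

    import numpy as np
    import scipy.sparse as sp

    k = 15
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(k, k))
    A = sp.kron(sp.eye(k), T) + sp.kron(T, sp.eye(k))    # 5-point Laplacian, n = 225

    lam = np.linalg.eigvalsh(A.toarray())                # small enough to do densely
    kappa = lam[-1] / lam[0]
    print(kappa, np.sqrt(kappa))                         # CG needs O(sqrt(kappa)) iterations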
Conjugate gradient: Parallel implementation
• Lay out matrix and vectors by rows
• Hard part is matrix-vector product y = A*x
• Algorithm
Each processor j:
Broadcast x(j)
Compute y(j) = A(j,:)*x
• May send more of x than needed
• Partition / reorder matrix to reduce communication
[Figure: block-row layout of y, A, and x across processors P0-P3]
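A sketch of the row-layout matrix-vector product with MPI, assuming mpi4py and that each rank already owns its block of rows A_local and the matching slice x_local (the broadcast of x(j) becomes an allgather):

    from mpi4py import MPI
    import numpy as np

    def parallel_matvec(A_local, x_local, comm=MPI.COMM_WORLD):
        # Every rank contributes its piece of x; afterwards each rank holds all of x.
        x_full = np.concatenate(comm.allgather(x_local))
        # Each rank multiplies only its own rows: y(j) = A(j,:)*x
        return A_local @ x_full

Gathering all of x is the "may send more of x than needed" point above; partitioning the matrix well reduces how much of x each processor actually has to receive.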
[Figures: 2-way partition of the eppstein mesh; 8-way dice of the eppstein mesh]
Preconditioners
• Suppose you had a matrix B such that:
  1. the condition number κ(B^-1 A) is small
  2. By = z is easy to solve
• Then you could solve (B^-1 A)x = B^-1 b instead of Ax = b
• B = A is great for (1), not for (2)
• B = I is great for (2), not for (1)
• Domain-specific approximations sometimes work
• B = diagonal of A sometimes works
• Or, bring back the direct methods technology. . .
[Figure: CG convergence on bcsstk08 with diagonal preconditioner]
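With B = diag(A), applying B^-1 is just a componentwise divide, so it drops straight into SciPy's CG as a LinearOperator. A sketch (the Poisson test matrix here stands in for bcsstk08):

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    k = 50
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(k, k))
    A = (sp.kron(sp.eye(k), T) + sp.kron(T, sp.eye(k))).tocsr()
    b = np.ones(A.shape[0])

    d = A.diagonal()
    M = spla.LinearOperator(A.shape, matvec=lambda z: z / d)   # acts as B^-1 = diag(A)^-1
    x, info = spla.cg(A, b, M=M)                               # info == 0 on convergence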
Incomplete Cholesky factorization (IC, ILU)
[Figure: A ≈ R^T · R]
• Compute factors of A by Gaussian elimination, but ignore fill
• Preconditioner B = R^T R ≈ A, not formed explicitly
• Compute B^-1 z by triangular solves (in time nnz(A))
• Total storage is O(nnz(A)), static data structure
• Either symmetric (IC) or nonsymmetric (ILU)
[Figure: CG convergence on bcsstk08 with IC preconditioner]
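SciPy exposes the nonsymmetric variant (ILU with a drop tolerance and a fill limit) through spilu; a sketch of using it as the preconditioner B, paired here with GMRES since SciPy has no incomplete Cholesky for the symmetric case:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    k = 50
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(k, k))
    A = (sp.kron(sp.eye(k), T) + sp.kron(T, sp.eye(k))).tocsc()
    b = np.ones(A.shape[0])

    ilu = spla.spilu(A, drop_tol=1e-3, fill_factor=5)       # approximate factors, limited fill
    M = spla.LinearOperator(A.shape, matvec=ilu.solve)      # B^-1 z via two triangular solves
    x, info = spla.gmres(A, b, M=M, restart=30)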
Incomplete Cholesky and ILU: Variants
• Allow one or more “levels of fill”
• unpredictable storage requirements
• Allow fill whose magnitude exceeds a “drop tolerance”
• may get better approximate factors than levels of fill
• unpredictable storage requirements
• choice of tolerance is ad hoc
• Partial pivoting (for nonsymmetric A)
• “Modified ILU” (MIC): Add dropped fill to diagonal of U or R
• A and R^T R have same row sums
• good in some PDE contexts
Incomplete Cholesky and ILU: Issues
• Choice of parameters
• good: smooth transition from iterative to direct methods
• bad: very ad hoc, problem-dependent
• tradeoff: time per iteration (more fill => more time) vs # of iterations (more fill => fewer iters)
• Effectiveness
• condition number usually improves (only) by constant factor
(except MIC for some problems from PDEs)
• still, often good when tuned for a particular class of problems
• Parallelism
• Triangular solves are not very parallel
• Reordering for parallel triangular solve by graph coloring
[Figure: 2-coloring of grid5(15)]
Sparse approximate inverses
[Figure: A and its sparse approximate inverse B^-1]
• Compute B^-1 ≈ A^-1 explicitly
• Minimize || B^-1 A – I ||_F   (in parallel, by columns)
• Variants: factored form of B^-1, more fill, . . .
• Good: very parallel
• Bad: effectiveness varies widely
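A toy serial sketch of the column-by-column least-squares idea: column j of M minimizes ||A m_j - e_j||_2 subject to m_j having the sparsity of column j of A. The function name and the choice of pattern are assumptions; real SPAI codes adapt the pattern, restrict the rows, and solve the n small problems in parallel.

    import numpy as np
    import scipy.sparse as sp

    def spai(A):
        A = sp.csc_matrix(A)
        n = A.shape[0]
        Ad = A.toarray()                       # dense only to keep this toy version short
        rows, cols, vals = [], [], []
        for j in range(n):                     # each column is an independent least-squares problem
            J = A[:, j].nonzero()[0]           # allowed nonzero positions in column j of M
            ej = np.zeros(n); ej[j] = 1.0
            mj, *_ = np.linalg.lstsq(Ad[:, J], ej, rcond=None)
            rows.extend(J); cols.extend([j] * len(J)); vals.extend(mj)
        return sp.csc_matrix((vals, (rows, cols)), shape=(n, n))   # M, applied as B^-1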
Support graph preconditioners: example
[Vaidya]
[Figure: G(A) and its maximum-weight spanning tree G(B)]
• A is symmetric positive definite with negative off-diagonal nzs
• B is a maximum-weight spanning tree for A
(with diagonal modified to preserve row sums)
• factor B in O(n) space and O(n) time
• applying the preconditioner costs O(n) time per iteration
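A sketch of the construction, assuming SciPy's csgraph and a matrix whose off-diagonals are all negative; the shift trick turns SciPy's minimum spanning tree into the maximum-weight tree we want.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import minimum_spanning_tree

    def vaidya_tree(A):
        A = sp.csr_matrix(A)
        W = -sp.triu(A, k=1).tocsr()              # edge weights |a_ij| > 0, upper triangle
        S = W.copy()
        S.data = (W.data.max() + 1.0) - S.data    # smaller shifted weight == heavier original edge
        T = minimum_spanning_tree(S)              # so this picks the maximum-weight spanning tree
        keep = (T != 0).astype(float)             # pattern of the kept tree edges
        off = -W.multiply(keep)                   # original (negative) values on those edges
        off = off + off.T
        d = np.asarray(A.sum(axis=1)).ravel() - np.asarray(off.sum(axis=1)).ravel()
        return off + sp.diags(d)                  # B: a tree, with the same row sums as A

Factoring B costs O(n) time and space because eliminating a tree leaf-first creates no fill.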
Support graph preconditioners: example
[Figure: G(A) and G(B)]
• support each edge of A by a path in B
• dilation( A edge ) = length of supporting path in B
• congestion( B edge ) = # of supported A edges
• p = max congestion, q = max dilation
• condition number κ(B^-1 A) bounded by p·q (at most O(n^2))
Support graph preconditioners: example
[Figure: G(A) and G(B)]
• can improve congestion and dilation by adding a few strategically chosen edges to B
• cost of factor+solve is O(n^1.75), or O(n^1.2) if A is planar
• in recent experiments [Chen & Toledo], often better than drop-tolerance MIC for 2D problems, but not for 3D.
Domain decomposition (introduction)
        [ B    0    E ]
    A = [ 0    C    F ]
        [ E^T  F^T  G ]

• Partition the problem (e.g. the mesh) into subdomains
• Use solvers for the subdomains B^-1 and C^-1 to precondition an iterative solver on the interface
• Interface system is the Schur complement:  S = G – E^T B^-1 E – F^T C^-1 F
• Parallelizes naturally by subdomains
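A dense toy version of the block elimination, assuming the unknowns are ordered [interior of subdomain 1, interior of subdomain 2, interface]; it forms S explicitly, whereas real domain-decomposition codes only apply S to vectors inside an iterative solver.

    import numpy as np

    rng = np.random.default_rng(2)
    nb, nc, ng = 4, 4, 2
    B, C, G = 4 * np.eye(nb), 4 * np.eye(nc), 8 * np.eye(ng)
    E, F = rng.standard_normal((nb, ng)), rng.standard_normal((nc, ng))
    A = np.block([[B, np.zeros((nb, nc)), E],
                  [np.zeros((nc, nb)), C, F],
                  [E.T, F.T, G]])
    b = rng.standard_normal(nb + nc + ng)
    b1, b2, bg = b[:nb], b[nb:nb + nc], b[nb + nc:]

    S = G - E.T @ np.linalg.solve(B, E) - F.T @ np.linalg.solve(C, F)     # Schur complement
    xg = np.linalg.solve(S, bg - E.T @ np.linalg.solve(B, b1) - F.T @ np.linalg.solve(C, b2))
    x1 = np.linalg.solve(B, b1 - E @ xg)                                  # back-substitute in subdomain 1
    x2 = np.linalg.solve(C, b2 - F @ xg)                                  # ... and in subdomain 2
    assert np.allclose(A @ np.concatenate([x1, x2, xg]), b)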
[Figure: grid and matrix structure for an overlapping 2-way partition of the eppstein mesh]
Multigrid (introduction)
• For a PDE on a fine mesh, precondition using a solution on a coarser mesh
• Use idea recursively on hierarchy of meshes
• Solves the model problem (Poisson’s eqn) in linear time!
• Often useful when hierarchy of meshes can be built
• Hard to parallelize coarse meshes well
• This is just the intuition – lots of theory and technology
• Nonsymmetric linear systems:
• GMRES: for i = 1, 2, 3, . . .
  find x_i ∈ K_i(A, b) such that r_i = (Ax_i – b) ⊥ K_i(A, b)
  But, no short recurrence => save old vectors => lots more space
  (see the sketch after this list)
• BiCGStab, QMR, etc.:
  Two spaces K_i(A, b) and K_i(A^T, b) w/ mutually orthogonal bases
  Short recurrences => O(n) space, but less robust
• Convergence and preconditioning more delicate than CG
• Active area of current research
• Eigenvalues: Lanczos (symmetric), Arnoldi (nonsymmetric)
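A short SciPy sketch for the nonsymmetric case: restarted GMRES with an ILU preconditioner on an illustrative, made-up nonsymmetric tridiagonal matrix.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    n = 400
    A = sp.diags([-1.2, 4.0, -0.8], [-1, 0, 1], shape=(n, n), format='csc')   # nonsymmetric
    b = np.ones(n)

    M = spla.LinearOperator(A.shape, matvec=spla.spilu(A, drop_tol=1e-4).solve)
    x, info = spla.gmres(A, b, M=M, restart=30)    # info == 0 on convergence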
The Landscape of Sparse Ax=b Solvers

                         Nonsymmetric        Symmetric positive definite
  Direct (A = LU)        Pivoting LU         Cholesky
  Iterative (y' = Ay)    GMRES, QMR, …       Conjugate gradient

  (Direct methods are more robust; iterative methods use less storage.
   The symmetric positive definite column is more robust; the nonsymmetric column is more general.)
Complexity of direct methods

Time and space to solve any problem on any well-shaped finite element mesh:

                  2D            3D
  Space (fill):   O(n log n)    O(n^4/3)
  Time (flops):   O(n^3/2)      O(n^2)
Complexity of linear solvers

Time to solve model problem (Poisson's equation) on a regular mesh:

                          2D            3D
  Sparse Cholesky:        O(n^1.5)      O(n^2)
  CG, exact arithmetic:   O(n^2)        O(n^2)
  CG, no precond:         O(n^1.5)      O(n^1.33)
  CG, modified IC:        O(n^1.25)     O(n^1.17)
  CG, support trees:      O(n^1.20)     O(n^1.75)
  Multigrid:              O(n)          O(n)