Communication-Avoiding Krylov Subspace Methods in Finite Precision

Erin Carson, UC Berkeley
Advisor: James Demmel
ICME LA/Opt Seminar, Stanford University
December 4, 2014
Model problem: 2D Poisson on a $256^2$ grid, $\kappa(A) = 3.9 \cdot 10^4$, $\|A\|_2 = 1.99$. Equilibration (diagonal scaling) used. RHS set such that the elements of the true solution are $x_i = 1/256$.

Roundoff error can cause a decrease in the attainable accuracy of Krylov subspace methods – this limits their applicability in practice! The effect can be worse for "communication-avoiding" ($s$-step) Krylov subspace methods.

We can combine the inexpensive residual replacement strategy of van der Vorst and Ye (1999) with communication-avoiding techniques, extending it to the $s$-step variants: attainable accuracy can be improved for minimal performance cost, with a convergence rate closer to that of the classical method!

[Plots: true and updated residual convergence for CG and CA-CG variants on the model problem, with and without residual replacement.]
What is communication?
• Algorithms have two costs: communication and computation
• Communication: moving data between levels of the memory hierarchy (sequential), or between processors (parallel)
[Figure: data movement through the memory hierarchy (sequential) and between processors (parallel)]
• On modern computers, communication is expensive, computation is cheap
– Flop time << 1/bandwidth << latency
– The communication bottleneck is a barrier to achieving scalability
– Also expensive in terms of energy cost!
• Towards exascale: we must redesign algorithms to avoid communication
Communication bottleneck in KSMs
• A Krylov subspace method is a projection process onto the Krylov subspace
$\mathcal{K}_m(A, r_0) = \mathrm{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{m-1} r_0\}$
• Used for linear systems, eigenvalue problems, singular value problems, least squares, etc.
• Best for: $A$ large and very sparse, stored implicitly, or only an approximation needed

Projection process in each iteration:
1. Add a dimension to $\mathcal{K}_m$
  • Sparse matrix-vector multiplication (SpMV)
  • Parallel: communicate vector entries with neighbors
  • Sequential: read $A$/vectors from slow memory
2. Orthogonalize (with respect to some $\mathcal{L}_m$)
  • Inner products
  • Parallel: global reduction
  • Sequential: multiple reads/writes to slow memory
[Figure: each iteration chains a SpMV and an orthogonalization step.]
Dependencies between communication-bound kernels in each iteration limit performance!
Example: Classical conjugate gradient (CG)
Given: initial approximation $x_0$ for solving $Ax = b$
Let $p_0 = r_0 = b - Ax_0$
for $m = 1, 2, \ldots$, until convergence do
    $\alpha_{m-1} = (r_{m-1}^T r_{m-1}) / (p_{m-1}^T A p_{m-1})$
    $x_m = x_{m-1} + \alpha_{m-1} p_{m-1}$
    $r_m = r_{m-1} - \alpha_{m-1} A p_{m-1}$
    $\beta_m = (r_m^T r_m) / (r_{m-1}^T r_{m-1})$
    $p_m = r_m + \beta_m p_{m-1}$
end for

SpMVs and inner products require communication in each iteration!
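As a concrete reference, here is a minimal NumPy sketch of these recurrences; the function name, the dense matrix type, and the tracking of the true-vs-updated residual gap are illustrative additions, not from the talk:

import numpy as np

def cg(A, b, x0, tol=1e-12, maxiter=1000):
    # Classical CG (Hestenes-Stiefel). Also records ||b - A x_m - r_m||,
    # the deviation of the true and updated residuals discussed later.
    x = x0.copy()
    r = b - A @ x                # updated residual, maintained by recurrence
    p = r.copy()
    rr = r @ r
    gap = []
    for _ in range(maxiter):
        Ap = A @ p               # SpMV: neighbor communication in parallel
        alpha = rr / (p @ Ap)    # inner product: one global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r           # inner product: another global reduction
        beta = rr_new / rr
        p = r + beta * p
        rr = rr_new
        gap.append(np.linalg.norm(b - A @ x - r))
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
    return x, gap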
Communication-avoiding KSMs (CA-KSMs)
• Krylov methods can be reorganized to reduce communication cost by a factor of $O(s)$
• Many CA-KSMs ($s$-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR
Related references:
(Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and
Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991),
(Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, Reichel,
1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der
Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen,
2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013),
(Feuerriegel and Bücker, 2013).
• The reduction in communication can translate to speedups on practical problems
• Recent results: 4.2x speedup with $s = 4$ for CA-BICGSTAB in the GMG bottom-solve; 2.5x in the overall GMG solve (24,576 cores, $n = 1024^3$, on a Cray XE6) (Williams et al., 2014).
Benchmark timing breakdown
• Plot: net time spent on different operations over one MG solve using 24,576 cores
• Hopper at NERSC (Cray XE6): four 6-core Opteron chips per node, Gemini network, 3D torus
• 3D Helmholtz equation
• CA-BICGSTAB with $s = 4$
CA-CG overview
Starting at iteration $sk$ for $k \ge 0$, $s > 0$, it can be shown that to compute the next $s$ steps (iterations $sk + j$ with $j \le s$),
$p_{sk+j},\ r_{sk+j},\ x_{sk+j} - x_{sk} \in \mathcal{K}_{s+1}(A, p_{sk}) + \mathcal{K}_s(A, r_{sk}).$

1. Compute the basis matrix
$Y_k = [\rho_0(A)p_{sk}, \ldots, \rho_s(A)p_{sk},\ \rho_0(A)r_{sk}, \ldots, \rho_{s-1}(A)r_{sk}]$
of dimension $n$-by-$(2s+1)$, where $\rho_j$ is a polynomial of degree $j$. This gives the recurrence relation
$A\underline{Y}_k = Y_k B_k,$
where $B_k$ is $(2s+1)$-by-$(2s+1)$ and 2x2 block diagonal with upper Hessenberg blocks, and $\underline{Y}_k$ is $Y_k$ with columns $s+1$ and $2s+1$ set to 0.
• Communication cost: same latency cost as one SpMV using the matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter)
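To make the recurrence concrete, here is a small NumPy sketch that builds the monomial basis ($\rho_j(z) = z^j$) and the corresponding $B_k$; the helper name and sanity check are illustrative, and a practical implementation would use better-conditioned polynomials and the matrix powers kernel described next:

import numpy as np

def monomial_basis(A, p, r, s):
    # Y = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r], of dimension n-by-(2s+1).
    n = len(p)
    Y = np.zeros((n, 2 * s + 1))
    Y[:, 0] = p
    for j in range(1, s + 1):
        Y[:, j] = A @ Y[:, j - 1]
    Y[:, s + 1] = r
    for j in range(s + 2, 2 * s + 1):
        Y[:, j] = A @ Y[:, j - 1]
    # For the monomial basis the upper Hessenberg blocks of B reduce to
    # shift matrices: ones on the subdiagonal, decoupled between blocks.
    B = np.diag(np.ones(2 * s), -1)
    B[s + 1, s] = 0.0
    Yhat = Y.copy()
    Yhat[:, [s, 2 * s]] = 0.0    # columns s+1 and 2s+1 (1-indexed) set to 0
    return Y, Yhat, B

# In exact arithmetic, A @ Yhat == Y @ B. Newton or Chebyshev bases fill in
# more of the Hessenberg structure (diagonal / three-term recurrence entries).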
The matrix powers kernel
Avoids communication:
• In serial, by exploiting temporal locality:
  • Reading $A$, reading $Y = [x, Ax, A^2 x, \ldots, A^s x]$
• In parallel, by doing only 1 'expand' phase (instead of $s$)
• Requires a sufficiently low 'surface-to-volume' ratio

[Figure: tridiagonal example of the sequential and parallel matrix powers kernel computing $[v, Av, A^2v, A^3v]$; black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies. Also works for general graphs!]
CA-CG overview
2. Orthogonalize: encode dot products between basis vectors by computing the Gram matrix $G_k = Y_k^T Y_k$ of dimension $(2s+1)$-by-$(2s+1)$ (or compute a Tall-Skinny QR)
• Communication cost: one global reduction

3. Perform $s$ iterations of updates
• Using $Y_k$ and $G_k$, this requires no communication
• Represent $n$-vectors by their length-$(2s+1)$ coordinates in $Y_k$:
$x_{sk+j} - x_{sk} = Y_k x'_{k,j}, \quad r_{sk+j} = Y_k r'_{k,j}, \quad p_{sk+j} = Y_k p'_{k,j}$
• Perform $s$ iterations of updates on the coordinate vectors

4. Basis change to recover CG vectors
• Compute $[x_{sk+s} - x_{sk},\ r_{sk+s},\ p_{sk+s}] = Y_k [x'_{k,s},\ r'_{k,s},\ p'_{k,s}]$ locally; no communication
No communication in inner loop!

CG:
for $j = 1, \ldots, s$ do
    $\alpha_{sk+j-1} = (r_{sk+j-1}^T r_{sk+j-1}) / (p_{sk+j-1}^T A p_{sk+j-1})$
    $x_{sk+j} = x_{sk+j-1} + \alpha_{sk+j-1} p_{sk+j-1}$
    $r_{sk+j} = r_{sk+j-1} - \alpha_{sk+j-1} A p_{sk+j-1}$
    $\beta_{sk+j} = (r_{sk+j}^T r_{sk+j}) / (r_{sk+j-1}^T r_{sk+j-1})$
    $p_{sk+j} = r_{sk+j} + \beta_{sk+j} p_{sk+j-1}$
end for

CA-CG:
Compute $Y_k$ such that $A\underline{Y}_k = Y_k B_k$
Compute $G_k = Y_k^T Y_k$
for $j = 1, \ldots, s$ do
    $\alpha_{sk+j-1} = (r'^T_{k,j-1} G_k r'_{k,j-1}) / (p'^T_{k,j-1} G_k B_k p'_{k,j-1})$
    $x'_{k,j} = x'_{k,j-1} + \alpha_{sk+j-1} p'_{k,j-1}$
    $r'_{k,j} = r'_{k,j-1} - \alpha_{sk+j-1} B_k p'_{k,j-1}$
    $\beta_{sk+j} = (r'^T_{k,j} G_k r'_{k,j}) / (r'^T_{k,j-1} G_k r'_{k,j-1})$
    $p'_{k,j} = r'_{k,j} + \beta_{sk+j} p'_{k,j-1}$
end for
$[x_{sk+s} - x_{sk},\ r_{sk+s},\ p_{sk+s}] = Y_k [x'_{k,s},\ r'_{k,s},\ p'_{k,s}]$
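Putting steps 1-4 together, here is a compact NumPy sketch of CA-CG with the monomial basis, reusing monomial_basis() from the earlier sketch; names and the convergence test are illustrative, and this mirrors the mathematics rather than a tuned parallel implementation:

import numpy as np

def ca_cg(A, b, x0, s=4, max_outer=100, tol=1e-10):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for k in range(max_outer):
        Y, Yhat, B = monomial_basis(A, p, r, s)  # one matrix-powers call
        G = Y.T @ Y                              # one global reduction
        # coordinates of p_sk, r_sk, and x_sk+j - x_sk in the basis Y:
        pc = np.zeros(2 * s + 1); pc[0] = 1.0
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2 * s + 1)
        rGr = rc @ (G @ rc)
        for j in range(s):                       # no communication here
            alpha = rGr / (pc @ (G @ (B @ pc)))
            xc = xc + alpha * pc
            rc = rc - alpha * (B @ pc)
            rGr_new = rc @ (G @ rc)
            beta = rGr_new / rGr
            pc = rc + beta * pc
            rGr = rGr_new
        x = x + Y @ xc                           # basis change: local, no comm.
        r = Y @ rc
        p = Y @ pc
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
    return x

In a distributed setting, only the basis computation and the single Gram-matrix reduction communicate; the $s$ inner iterations operate on length-$(2s+1)$ coordinate vectors held redundantly on every processor.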
Krylov methods in finite precision
• Finite precision errors affect algorithm behavior (known since Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for an in-depth survey)
• Lanczos:
  • Basis vectors lose orthogonality
  • Appearance of multiple Ritz approximations to some eigenvalues
  • Delays convergence of Ritz values to other eigenvalues
• CG:
  • Residual vectors (proportional to Lanczos basis vectors) lose orthogonality
  • Appearance of duplicate Ritz values delays convergence of the CG approximate solution
  • Residual vectors $r_k$ and true residuals $b - Ax_k$ deviate
  • Limits the maximum attainable accuracy of the approximate solution
CA-Krylov methods in finite precision
• CA-KSMs are mathematically equivalent to classical KSMs
• But convergence delay and loss of accuracy get worse with increasing $s$!
• Obstacle to solving practical problems:
  • Decrease in attainable accuracy → some problems that the classical KSM can solve can't be solved with the CA variant
  • Delay of convergence → if the number of iterations increases more than the time per iteration decreases due to CA techniques, no speedup expected!
This raises the questions…
For CA-KSMs, for solving linear systems:
• How bad can the effects of roundoff error be?
  • Bound on maximum attainable accuracy for CA-CG and CA-BICG
• And what can we do about it?
  • Residual replacement strategy: uses the bound to improve accuracy to $O(\varepsilon)\|A\|\|x\|$ in CA-CG and CA-BICG
This raises the questions…
For CA-KSMs, for eigenvalue problems:
• How bad can the effects of roundoff error be?
  • Extension of Chris Paige's analysis of Lanczos to CA variants:
    • Bounds on local rounding errors in CA-Lanczos
    • Bounds on accuracy and convergence of Ritz values
    • Loss of orthogonality → convergence of Ritz values
• And what can we do about it?
  • Some ideas on how to use this new analysis:
    • Basis orthogonalization
    • Dynamically select/change basis size (the $s$ parameter)
    • Mixed precision variants
Maximum attainable accuracy of CG
• In classical CG, iterates are updated by
$x_m = x_{m-1} + \alpha_{m-1} p_{m-1}$ and $r_m = r_{m-1} - \alpha_{m-1} A p_{m-1}$
• The formulas for $x_m$ and $r_m$ do not depend on each other – rounding errors cause the true residual, $b - Ax_m$, and the updated residual, $r_m$, to deviate
• The size of the true residual is bounded by
$\|b - Ax_m\| \le \|r_m\| + \|b - Ax_m - r_m\|$
• When $\|r_m\| \gg \|b - Ax_m - r_m\|$, $r_m$ and $b - Ax_m$ have similar magnitude
• When $r_m \to 0$, $\|b - Ax_m\|$ depends on $\|b - Ax_m - r_m\|$
• Many results on attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst and Fokkema (1994), Sleijpen, van der Vorst and Modersitzki (2001), Björck, Elfving and Strakoš (1998), and Gutknecht and Strakoš (2000)
[Plots: comparison of convergence of the true and updated residuals for CG vs. CA-CG using a monomial basis, for various $s$ values. Model problem (2D Poisson on a $256^2$ grid).]
• Better conditioned polynomial bases can be used instead of monomial
• Two common choices: Newton and Chebyshev – see, e.g., (Philippe and Reichel, 2012)
• Behavior closer to that of classical CG
• But can still see some loss of attainable accuracy compared to CG
• Clearer for higher $s$ values
Residual replacement strategy for CG
• van der Vorst and Ye (1999): improve accuracy by replacing the updated residual $r_m$ by the true residual $b - Ax_m$ in certain iterations, combined with a group update
• Choose when to replace $r_m$ with $b - Ax_m$ to meet two constraints:
  1. Replace often enough so that at termination, $\|b - Ax_m - r_m\|$ is small relative to $\varepsilon N \|A\| \|x_m\|$
  2. Don't replace so often that the original convergence mechanism of the updated residuals is destroyed (avoid large perturbations to the finite precision CG recurrence)
Residual replacement condition for CG
Tong and Ye (2000): in finite precision classical CG, in iteration $m$, the computed residual and search direction vectors satisfy
$r_m = r_{m-1} - \alpha_{m-1} A p_{m-1} + \eta_m$
$p_m = r_m + \beta_m p_{m-1} + \tau_m$
Then in matrix form,
$A Z_m = Z_m T_m - \frac{1}{\alpha'_m} \frac{r_{m+1}}{\|r_0\|} e_{m+1}^T + F_m$
with $Z_m = \left[ \frac{r_0}{\|r_0\|}, \ldots, \frac{r_m}{\|r_m\|} \right]$,
where $T_m$ is invertible and tridiagonal, $\alpha'_m = e_m^T T_m^{-1} e_1$, and $F_m = [f_0, \ldots, f_m]$, with
$f_m = \frac{A\tau_m}{\|r_m\|} + \frac{1}{\alpha_m} \frac{\eta_{m+1}}{\|r_m\|} - \frac{\beta_m}{\alpha_{m-1}} \frac{\eta_m}{\|r_m\|}$
Residual replacement condition for CG
Tong and Ye (2000): if the sequence $r_m$ satisfies
$A Z_m = Z_m T_m - \frac{1}{\alpha'_m} \frac{r_{m+1}}{\|r_0\|} e_{m+1}^T + F_m, \qquad f_m = \frac{A\tau_m}{\|r_m\|} + \frac{1}{\alpha_m} \frac{\eta_{m+1}}{\|r_m\|} - \frac{\beta_m}{\alpha_{m-1}} \frac{\eta_m}{\|r_m\|},$
then
$\|r_{m+1}\| \le (1 + \mathcal{K}_m) \min_{\rho \in \mathcal{P}_m,\ \rho(0)=1} \|\rho(A + \Delta A_m)\, r_1\|,$
where $\mathcal{K}_m = \|(A Z_m - F_m)\, T_m^{-1}\| \cdot \|Z_{m+1}^+\|$ and $\Delta A_m = -F_m Z_m^+$.
• As long as $r_m$ satisfies the recurrence, the bound on its norm holds regardless of how it is generated
• Can replace $r_m$ by $\mathrm{fl}(b - Ax_m)$ and still expect convergence whenever
$\eta_m = \mathrm{fl}(b - Ax_m) - (r_{m-1} - \alpha_{m-1} A p_{m-1})$
is not too large relative to $\|r_m\|$ and $\|r_{m-1}\|$ (van der Vorst & Ye, 1999)
Residual replacement strategy for CG
• (van der Vorst and Ye, 1999): use a computable bound for $\|b - Ax_m - r_m\|$ to update an error estimate $d_m$ in each iteration
Pseudo-code for residual replacement with group update for CG:

if $d_{m-1} \le \varepsilon \|r_{m-1}\|$ and $d_m > \varepsilon \|r_m\|$ and $d_m > 1.1\, d_{init}$
    $z = z + x_m$            (group update of approximate solution)
    $x_m = 0$
    $r_m = b - Az$           (set residual to true residual)
    $d_{init} = d_m = \varepsilon (N \|A\| \|z\| + \|r_m\|)$
end

• Assuming a bound on the condition number of $A$, if the updated residual converges to $O(\varepsilon)\|A\|\|x\|$, the true residual is reduced to $O(\varepsilon)\|A\|\|x\|$
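A runnable sketch of this strategy, assuming a deliberately simplified error-estimate update: the real $d_m$ update uses van der Vorst and Ye's computable bound, while here it is approximated with directly computed norms, so the constants are illustrative:

import numpy as np

def cg_rr(A, b, x0, N, eps=2.0**-53, tol=1e-14, maxiter=1000):
    # CG with residual replacement and group update (van der Vorst & Ye, 1999).
    # N bounds the maximum number of nonzeros per row of A.
    normA = np.linalg.norm(A, 2)   # estimated only once
    z = np.zeros_like(b)           # group-updated part of the solution
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    d = d_init = eps * (N * normA * np.linalg.norm(z + x) + np.linalg.norm(r))
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x = x + alpha * p
        d_prev, rr_prev = d, rr
        r = r - alpha * Ap
        rr = r @ r
        # simplified stand-in for the computable deviation-bound update:
        d = d + eps * (N * normA * np.linalg.norm(z + x) + np.linalg.norm(r))
        if (d_prev <= eps * np.sqrt(rr_prev) and d > eps * np.sqrt(rr)
                and d > 1.1 * d_init):
            z = z + x                    # group update of approximate solution
            x = np.zeros_like(x)
            r = b - A @ z                # set residual to true residual
            rr = r @ r
            d = d_init = eps * (N * normA * np.linalg.norm(z) + np.linalg.norm(r))
        if np.sqrt(rr) <= tol * np.linalg.norm(b):
            break
        beta = rr / rr_prev
        p = r + beta * p
    return z + x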
Sources of roundoff error in CA-CG
Computing the $s$-step Krylov basis:
$A\underline{Y}_k = Y_k B_k + \Delta Y_k$    (error in computing the $s$-step basis)

Updating coordinate vectors in the inner loop:
$x'_{k,j} = x'_{k,j-1} + q'_{k,j-1} + \xi_{k,j}$
$r'_{k,j} = r'_{k,j-1} - B_k q'_{k,j-1} + \eta_{k,j}$    (error in updating coefficient vectors)
with $q'_{k,j-1} = \mathrm{fl}(\alpha_{sk+j-1} p'_{k,j-1})$

Recovering CG vectors for use in the next outer loop:
$x_{sk+j} = Y_k x'_{k,j} + x_{sk} + \phi_{sk+j}$
$r_{sk+j} = Y_k r'_{k,j} + \psi_{sk+j}$    (error in basis change)
Maximum attainable accuracy of CA-CG
• We can write the deviation of the true and updated residuals in terms of these errors:
$b - Ax_{sk+j} - r_{sk+j} = b - Ax_0 - r_0$
$\quad - \sum_{\ell=0}^{k-1} \left[ A\phi_{s\ell+s} + \psi_{s\ell+s} + \sum_{i=1}^{s} \left( A Y_\ell\, \xi_{\ell,i} + Y_\ell\, \eta_{\ell,i} - \Delta Y_\ell\, q'_{\ell,i-1} \right) \right]$
$\quad - A\phi_{sk+j} - \psi_{sk+j} - \sum_{i=1}^{j} \left( A Y_k\, \xi_{k,i} + Y_k\, \eta_{k,i} - \Delta Y_k\, q'_{k,i-1} \right)$
• Using standard rounding error results, this allows us to obtain an upper bound on $\|b - Ax_{sk+j} - r_{sk+j}\|$
A computable bound
• We extend van der Vorst and Ye's residual replacement strategy to CA-CG
• Making use of the bound for $\|b - Ax_{sk+j} - r_{sk+j}\|$ in CA-CG, update the error estimate $d_{sk+j}$ by:

$d_{sk+j} \equiv d_{sk+j-1}$
$\quad + \varepsilon (4 + N') \left( \|A\| \cdot \| |Y_k|\, |x'_{k,j}| \| + \| |Y_k|\, |B_k|\, |x'_{k,j}| \| + \| |Y_k|\, |r'_{k,j}| \| \right)$
$\quad + \varepsilon \cdot \begin{cases} \|A\| \|x_{sk+s}\| + (2 + 2N')\, \|A\| \cdot \| |Y_k|\, |x'_{k,s}| \| + N' \| |Y_k|\, |r'_{k,s}| \|, & j = s \\ 0, & \text{otherwise} \end{cases}$

where $N' = \max(N, 2s+1)$.

• Extra computation, all lower order terms: $O(s^3)$ flops per $s$ iterations with at most 1 additional reduction per $s$ iterations, plus $O(n + s^3)$ flops per $s$ iterations with no communication; $\|A\|$ is estimated only once. Communication is increased by a factor of at most 2.
Residual replacement for CA-CG
• Use the same replacement condition as van der Vorst and Ye (1999):
$d_{sk+j-1} \le \varepsilon \|r_{sk+j-1}\|$ and $d_{sk+j} > \varepsilon \|r_{sk+j}\|$ and $d_{sk+j} > 1.1\, d_{init}$
• (It is possible we could develop a better replacement condition based on the perturbation in the finite precision CA-CG recurrence)

Pseudo-code for residual replacement with group update for CA-CG:

if $d_{sk+j-1} \le \varepsilon \|r_{sk+j-1}\|$ and $d_{sk+j} > \varepsilon \|r_{sk+j}\|$ and $d_{sk+j} > 1.1\, d_{init}$
    $z = z + Y_k x'_{k,j} + x_{sk}$
    $x_{sk+j} = 0$
    $r_{sk+j} = b - Az$
    $d_{init} = d_{sk+j} = \varepsilon \left( (1 + 2N') \|A\| \|z\| + \|r_{sk+j}\| \right)$
    $p_{sk+j} = Y_k p'_{k,j}$
    break from inner loop and begin new outer loop
end
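As a sketch, the estimate and condition above can be transcribed directly. This assumes the reconstruction of the update formula printed above; in a communication-avoiding implementation the quantities of the form $\| |Y_k| |x'| \|$ would be obtained from small precomputed Gram-like matrices rather than from $n$-vectors as done here:

import numpy as np

def update_d(d, eps, normA, Y, B, xc, rc, x_sk, Nprime, j, s):
    # One step of the deviation estimate d_{sk+j} for CA-CG.
    absY, absB = np.abs(Y), np.abs(B)
    d = d + eps * (4 + Nprime) * (
            normA * np.linalg.norm(absY @ np.abs(xc))
            + np.linalg.norm(absY @ (absB @ np.abs(xc)))
            + np.linalg.norm(absY @ np.abs(rc)))
    if j == s:   # extra terms from the basis change at the end of the outer loop
        d = d + eps * (normA * np.linalg.norm(x_sk + Y @ xc)
                       + (2 + 2 * Nprime) * normA * np.linalg.norm(absY @ np.abs(xc))
                       + Nprime * np.linalg.norm(absY @ np.abs(rc)))
    return d

def should_replace(d_prev, d, eps, norm_r_prev, norm_r, d_init):
    # Replacement condition of van der Vorst & Ye (1999), reused for CA-CG.
    return d_prev <= eps * norm_r_prev and d > eps * norm_r and d > 1.1 * d_init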
Residual replacement indices and total number of reductions (model problem):
s = 4:  CA-CG Mono. 354 (203);  CA-CG Newt. 353 (196);  CA-CG Cheb. 365 (197)
s = 8:  CA-CG Mono. 224, 334, 401, 517 (157);  CA-CG Newt. 340 (102);  CA-CG Cheb. 353 (99)
s = 12: CA-CG Mono. 135, 2119 (557);  CA-CG Newt. 326 (68);  CA-CG Cheb. 346 (71)
CG:     355 (669)
(Each entry lists the iterations at which replacement occurred, with the total number of reductions in parentheses.)

• The number of replacements is small compared to the total number of iterations → RR doesn't significantly affect the convergence rate
• Convergence rate depends on the basis
• In addition to attainable accuracy, communication savings are incredibly important in practical implementations
• Can have both speed and accuracy!
consph8, FEM/Spheres (from UFSMC)
$n = 8.3 \cdot 10^4$, $nnz = 6.0 \cdot 10^6$, $\kappa(A) = 9.7 \cdot 10^3$, $\|A\| = 9.7$
[Plots: convergence before and after residual replacement.]
# Replacements, $s = 8$: Classical 2, Monomial 12, Newton 2, Chebyshev 2
# Replacements, $s = 12$: Classical 2, Monomial 0, Newton 4, Chebyshev 3
xenon1, materials (from UFSMC)
$n = 4.9 \cdot 10^4$, $nnz = 1.2 \cdot 10^6$, $\kappa(A) = 3.3 \cdot 10^4$, $\|A\| = 3.2$
[Plots: convergence before and after residual replacement.]
# Replacements, $s = 4$: Classical 2, Monomial 1, Newton 1, Chebyshev 1
# Replacements, $s = 8$: Classical 2, Monomial 5, Newton 2, Chebyshev 1
Preconditioning for CA-KSMs
• Tradeoff: preconditioning speeds up convergence, but increases time per iteration due to communication!
• For each specific application, must evaluate the tradeoff between preconditioner quality and sparsity of the system
• Good news: many preconditioners allow a communication-avoiding approach
  • Block Jacobi – block diagonals
  • Sparse approximate inverse (SAI) – same sparsity as $A$; recent work for CA-BICGSTAB by Mehri (2014)
  • Polynomial preconditioning (Saad, 1985)
  • HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014)
  • CA-ILU(0) – recent work by Moufawad and Grigori (2013)
  • Deflation for CA-CG (C., Knight, Demmel, 2014), based on the deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014)
Deflated CA-CG, model problem
[Plots: deflated CA-CG with monomial basis, $s = 4, 8, 10$.]
Matrix: 2D Laplacian(512), $N = 262{,}144$. Right-hand side set such that the true solution has entries $x_i = 1/\sqrt{n}$. Deflated CG algorithm (DCG) from (Saad et al., 2000).
Eigenvalue problems: CA-Lanczos
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
As $s$ increases, both the convergence rate and the accuracy to which we can find approximate eigenvalues within $n$ iterations decrease.
Eigenvalue problems: CA-Lanczos
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
Improved by better-conditioned bases, but we still see decreased convergence rate and accuracy as $s$ grows.
Paige’s Lanczos convergence analysis
๐‘‡ + ๐›ฟ๐‘‰
๐ด๐‘‰๐‘š = ๐‘‰๐‘š ๐‘‡๐‘š + ๐›ฝ๐‘š+1 ๐‘ฃ๐‘š+1 ๐‘’๐‘š
๐‘š
๐‘‰๐‘š = ๐‘ฃ1 , … , ๐‘ฃ๐‘š ,
๐›ฟ ๐‘‰๐‘š = ๐›ฟ ๐‘ฃ1 , … , ๐›ฟ ๐‘ฃ๐‘š ,
๐‘‡๐‘š =
๐›ผ1
๐›ฝ2
๐›ฝ2
โ‹ฑ
โ‹ฑ
โ‹ฑ
โ‹ฑ
๐›ฝ๐‘š
๐›ฝ๐‘š
๐›ผ๐‘š
for ๐‘– ∈ {1, … , ๐‘š},
Classic Lanczos rounding
error result of Paige (1976):
2
๐›ฝ๐‘–+1
where ๐œŽ ≡ ๐ด
2,
๐œƒ๐œŽ ≡
๐ด
๐›ฟ ๐‘ฃ๐‘– 2 ≤ ๐œ€1 ๐œŽ
๐›ฝ๐‘–+1 ๐‘ฃ๐‘–๐‘‡ ๐‘ฃ๐‘–+1 ≤ 2๐œ€0 ๐œŽ
๐‘‡
๐‘ฃ๐‘–+1
๐‘ฃ๐‘–+1 − 1 ≤ ๐œ€0 2
+ ๐›ผ๐‘–2 + ๐›ฝ๐‘–2 − ๐ด๐‘ฃ๐‘– 22 ≤ 4๐‘– 3๐œ€0 + ๐œ€1 ๐œŽ 2
2 , ๐œ€0
≡ 2๐œ€ ๐‘› + 4 , and ๐œ€1 ≡ 2๐œ€ ๐‘๐œƒ + 7
๐œ€0 = ๐‘‚ ๐œ€๐‘›
๐œ€1 = ๐‘‚ ๐œ€๐‘๐œƒ
๏‚ฎ These results form the basis for Paige’s influential results in (Paige, 1980).
35
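For concreteness, a sketch that runs the standard three-term Lanczos recurrence and records the two local quantities bounded above; names are illustrative, and measurements of this kind underlie the plots a few slides ahead:

import numpy as np

def lanczos_local_errors(A, v0, m):
    # Lanczos recurrence, recording |beta_{i+1} v_i^T v_{i+1}| (local loss of
    # orthogonality) and |v_{i+1}^T v_{i+1} - 1| (local loss of normalization).
    v_prev = np.zeros_like(v0)
    v = v0 / np.linalg.norm(v0)
    beta = 0.0
    errs = []
    for _ in range(m):
        w = A @ v - beta * v_prev
        alpha = v @ w
        w = w - alpha * v
        beta_next = np.linalg.norm(w)
        v_next = w / beta_next
        errs.append((abs(beta_next * (v @ v_next)),
                     abs(v_next @ v_next - 1.0)))
        v_prev, v, beta = v, v_next, beta_next
    return errs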
CA-Lanczos convergence analysis
Let $\Gamma \equiv \max_{\ell \le k} \|Y_\ell^+\|_2 \cdot \| |Y_\ell| \|_2 \le (2s+1) \cdot \max_{\ell \le k} \kappa(Y_\ell)$.

For CA-Lanczos, for $i \in \{1, \ldots, m = sk+j\}$, we have:
$\|\delta v_i\|_2 \le \varepsilon_1 \sigma$
$|\beta_{i+1} v_i^T v_{i+1}| \le 2\varepsilon_0 \sigma$
$|v_{i+1}^T v_{i+1} - 1| \le \varepsilon_0 / 2$
$|\beta_{i+1}^2 + \alpha_i^2 + \beta_i^2 - \|A v_i\|_2^2| \le 4i(3\varepsilon_0 + \varepsilon_1)\sigma^2$

with
$\varepsilon_0 \equiv 2\varepsilon(n + 11s + 15)\Gamma^2 = O(\varepsilon n \Gamma^2)$   (vs. $O(\varepsilon n)$ for Lanczos)
$\varepsilon_1 \equiv 2\varepsilon\left( (N + 2s + 5)\theta + (4s + 9)\tau + 10s + 16 \right)\Gamma = O(\varepsilon N \theta \Gamma)$   (vs. $O(\varepsilon N \theta)$ for Lanczos)

where $\sigma \equiv \|A\|_2$, $\theta\sigma \equiv \| |A| \|_2$, and $\tau\sigma \equiv \max_{\ell \le k} \| |B_\ell| \|_2$.
The amplification term
• Our definition of the amplification term was
$\Gamma \equiv \max_{\ell \le k} \|Y_\ell^+\|_2 \cdot \| |Y_\ell| \|_2 \le (2s+1) \cdot \max_{\ell \le k} \kappa(Y_\ell)$
where we want $\| |Y| |y'| \|_2 \le \Gamma \|Y y'\|_2$ to hold for the computed basis $Y$ and any coordinate vector $y'$ in every iteration.
• A better, more descriptive estimate for $\Gamma$ is possible with tighter bounds; requires some light bookkeeping
• Example: for the bounds on $|\beta_{i+1} v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$, we can use the definition
$\Gamma_{k,j} = \max_{x \in \{w'_{k,j},\, u'_{k,j},\, v'_{k,j},\, v'_{k,j-1}\}} \frac{\| |Y_k| |x| \|_2}{\|Y_k x\|_2}$
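A small NumPy sketch of both estimates; function names are illustrative, and the tighter per-iteration version takes whichever coordinate vectors the bound in question involves:

import numpy as np

def gamma_global(Y):
    # Gamma = ||Y^+||_2 * || |Y| ||_2 for one computed basis Y.
    sigma_min = np.linalg.svd(Y, compute_uv=False)[-1]  # ||Y^+||_2 = 1/sigma_min
    return np.linalg.norm(np.abs(Y), 2) / sigma_min

def gamma_tight(Y, coord_vectors):
    # Tighter estimate: max over the coordinate vectors x actually used in the
    # bound, of || |Y| |x| ||_2 / || Y x ||_2.
    return max(np.linalg.norm(np.abs(Y) @ np.abs(x)) / np.linalg.norm(Y @ x)
               for x in coord_vectors)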
Local rounding errors, classical Lanczos
[Plots: measured values of $|\beta_{i+1} v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$ vs. the upper bounds of Paige (1976).]
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
Local rounding errors, CA-Lanczos, monomial basis, $s = 8$
[Plots: measured values of $|\beta_{i+1} v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$ vs. the upper bounds, along with the $\Gamma$ value.]
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
Local rounding errors, CA-Lanczos, Newton basis, $s = 8$
[Plots: measured values of $|\beta_{i+1} v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$ vs. the upper bounds, along with the $\Gamma$ value.]
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
Local rounding errors, CA-Lanczos, Chebyshev basis, $s = 8$
[Plots: measured values of $|\beta_{i+1} v_i^T v_{i+1}|$ and $|v_{i+1}^T v_{i+1} - 1|$ vs. the upper bounds, along with the $\Gamma$ value.]
Problem: 2D Poisson, $n = 256$, $\|A\| = 7.93$, with random starting vector
Paige’s results for classical Lanczos
• Using the bounds on local rounding errors in Lanczos, Paige showed that:
  1. The computed Ritz values always lie between the extreme eigenvalues of $A$ to within a small multiple of machine precision.
  2. At least one small interval containing an eigenvalue of $A$ is found by the $n$th iteration.
  3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
  4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.

Do the same statements hold for CA-Lanczos?
Results for CA-Lanczos
• The answer is YES! …but only if:
  • $Y_\ell$ is numerically full rank for $0 \le \ell \le k$, and
  • $\varepsilon_0 \equiv 2\varepsilon(n + 11s + 15)\Gamma^2 \le \frac{1}{12}$, i.e.,
    $\Gamma^2 \le \left( 24\epsilon(n + 11s + 15) \right)^{-1}$
• Otherwise, e.g., we can lose orthogonality due to computation with a rank-deficient basis
• How can we use this bound on $\Gamma$ to design a better algorithm?
Ideas based on CA-Lanczos analysis
• Explicit basis orthogonalization
  • Compute TSQR on $Y_k$ in the outer loop, use the $Q$ factor as the basis
  • Get $\Gamma = 1$ for only one additional reduction
• Dynamically changing basis size
  • Run an incremental condition estimate while computing the $s$-step basis; stop when $\Gamma^2 \le \left( 24\epsilon(n + 11s' + 15) \right)^{-1}$ is reached. Use this basis of size $s' \le s$ in this outer loop.
• Mixed precision
  • For inner products, use precision $\epsilon^{(0)}$ such that $\epsilon^{(0)} \le \left( 24\Gamma^2 (n + 11s + 15) \right)^{-1}$
  • For SpMVs, use precision $\epsilon^{(1)}$ such that $\epsilon^{(1)} \le \left( 4\Gamma \left( (N + 2s + 5)\theta + (4s + 9)\tau + 10s + 16 \right) \right)^{-1}$
Summary
• For high performance iterative methods, both the time per iteration and the number of iterations required are important.
• CA-KSMs offer asymptotic reductions in time per iteration, but can have the effect of increasing the number of iterations required (or causing instability).
• For CA-KSMs to be practical, we must better understand their finite precision behavior (theoretically and empirically), and develop ways to improve the methods.
• Ongoing efforts in: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA preconditioners, and other new techniques.
Thank you!
contact: erin@cs.berkeley.edu
http://www.cs.berkeley.edu/~ecc2z/