Communication-avoiding Krylov subspace methods in finite precision
Erin Carson, UC Berkeley
Advisor: James Demmel
ICME LA/Opt Seminar, Stanford University, December 4, 2014

Model problem: 2D Poisson on a 256^2 grid, κ(A) = 3.9e4, ‖A‖_2 = 1.99. Equilibration (diagonal scaling) used. Right-hand side set s.t. the elements of the true solution are x_i = 1/256.

Overview
• Roundoff error can cause a decrease in the attainable accuracy of Krylov subspace methods – this limits their applicability in practice!
• The effect can be worse for "communication-avoiding" (s-step) Krylov subspace methods.
• The residual replacement strategy of van der Vorst and Ye (1999) improves attainable accuracy for classical Krylov methods.
• We can extend this residual replacement strategy to communication-avoiding variants, combining it with inexpensive communication-avoiding techniques.
• Accuracy can be improved for minimal performance cost, with a convergence rate maintained closer to that of the classical method!

What is communication?
• Algorithms have two costs: communication and computation.
• Communication: moving data between levels of the memory hierarchy (sequential) or between processors (parallel).
• On modern computers, communication is expensive and computation is cheap:
  – Flop time << 1/bandwidth << latency
  – The communication bottleneck is a barrier to achieving scalability.
  – Communication is also expensive in terms of energy cost!
• Towards exascale: we must redesign algorithms to avoid communication.

Communication bottleneck in KSMs
• A Krylov subspace method is a projection process onto the Krylov subspace
    K_m(A, r_0) = span{r_0, A r_0, A^2 r_0, …, A^{m−1} r_0}.
• Used for linear systems, eigenvalue problems, singular value problems, least squares, etc.
• Best for: A large and very sparse, stored implicitly, or when only an approximation is needed.
• Projection process in each iteration:
  1. Add a dimension to K_m: sparse matrix-vector multiplication (SpMV).
     – Parallel: communicate vector entries with neighbors.
     – Sequential: read A and the vectors from slow memory.
  2. Orthogonalize (with respect to some L_m): inner products.
     – Parallel: global reduction.
     – Sequential: multiple reads/writes to slow memory.
• Dependencies between these communication-bound kernels in each iteration limit performance!

Example: Classical conjugate gradient (CG)
Given: initial approximation x_0 for solving Ax = b
Let p_0 = r_0 = b − A x_0
for m = 1, 2, …, until convergence do
    α_{m−1} = (r_{m−1}^T r_{m−1}) / (p_{m−1}^T A p_{m−1})
    x_m = x_{m−1} + α_{m−1} p_{m−1}
    r_m = r_{m−1} − α_{m−1} A p_{m−1}
    β_m = (r_m^T r_m) / (r_{m−1}^T r_{m−1})
    p_m = r_m + β_m p_{m−1}
end for
SpMVs and inner products require communication in each iteration!

Communication-avoiding KSMs (CA-KSMs)
• Krylov methods can be reorganized to reduce communication cost by O(s).
• Many CA-KSMs (s-step KSMs) in the literature: CG, GMRES, Orthomin, MINRES, Lanczos, Arnoldi, CGS, Orthodir, BICG, BICGSTAB, QMR.
• Related references: (Van Rosendale, 1983), (Walker, 1988), (Leland, 1989), (Chronopoulos and Gear, 1989), (Chronopoulos and Kim, 1990, 1992), (Chronopoulos, 1991), (Kim and Chronopoulos, 1991), (Joubert and Carey, 1992), (Bai, Hu, and Reichel, 1991), (Erhel, 1995), (De Sturler, 1991), (De Sturler and Van der Vorst, 1995), (Toledo, 1995), (Chronopoulos and Kincaid, 2001), (Hoemmen, 2010), (Philippe and Reichel, 2012), (C., Knight, Demmel, 2013), (Feuerriegel and Bücker, 2013).
• The reduction in communication can translate to speedups on practical problems.
• Recent results: 4.2x speedup with s = 4 for CA-BICGSTAB in the GMG bottom-solve; 2.5x in the overall GMG solve (24,576 cores, N = 1024^3 on a Cray XE6) (Williams, et al., 2014).
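As a point of reference for the s-step reformulation that follows, here is a minimal numpy sketch of the classical CG iteration from the pseudo-code above; the function name and convergence test are illustrative additions, not from the talk. The marked SpMV and inner products are exactly the communication-bound kernels.

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxiter=1000):
    """Classical CG for SPD A.  Each iteration does one SpMV and
    two inner products -- the communication-bound kernels."""
    x = x0.copy()
    r = b - A @ x                      # true initial residual
    p = r.copy()
    rr_old = r @ r                     # inner product (global reduction in parallel)
    for _ in range(maxiter):
        Ap = A @ p                     # SpMV (neighbor communication in parallel)
        alpha = rr_old / (p @ Ap)      # inner product (global reduction)
        x = x + alpha * p
        r = r - alpha * Ap             # updated (recurrence) residual
        rr_new = r @ r
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr_old) * p
        rr_old = rr_new
    return x
```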
Benchmark timing breakdown
• Plot: net time spent on different operations over one MG solve using 24,576 cores.
• Hopper at NERSC (Cray XE6), four 6-core Opteron chips per node, Gemini network, 3D torus.
• 3D Helmholtz equation.
• CA-BICGSTAB with s = 4.

CA-CG overview
Starting at iteration sk, for k ≥ 0 and s > 0, it can be shown that to compute the next s steps (iterations sk + j, where j ≤ s),
    p_{sk+j}, r_{sk+j}, x_{sk+j} − x_{sk} ∈ K_{s+1}(A, p_{sk}) + K_s(A, r_{sk}).
1. Compute the basis matrix
    Y_k = [ρ_0(A) p_{sk}, …, ρ_s(A) p_{sk}, ρ_0(A) r_{sk}, …, ρ_{s−1}(A) r_{sk}]
of dimension n-by-(2s+1), where ρ_i is a polynomial of degree i. This gives the recurrence relation A Ŷ_k = Y_k B_k, where B_k is (2s+1)-by-(2s+1) and 2×2 block diagonal with upper Hessenberg blocks, and Ŷ_k is Y_k with columns s+1 and 2s+1 set to 0.
• Communication cost: same latency cost as one SpMV using the matrix powers kernel (e.g., Hoemmen et al., 2007), assuming sufficient sparsity structure (low diameter).

The matrix powers kernel
Avoids communication:
• In serial, by exploiting temporal locality: reading A and reading V = [x, Ax, A^2 x, …, A^s x].
• In parallel, by doing only 1 'expand' phase (instead of s).
• Requires a sufficiently low 'surface-to-volume' ratio.
• Also works for general graphs!
[Figure: tridiagonal example of sequential and parallel computation of v, Av, A^2 v, A^3 v; black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies.]

CA-CG overview (continued)
2. Orthogonalize: encode dot products between basis vectors by computing the Gram matrix G_k = Y_k^T Y_k of dimension (2s+1)-by-(2s+1) (or compute a Tall-Skinny QR).
• Communication cost: one global reduction.
3. Perform s iterations of updates.
• Using Y_k and G_k, this requires no communication.
• Represent n-vectors by their length-(2s+1) coordinates in Y_k:
    x_{sk+j} − x_{sk} = Y_k x'_{k,j},  r_{sk+j} = Y_k r'_{k,j},  p_{sk+j} = Y_k p'_{k,j}
• Perform the s iterations of updates on the coordinate vectors.
4. Basis change to recover the CG vectors.
• Compute [x_{sk+j} − x_{sk}, r_{sk+j}, p_{sk+j}] = Y_k [x'_{k,j}, r'_{k,j}, p'_{k,j}] locally; no communication.
No communication in the inner loop!
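A hedged sketch of step 1 for the simplest (monomial) choice ρ_i(z) = z^i; the function name is illustrative, and plain repeated SpMVs stand in for a real matrix powers kernel, which would additionally block the communication.

```python
import numpy as np

def monomial_basis(A, p, r, s):
    """Step 1 with the monomial basis rho_i(z) = z^i:
    Y = [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r], n-by-(2s+1).
    (A real matrix powers kernel would avoid communication; plain
    repeated SpMVs are used here for clarity only.)"""
    n = len(p)
    Y = np.zeros((n, 2 * s + 1))
    Y[:, 0] = p
    for i in range(1, s + 1):
        Y[:, i] = A @ Y[:, i - 1]
    Y[:, s + 1] = r
    for i in range(s + 2, 2 * s + 1):
        Y[:, i] = A @ Y[:, i - 1]
    # B encodes the recurrence A*Yhat = Y*B, where Yhat is Y with
    # columns s+1 and 2s+1 (1-indexed) zeroed.  For the monomial
    # basis, B is 2x2 block diagonal with shift blocks.
    B = np.zeros((2 * s + 1, 2 * s + 1))
    for i in range(s):                  # A * (A^i p) = A^(i+1) p
        B[i + 1, i] = 1.0
    for i in range(s + 1, 2 * s):       # A * (A^i r) = A^(i+1) r
        B[i + 1, i] = 1.0
    return Y, B
```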
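And a sketch of steps 2–4 built on the routine above: one outer loop of CA-CG, in which the inner j-loop touches only length-(2s+1) coordinate vectors. In exact arithmetic this reproduces s steps of classical CG; names are illustrative.

```python
import numpy as np  # assumes monomial_basis from the sketch above

def ca_cg_outer_loop(A, x, r, p, s):
    """One outer loop of CA-CG (steps 2-4).  Inside the j-loop all
    arithmetic is on length-(2s+1) coordinate vectors: no SpMVs and
    no global reductions."""
    Y, B = monomial_basis(A, p, r, s)
    G = Y.T @ Y                         # Gram matrix: one global reduction
    m = 2 * s + 1
    xc = np.zeros(m)                    # coordinates of x_{sk+j} - x_{sk}
    rc = np.zeros(m); rc[s + 1] = 1.0   # r_{sk} is column s+1 (0-indexed) of Y
    pc = np.zeros(m); pc[0] = 1.0       # p_{sk} is column 0 of Y
    rGr = rc @ (G @ rc)
    for _ in range(s):
        Bp = B @ pc
        alpha = rGr / (pc @ (G @ Bp))
        xc += alpha * pc
        rc -= alpha * Bp
        rGr_new = rc @ (G @ rc)
        pc = rc + (rGr_new / rGr) * pc
        rGr = rGr_new
    # Step 4: basis change back to n-vectors (local, no communication).
    return x + Y @ xc, Y @ rc, Y @ pc
```

Repeating this outer loop (checking ‖r‖ between outer loops) gives the full method.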
CG vs. CA-CG
Classical CG (iterations sk + j):
for j = 1, …, s do
    α_{sk+j−1} = (r_{sk+j−1}^T r_{sk+j−1}) / (p_{sk+j−1}^T A p_{sk+j−1})
    x_{sk+j} = x_{sk+j−1} + α_{sk+j−1} p_{sk+j−1}
    r_{sk+j} = r_{sk+j−1} − α_{sk+j−1} A p_{sk+j−1}
    β_{sk+j} = (r_{sk+j}^T r_{sk+j}) / (r_{sk+j−1}^T r_{sk+j−1})
    p_{sk+j} = r_{sk+j} + β_{sk+j} p_{sk+j−1}
end for
CA-CG:
Compute Y_k such that A Ŷ_k = Y_k B_k
Compute G_k = Y_k^T Y_k
for j = 1, …, s do
    α_{sk+j−1} = (r'^T_{k,j−1} G_k r'_{k,j−1}) / (p'^T_{k,j−1} G_k B_k p'_{k,j−1})
    x'_{k,j} = x'_{k,j−1} + α_{sk+j−1} p'_{k,j−1}
    r'_{k,j} = r'_{k,j−1} − α_{sk+j−1} B_k p'_{k,j−1}
    β_{sk+j} = (r'^T_{k,j} G_k r'_{k,j}) / (r'^T_{k,j−1} G_k r'_{k,j−1})
    p'_{k,j} = r'_{k,j} + β_{sk+j} p'_{k,j−1}
end for
[x_{sk+s} − x_{sk}, r_{sk+s}, p_{sk+s}] = Y_k [x'_{k,s}, r'_{k,s}, p'_{k,s}]

Krylov methods in finite precision
• Finite precision errors have effects on algorithm behavior (known to Lanczos (1950); see, e.g., Meurant and Strakoš (2006) for an in-depth survey).
• Lanczos:
  – Basis vectors lose orthogonality.
  – Multiple Ritz approximations to some eigenvalues appear.
  – Convergence of Ritz values to other eigenvalues is delayed.
• CG:
  – Residual vectors (proportional to Lanczos basis vectors) lose orthogonality.
  – The appearance of duplicate Ritz values delays convergence of the CG approximate solution.
  – The residual vectors r_m and the true residuals b − A x_m deviate.
  – This limits the maximum attainable accuracy of the approximate solution.

CA-Krylov methods in finite precision
• CA-KSMs are mathematically equivalent to classical KSMs.
• But convergence delay and loss of accuracy get worse with increasing s!
• This is an obstacle to solving practical problems:
  – Decrease in attainable accuracy → some problems that the classical KSM can solve can't be solved with the CA variant.
  – Delay of convergence → if the number of iterations increases more than the time per iteration decreases due to CA techniques, no speedup can be expected!

This raises the questions… For CA-KSMs, for solving linear systems:
• How bad can the effects of roundoff error be?
  – Bound on the maximum attainable accuracy for CA-CG and CA-BICG.
• And what can we do about it?
  – Residual replacement strategy: uses the bound to improve accuracy to O(ε)‖A‖‖x‖ in CA-CG and CA-BICG.

This raises the questions… For CA-KSMs, for eigenvalue problems:
• How bad can the effects of roundoff error be?
  – Extension of Chris Paige's analysis of Lanczos to CA variants:
    – Bounds on local rounding errors in CA-Lanczos.
    – Bounds on accuracy and convergence of Ritz values.
    – Loss of orthogonality → convergence of Ritz values.
• And what can we do about it?
  – Some ideas on how to use this new analysis:
    – Basis orthogonalization.
    – Dynamically select/change the basis size (the s parameter).
    – Mixed precision variants.

Maximum attainable accuracy of CG
• In classical CG, iterates are updated by
    x_m = x_{m−1} + α_{m−1} p_{m−1}  and  r_m = r_{m−1} − α_{m−1} A p_{m−1}.
• The formulas for x_m and r_m do not depend on each other – rounding errors cause the true residual, b − A x_m, and the updated residual, r_m, to deviate.
• The size of the true residual is bounded by
    ‖b − A x_m‖ ≤ ‖r_m‖ + ‖b − A x_m − r_m‖.
• When ‖r_m‖ ≫ ‖b − A x_m − r_m‖, r_m and b − A x_m have similar magnitude.
• When r_m → 0, b − A x_m depends on ‖b − A x_m − r_m‖.
• Many results on attainable accuracy, e.g.: Greenbaum (1989, 1994, 1997), Sleijpen, van der Vorst, and Fokkema (1994), Sleijpen, van der Vorst, and Modersitzki (2001), Björck, Elfving, and Strakoš (1998), and Gutknecht and Strakoš (2000).

[Plot: comparison of the convergence of the true and updated residuals for CG vs. CA-CG using a monomial basis, for various s values, on the model problem (2D Poisson on a 256^2 grid).]
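The deviation ‖b − A x_m − r_m‖ in the bound above can be observed directly in a small experiment. The sketch below uses a 1D Poisson matrix as an illustrative stand-in for the slides' 2D model problem; sizes and iteration counts are arbitrary choices.

```python
import numpy as np

def poisson_1d(n):
    """Small SPD test matrix (1D Poisson; an illustrative stand-in
    for the slides' 2D model problem)."""
    return (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
            - np.diag(np.ones(n - 1), -1))

n = 200
A = poisson_1d(n)
x_true = np.full(n, 1.0 / np.sqrt(n))
b = A @ x_true
x = np.zeros(n); r = b.copy(); p = r.copy()
rr = r @ r
for m in range(400):
    Ap = A @ p
    alpha = rr / (p @ Ap)
    x += alpha * p
    r -= alpha * Ap                            # updated residual
    deviation = np.linalg.norm(b - A @ x - r)  # true minus updated
    rr_new = r @ r
    p = r + (rr_new / rr) * p
    rr = rr_new
# Once ||r|| falls below the deviation, the true residual stagnates:
print(np.linalg.norm(r), deviation)
```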
Better conditioned bases
• Better conditioned polynomial bases can be used instead of the monomial basis.
• Two common choices: Newton and Chebyshev – see, e.g., (Philippe and Reichel, 2012).
• Behavior is closer to that of classical CG.
• But we can still see some loss of attainable accuracy compared to CG, clearer for higher s values.

Residual replacement strategy for CG
• van der Vorst and Ye (1999): improve accuracy by replacing the updated residual r_m with the true residual b − A x_m in certain iterations, combined with a group update.
• Choose when to replace r_m with b − A x_m to meet two constraints:
  1. Replace often enough that, at termination, ‖b − A x_m − r_m‖ is small relative to ε‖A‖‖x_m‖.
  2. Don't replace so often that the original convergence mechanism of the updated residuals is destroyed (avoid large perturbations to the finite precision CG recurrence).

Residual replacement condition for CG
Tong and Ye (2000): in finite precision classical CG, in iteration n, the computed residual and search direction vectors satisfy
    r_n = r_{n−1} − α_{n−1} A p_{n−1} + η_n,
    p_n = r_n + β_n p_{n−1} + τ_n.
Then, in matrix form,
    A Z_n = Z_n T_n − (1/α'_n) (r_{n+1}/‖r_0‖) e_{n+1}^T + F_n,
where T_n is invertible and tridiagonal, α'_n = α_n ‖r_n‖ ‖r_0‖^{−1}, Z_n = [r_0/‖r_0‖, …, r_n/‖r_n‖], and F_n = [f_0, …, f_n] with
    f_n = A τ_n/‖r_n‖ + (1/α_n) η_{n+1}/‖r_n‖ − (β_n/α_{n−1}) η_n/‖r_n‖.

Tong and Ye (2000): if the sequence r_n satisfies
    A Z_n = Z_n T_n − (1/α'_n) (r_{n+1}/‖r_0‖) e_{n+1}^T + F_n,
then
    ‖r_{n+1}‖ ≤ (1 + K_n) min_{ρ ∈ P_n, ρ(0)=1} ‖ρ(A + ΔA_n) r_1‖,
where K_n = ‖(A Z_n − F_n) T_n^{−1}‖ / ‖r_{n+1}‖ and ΔA_n = −F_n Z_n^+.
• As long as r_n satisfies the recurrence, the bound on its norm holds regardless of how r_n is generated.
• So we can replace r_n by fl(b − A x_n) and still expect convergence whenever
    η_n = fl(b − A x_n) − (r_{n−1} − α_{n−1} A p_{n−1})
is not too large relative to ‖r_n‖ and ‖r_{n−1}‖ (van der Vorst and Ye, 1999).

Residual replacement strategy for CG (continued)
• (van der Vorst and Ye, 1999): use a computable bound for ‖b − A x_n − r_n‖ to update an error estimate d_n in each iteration.
Pseudo-code for residual replacement with group update for CG:
if d_{n−1} ≤ ε‖r_{n−1}‖ and d_n > ε‖r_n‖ and d_n > 1.1 d_init
    z = z + x_n              (group update of approximate solution)
    x_n = 0
    r_n = b − A z            (set residual to true residual)
    d_init = d_n = ε(N ‖A‖ ‖z‖ + ‖r_n‖)
end
• Assuming a bound on the condition number of A, if the updated residual converges to O(ε)‖A‖‖x‖, the true residual is reduced to O(ε)‖A‖‖x‖.

Sources of roundoff error in CA-CG
Computing the s-step Krylov basis:
    A Ŷ_k = Y_k B_k + ΔY_k                              (error in computing the s-step basis)
Updating coordinate vectors in the inner loop:
    x'_{k,j} = x'_{k,j−1} + q'_{k,j−1} + ξ_{k,j}
    r'_{k,j} = r'_{k,j−1} − B_k q'_{k,j−1} + η_{k,j}    (error in updating coefficient vectors)
with q'_{k,j−1} = fl(α_{sk+j−1} p'_{k,j−1}).
Recovering CG vectors for use in the next outer loop:
    x_{sk+j} = Y_k x'_{k,j} + x_{sk} + φ_{sk+j}
    r_{sk+j} = Y_k r'_{k,j} + ψ_{sk+j}                  (error in basis change)

Maximum attainable accuracy of CA-CG
• We can write the deviation of the true and updated residuals in terms of these errors:
    b − A x_{sk+j} − r_{sk+j}
      = b − A x_0 − r_0
        − Σ_{ℓ=0}^{k−1} [ A φ_{sℓ+s} + ψ_{sℓ+s} + Σ_{i=1}^{s} ( A Y_ℓ ξ_{ℓ,i} + Y_ℓ η_{ℓ,i} − ΔY_ℓ q'_{ℓ,i−1} ) ]
        − A φ_{sk+j} − ψ_{sk+j} − Σ_{i=1}^{j} ( A Y_k ξ_{k,i} + Y_k η_{k,i} − ΔY_k q'_{k,i−1} )
• Using standard rounding error results, this allows us to obtain an upper bound on ‖b − A x_{sk+j} − r_{sk+j}‖.
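As a concrete, hedged rendering of the replacement test and group update from the CG pseudo-code above: the running estimates d are assumed to be maintained from a computable bound of the kind just described, and all names are illustrative.

```python
import numpy as np

EPS = np.finfo(float).eps

def maybe_replace(A, b, z, x, r, d, d_prev, norm_r_prev, d_init, normA, Nbar):
    """Replacement test + group update of van der Vorst & Ye (1999).
    d, d_prev: running estimates of ||b - A x - r|| in the current and
    previous iterations, maintained from a computable bound as in the
    slides.  z accumulates the group-updated solution, so the
    approximate solution at any point is z + x.  Nbar stands for the
    matrix-dependent constant N in the slides' d_init formula
    (an assumption in this sketch)."""
    if (d_prev <= EPS * norm_r_prev
            and d > EPS * np.linalg.norm(r)
            and d > 1.1 * d_init):
        z = z + x                       # group update
        x = np.zeros_like(x)
        r = b - A @ z                   # reset to the true residual
        d = EPS * (Nbar * normA * np.linalg.norm(z) + np.linalg.norm(r))
        d_init = d
    return z, x, r, d, d_init
```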
A computable bound
• We extend van der Vorst and Ye's residual replacement strategy to CA-CG.
• Making use of the bound for ‖b − A x_{sk+j} − r_{sk+j}‖ in CA-CG, update the error estimate d_{sk+j} by:
    d_{sk+j} ≡ d_{sk+j−1}
      + ε [ (4 + N') ‖A‖ ( ‖|Y_k| |x'_{k,j}|‖ + ‖x_{sk+j}‖ )
            + (2 + 2N') ‖|Y_k| |B_k| |x'_{k,j}|‖ + ‖|Y_k| |r'_{k,j}|‖ ]
      + ε [ ‖|Y_k| |x'_{k,j}|‖ + N' ‖|Y_k| |r'_{k,j}|‖ ]  if j = s, else + 0,
where N' = max(N, 2s + 1).
• Extra computation: all lower order terms – O(s) flops per s iterations with ≤ 1 reduction per s iterations, plus O(n + s) flops per s iterations with no communication; ‖A‖ is estimated only once. Estimated cost increase: at most a factor of 2.

Residual replacement for CA-CG
• Use the same replacement condition as van der Vorst and Ye (1999):
    d_{sk+j−1} ≤ ε‖r_{sk+j−1}‖ and d_{sk+j} > ε‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init.
• (It is possible we could develop a better replacement condition based on the perturbation in the finite precision CA-CG recurrence.)
Pseudo-code for residual replacement with group update for CA-CG:
if d_{sk+j−1} ≤ ε‖r_{sk+j−1}‖ and d_{sk+j} > ε‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init
    z = z + Y_k x'_{k,j} + x_{sk}
    x_{sk+j} = 0
    r_{sk+j} = b − A z
    d_init = d_{sk+j} = ε( (1 + 2N') ‖A‖ ‖z‖ + ‖r_{sk+j}‖ )
    p_{sk+j} = Y_k p'_{k,j}
    break from inner loop and begin new outer loop
end

Model problem results
Residual replacement indices and total number of reductions:
                   Replacement indices      Total reductions
CG                 355                      669
s = 4:
  CA-CG Mono.      354                      203
  CA-CG Newt.      353                      196
  CA-CG Cheb.      365                      197
s = 8:
  CA-CG Mono.      224, 334, 401, 517       157
  CA-CG Newt.      340                      102
  CA-CG Cheb.      353                      99
s = 12:
  CA-CG Mono.      135, 2119                557
  CA-CG Newt.      326                      68
  CA-CG Cheb.      346                      71
• The number of replacements is small compared to the total number of iterations → residual replacement doesn't significantly affect the convergence rate.
• The convergence rate depends on the basis.
• In addition to attainable accuracy, communication savings are incredibly important in practical implementations.
• We can have both speed and accuracy!

consph8, FEM/Spheres (from UFSMC)
n = 8.3·10^4, nnz = 6.0·10^6, κ(A) = 9.7·10^3, ‖A‖ = 9.7
[Plots: convergence before and after residual replacement.]
Number of replacements, s = 8: classical 2, monomial 12, Newton 2, Chebyshev 2.
Number of replacements, s = 12: classical 2, monomial 0, Newton 4, Chebyshev 3.

xenon1, materials (from UFSMC)
n = 4.9·10^4, nnz = 1.2·10^6, κ(A) = 3.3·10^4, ‖A‖ = 3.2
[Plots: convergence before and after residual replacement.]
Number of replacements, s = 4: classical 2, monomial 1, Newton 1, Chebyshev 1.
Number of replacements, s = 8: classical 2, monomial 5, Newton 2, Chebyshev 1.

Preconditioning for CA-KSMs
• Tradeoff: preconditioning speeds up convergence, but increases the time per iteration due to communication!
• For each specific application, we must evaluate the tradeoff between preconditioner quality and sparsity of the system.
• Good news: many preconditioners allow a communication-avoiding approach:
  – Block Jacobi – block diagonals.
  – Sparse approximate inverse (SAI) – same sparsity as A; recent work for CA-BICGSTAB by Mehri (2014).
  – Polynomial preconditioning (Saad, 1985).
  – HSS preconditioning for banded matrices (Hoemmen, 2010), (Knight, C., Demmel, 2014).
  – CA-ILU(0) – recent work by Moufawad and Grigori (2013).
  – Deflation for CA-CG (C., Knight, Demmel, 2014), based on the deflated CG of (Saad et al., 2000); for CA-GMRES (Yamazaki et al., 2014).

Deflated CA-CG, model problem
[Plots: deflated CA-CG with the monomial basis for various s values.]
Matrix: 2D Laplacian(512), n = 262,144. Right-hand side set such that the true solution has entries x_i = 1/√n. Deflated CG algorithm (DCG) from (Saad et al., 2000).

Eigenvalue problems: CA-Lanczos
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector.
[Plots: Ritz value convergence for CA-Lanczos with the monomial basis.]
As s increases, both the convergence rate and the accuracy to which we can find approximate eigenvalues within n iterations decrease.
[Plots: Ritz value convergence for CA-Lanczos with better-conditioned bases.]
This is improved by better-conditioned bases, but we still see decreased convergence rate and accuracy as s grows.
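For reference in the analysis that follows, a minimal numpy sketch of the classical Lanczos recurrence being analyzed (no reorthogonalization; names are illustrative, and a breakdown check for β_{i+1} = 0 is omitted).

```python
import numpy as np

def lanczos(A, v, m):
    """Classical Lanczos without reorthogonalization.  Returns the
    tridiagonal entries (alpha, beta) and the basis V.  In finite
    precision, A V_m = V_m T_m + beta_{m+1} v_{m+1} e_m^T holds only
    up to the perturbation term dV_m analyzed below."""
    n = len(v)
    V = np.zeros((n, m + 1))
    alpha = np.zeros(m)
    beta = np.zeros(m + 1)              # beta[0] unused
    V[:, 0] = v / np.linalg.norm(v)
    for i in range(m):
        w = A @ V[:, i]
        if i > 0:
            w -= beta[i] * V[:, i - 1]
        alpha[i] = V[:, i] @ w
        w -= alpha[i] * V[:, i]
        beta[i + 1] = np.linalg.norm(w)
        V[:, i + 1] = w / beta[i + 1]
    return alpha, beta, V
```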
Paige's Lanczos convergence analysis
The finite precision Lanczos process satisfies
    A V_m = V_m T_m + β_{m+1} v_{m+1} e_m^T + δV_m,
with V_m = [v_1, …, v_m], δV_m = [δv_1, …, δv_m], and T_m symmetric tridiagonal with diagonal entries α_1, …, α_m and off-diagonal entries β_2, …, β_m.
Classic Lanczos rounding error result of Paige (1976): for i ∈ {1, …, m},
    ‖δv_i‖_2 ≤ ε_1 σ
    |β_{i+1} v_i^T v_{i+1}| ≤ 2 ε_0 σ
    |v_{i+1}^T v_{i+1} − 1| ≤ ε_0/2
    |β_{i+1}^2 + α_i^2 + β_i^2 − ‖A v_i‖_2^2| ≤ 4i (3ε_0 + ε_1) σ^2
where σ ≡ ‖A‖_2, θσ ≡ ‖|A|‖_2, ε_0 ≡ 2ε(n + 4) = O(εn), and ε_1 ≡ 2ε(Nθ + 7) = O(εNθ).
→ These results form the basis for Paige's influential results in (Paige, 1980).

CA-Lanczos convergence analysis
Let
    Γ ≡ max_{ℓ≤k} ‖Y_ℓ^+‖_2 · ‖|Y_ℓ|‖_2 ≤ (2s+1) · max_{ℓ≤k} κ(Y_ℓ).
For CA-Lanczos, the same bounds hold for i ∈ {1, …, m = sk + j}:
    ‖δv_i‖_2 ≤ ε_1 σ
    |β_{i+1} v_i^T v_{i+1}| ≤ 2 ε_0 σ
    |v_{i+1}^T v_{i+1} − 1| ≤ ε_0/2
    |β_{i+1}^2 + α_i^2 + β_i^2 − ‖A v_i‖_2^2| ≤ 4i (3ε_0 + ε_1) σ^2
but now with
    ε_0 ≡ 2ε(n + 11s + 15) Γ^2 = O(ε n Γ^2)   (vs. O(εn) for Lanczos),
    ε_1 ≡ 2ε((N + 2s + 5)θ + (4s + 9)τ + 10s + 16) Γ = O(ε N θ Γ)   (vs. O(εNθ) for Lanczos),
where σ ≡ ‖A‖_2, θσ ≡ ‖|A|‖_2, and τσ ≡ max_{ℓ≤k} ‖|B_ℓ|‖_2.

The amplification term
• Our definition of the amplification term above was
    Γ ≡ max_{ℓ≤k} ‖Y_ℓ^+‖_2 · ‖|Y_ℓ|‖_2 ≤ (2s+1) · max_{ℓ≤k} κ(Y_ℓ),
where we want ‖|Y| |y'|‖_2 ≤ Γ ‖Y y'‖_2 to hold for the computed basis Y and any coordinate vector y' in every iteration.
• A better, more descriptive estimate for Γ, updated as the iteration proceeds, is possible with tighter bounds; this requires some light bookkeeping.
• Example: for the bounds on |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1|, we can use the definition
    Γ_{k,j} = max_{x ∈ {ŵ'_{k,j}, û'_{k,j}, v'_{k,j}, v'_{k,j−1}}} ‖|Y_k| |x|‖_2 / ‖Y_k x‖_2.

Local rounding errors
Problem: 2D Poisson, n = 256, ‖A‖ = 7.93, with a random starting vector.
[Plots: measured values of |β_{i+1} v_i^T v_{i+1}| and |v_{i+1}^T v_{i+1} − 1| vs. their upper bounds, for classical Lanczos (bounds of Paige, 1976) and for CA-Lanczos with the monomial, Newton, and Chebyshev bases (s = 8), together with the corresponding Γ values.]

Paige's results for classical Lanczos
Using the bounds on local rounding errors in Lanczos, Paige showed that:
1. The computed Ritz values always lie between the extreme eigenvalues of A to within a small multiple of machine precision.
2. At least one small interval containing an eigenvalue of A is found by the nth iteration.
3. The algorithm behaves numerically like Lanczos with full reorthogonalization until a very close eigenvalue approximation is found.
4. The loss of orthogonality among basis vectors follows a rigorous pattern and implies that some Ritz values have converged.
Do the same statements hold for CA-Lanczos?

Results for CA-Lanczos
• The answer is YES! …but only if:
  – Y_ℓ is numerically full rank for 0 ≤ ℓ ≤ k, and
  – ε_0 ≡ 2ε(n + 11s + 15) Γ^2 ≤ 1/12, i.e., Γ^2 ≤ (24 ε (n + 11s + 15))^{−1}.
• Otherwise, e.g., we can lose orthogonality due to computation with a rank-deficient basis.
• How can we use this bound on Γ to design a better algorithm?
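One computable ingredient for such a design: Γ can be estimated from quantities already on hand. A hedged sketch using the tighter per-iteration definition Γ_{k,j} above; function names are illustrative.

```python
import numpy as np

def gamma_estimate(Y, coord_vectors):
    """Tighter per-iteration amplification estimate
    Gamma_{k,j} = max_x || |Y| |x| ||_2 / || Y x ||_2
    over the current coordinate vectors x (cf. the definition above)."""
    absY = np.abs(Y)
    return max(np.linalg.norm(absY @ np.abs(x)) / np.linalg.norm(Y @ x)
               for x in coord_vectors)

def basis_ok(Y, coord_vectors, n, s, eps=np.finfo(float).eps):
    """Check the sufficient condition Gamma^2 <= (24 eps (n+11s+15))^-1
    under which the Paige-style convergence results carry over."""
    g = gamma_estimate(Y, coord_vectors)
    return g ** 2 <= 1.0 / (24.0 * eps * (n + 11 * s + 15))
```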
Ideas based on the CA-Lanczos analysis
• Explicit basis orthogonalization:
  – Compute a TSQR of Y_k in the outer loop and use the Q factor as the basis.
  – This gives Γ = 1 for the cost of only one reduction.
• Dynamically changing the basis size:
  – Run an incremental condition estimate while computing the s-step basis; stop when Γ^2 ≤ (24 ε (n + 11s' + 15))^{−1} is reached.
  – Use this basis of size s' ≤ s in this outer loop.
• Mixed precision:
  – For inner products, use a precision ε^(0) such that ε^(0) ≤ (24 Γ^2 (n + 11s + 15))^{−1}.
  – For SpMVs, use a precision ε^(1) such that ε^(1) ≤ (4 Γ ((N + 2s + 5)θ + (4s + 9)τ + 10s + 16))^{−1}.

Summary
• For high performance iterative methods, both the time per iteration and the number of iterations required are important.
• CA-KSMs offer asymptotic reductions in time per iteration, but can have the effect of increasing the number of iterations required (or causing instability).
• For CA-KSMs to be practical, we must better understand their finite precision behavior (theoretically and empirically) and develop ways to improve the methods.
• Ongoing efforts: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA preconditioners, and other new techniques.

Thank you!
contact: erin@cs.berkeley.edu
http://www.cs.berkeley.edu/~ecc2z/