Communication-Avoiding Krylov Subspace Methods in Finite Precision
Erin Carson and James Demmel, U.C. Berkeley
Bay Area Scientific Computing Day, December 11, 2013

What is Communication?
• Algorithms have two costs: communication and computation
  – Communication: moving data between levels of the memory hierarchy (sequential) or between processors (parallel)
• On modern computers, communication is expensive and computation is cheap
  – Flop time << 1/bandwidth << latency
  – The communication bottleneck is a barrier to achieving scalability
  – Communication is also expensive in terms of energy cost
• To achieve exascale, we must redesign algorithms to avoid communication

Krylov Subspace Iterative Methods
• Methods for solving linear systems, eigenvalue problems, singular value problems, least squares problems, etc.
  – Best choice when A is very large and very sparse, is stored implicitly, or when only an approximate answer is needed
• Based on projection onto the expanding Krylov subspace K_m(A, v) = span{v, Av, ..., A^{m−1} v}
• Used in numerous application areas
  – Image/contour detection, modal analysis, solving PDEs (combustion, climate, astrophysics), etc.
• Named among the "Top 10" algorithmic ideas of the 20th century
  – Along with the FFT, fast multipole, quicksort, Monte Carlo, the simplex method, etc. (SIAM News, Vol. 33, No. 4, 2000)

Communication Bottleneck in KSMs
To carry out the projection process, in each iteration we
1. Add a dimension to the Krylov subspace K_m
   – Requires a sparse matrix-vector multiplication (SpMV)
     • Parallel: communicate vector entries with neighbors
     • Sequential: read A (and n-vectors) from slow memory
2. Orthogonalize
   – Vector operations (inner products, axpys)
     • Parallel: global reduction
     • Sequential: multiple reads/writes to slow memory
Dependencies between these communication-bound kernels (SpMV and orthogonalization) in each iteration limit performance!

Example: Classical Conjugate Gradient (CG)
Given: initial approximation x_0 for solving Ax = b
Let p_0 = r_0 = b − A x_0
for i = 0, 1, ..., until convergence do
    α_i = (r_i^T r_i) / (p_i^T A p_i)
    x_{i+1} = x_i + α_i p_i
    r_{i+1} = r_i − α_i A p_i
    β_{i+1} = (r_{i+1}^T r_{i+1}) / (r_i^T r_i)
    p_{i+1} = r_{i+1} + β_{i+1} p_i
end for
The SpMVs and dot products require communication in each iteration.
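To make the per-iteration communication pattern concrete, here is a minimal NumPy sketch of the classical CG loop above (our own illustration, not part of the original slides; the function name and tolerance are arbitrary choices). The comments mark where the SpMV and the dot products occur, i.e., the kernels that require neighbor communication or global reductions in a parallel setting.

    import numpy as np

    def cg(A, b, x0, tol=1e-10, maxiter=1000):
        """Classical conjugate gradient for SPD A (illustrative sketch)."""
        x = x0.copy()
        r = b - A @ x              # initial residual (one SpMV)
        p = r.copy()
        rr = r @ r                 # dot product -> global reduction in parallel
        for _ in range(maxiter):
            Ap = A @ p             # SpMV: neighbor communication / reads A from slow memory
            alpha = rr / (p @ Ap)  # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r         # dot product -> global reduction
            if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

Every iteration contains one SpMV and two global reductions that cannot be overlapped, which is exactly the dependency chain the communication-avoiding reformulation below breaks.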
Communication-Avoiding KSMs
• CA-KSMs, based on s-step formulations, reduce communication by a factor of O(s)
  – First instance: s-step CG by Van Rosendale (1983); for a thorough overview, see the thesis of Hoemmen (2010)
• Main idea: unroll the iteration loop by a factor of s. By induction,
      x_{sk+j} − x_{sk},  r_{sk+j},  p_{sk+j}  ∈  K_{s+1}(A, p_{sk}) + K_s(A, r_{sk})   for j ∈ {0, ..., s}
  1. Compute a basis V_k for K_{s+1}(A, p_{sk}) + K_s(A, r_{sk}) upfront
     • If A^s is well partitioned, this requires reading A / communicating vectors only once (via the matrix powers kernel)
  2. Orthogonalize: encode the dot products between basis vectors with the Gram matrix G_k = V_k^T V_k (or compute a Tall-Skinny QR)
     • Communication cost of one global reduction
  3. Perform s iterations of vector updates
     • Using V_k and G_k, this requires no communication
     • Represent the length-n vectors by their coordinates in V_k:
           x_{sk+j} − x_{sk} = V_k x'_j,   r_{sk+j} = V_k r'_j,   p_{sk+j} = V_k p'_j
• Steps 1 and 2 form the communication step of each outer loop; step 3, the inner loop, consists of computation steps with no communication.

Example: CA-Conjugate Gradient (CA-CG)
Given: initial approximation x_0 for solving Ax = b
Let p_0 = r_0 = b − A x_0
for k = 0, 1, ..., until convergence do
    Compute V_k such that A V_k = V_k B_k (via the CA matrix powers kernel), and compute G_k = V_k^T V_k (one global reduction)
    Let x'_0 = 0_{2s+1},  r'_0 = e_{s+2},  p'_0 = e_1
    for j = 0, ..., s − 1 do
        α_{sk+j} = (r'_j^T G_k r'_j) / (p'_j^T G_k B_k p'_j)
        x'_{j+1} = x'_j + α_{sk+j} p'_j
        r'_{j+1} = r'_j − α_{sk+j} B_k p'_j
        β_{sk+j+1} = (r'_{j+1}^T G_k r'_{j+1}) / (r'_j^T G_k r'_j)
        p'_{j+1} = r'_{j+1} + β_{sk+j+1} p'_j
    end for
    Compute x_{sk+s} = V_k x'_s + x_{sk},  r_{sk+s} = V_k r'_s,  p_{sk+s} = V_k p'_s
end for
The local computations within the inner loop require no communication.
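The following NumPy sketch implements one outer loop of CA-CG with the monomial basis, following the pseudocode above. It is our own illustration of the data flow (one basis computation, one Gram matrix, then s communication-free coordinate updates); it includes no numerical safeguards, and the function name is an arbitrary choice.

    import numpy as np
    from scipy.linalg import block_diag

    def ca_cg_outer_loop(A, x, r, p, s):
        """One outer loop (s inner steps) of CA-CG with the monomial basis.
        Returns the updated (x, r, p). Illustrative sketch only."""
        n = len(r)
        # 1. Matrix powers: monomial bases for K_{s+1}(A, p) and K_s(A, r).
        P = np.empty((n, s + 1)); P[:, 0] = p
        for i in range(s):
            P[:, i + 1] = A @ P[:, i]
        R = np.empty((n, s)); R[:, 0] = r
        for i in range(s - 1):
            R[:, i + 1] = A @ R[:, i]
        V = np.hstack([P, R])                        # n x (2s+1) basis
        # Basis-change matrix B: A @ V[:, j] == V @ B[:, j] for the columns used below.
        B = block_diag(np.eye(s + 1, k=-1), np.eye(s, k=-1))
        # 2. One global reduction: the Gram matrix.
        G = V.T @ V
        # 3. s communication-free updates on length-(2s+1) coordinate vectors.
        xc = np.zeros(2 * s + 1)
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0    # coordinates of r in V (e_{s+2})
        pc = np.zeros(2 * s + 1); pc[0] = 1.0        # coordinates of p in V (e_1)
        for _ in range(s):
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        # Recover the CG vectors for the next outer loop.
        return x + V @ xc, V @ rc, V @ pc

A driver would call this repeatedly, checking the norm of the returned residual after each outer loop; in exact arithmetic the iterates coincide with those of classical CG.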
Behavior of CA-KSMs in Finite Precision
• Runtime = (time per iteration) × (number of iterations until convergence)
• The communication-avoiding approach can reduce the time per iteration by O(s)
• But how do the convergence rates compare to those of the classical methods?
• CA-KSMs are mathematically equivalent to classical KSMs, but can behave much differently in finite precision!
  – Roundoff errors have two discernible effects:
    1. Delay of convergence → if the number of iterations increases by more than the time per iteration decreases due to CA techniques, no speedup can be expected
    2. Decrease in attainable accuracy due to the deviation of the true residual, b − A x_i, and the updated residual, r_i → some problems that the classical KSM can solve cannot be solved with the CA variant
  – Roundoff error bounds generally grow with increasing s
• These are obstacles to the practical use of CA-KSMs

[Figure: CA-CG convergence for s = 4, 8, and 16, comparing the true and updated residuals of CG and of CA-CG with the monomial basis. Roundoff causes slower convergence and loss of accuracy; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson (5-point stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10^4.]

Improving Convergence Rate and Accuracy
• We will discuss two techniques for alleviating the effects of roundoff:
  1. Improve convergence by using better-conditioned polynomial bases
  2. Use a residual replacement strategy to improve attainable accuracy

Choosing a Polynomial Basis
• Recall: in each outer loop of CA-CG, we compute bases for some Krylov subspace K_m(A, v) = span{v, Av, ..., A^{m−1} v}
• Simple loop unrolling leads to the choice of monomials v, Av, ..., A^s v
  – The monomial basis condition number can grow exponentially with s; expect (near) linear dependence of the basis vectors even for modest s values
  – It was recognized early on that this negatively affects convergence (Chronopoulos & Swanson, 1995)
• Improve the basis condition number to improve convergence: use different polynomials to compute a basis for the same subspace
• Two choices based on spectral information usually lead to well-conditioned bases (a small numerical comparison follows the next two slides):
  – Newton polynomials
  – Chebyshev polynomials

Newton Basis
• The Newton basis can be written
      [v, (A − θ_1 I) v, (A − θ_2 I)(A − θ_1 I) v, ..., (A − θ_s I) ⋯ (A − θ_1 I) v],
  where {θ_1, ..., θ_s} are chosen to be approximate eigenvalues of A, ordered according to the Leja ordering:
      θ_1 = argmax_{z ∈ L_0} |z|,   θ_{j+1} = argmax_{z ∈ L_j} ∏_{i=1}^{j} |z − θ_i|,
  where L_j denotes the set of candidate values remaining after θ_1, ..., θ_j have been chosen
• In practice: recover Ritz (Petrov) values from the first few iterations, and iteratively refine the eigenvalue estimates to improve the basis
• Used by many to improve s-step variants: e.g., Bai, Hu, and Reichel (1991), Erhel (1995), Hoemmen (2010)

Chebyshev Basis
• Chebyshev polynomials are frequently used because of their minimax property:
  – Over all real polynomials of degree s on an interval I, the Chebyshev polynomial of degree s minimizes the maximum absolute value on I
• Given an ellipse enclosing the spectrum of A with foci at d ± c, we can generate the scaled and shifted Chebyshev polynomials as
      τ̃_j(z) = τ_j((d − z)/c) / τ_j(d/c),
  where {τ_j}_{j≥0} are the Chebyshev polynomials of the first kind
• In practice, we can estimate the parameters d and c from Ritz values recovered from the first few iterations
• Used by many to improve s-step variants: e.g., de Sturler (1991), Joubert and Carey (1992), de Sturler and van der Vorst (2005)
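To illustrate why the basis choice matters, here is a small NumPy comparison (our own illustration, not from the slides). It builds monomial, Newton, and Chebyshev s-step bases for a diagonal SPD test matrix and prints their condition numbers. As assumptions: the Newton shifts are s equally spaced "eigenvalue estimates" (the slides use Ritz values), the Chebyshev parameters are d = (λ_max + λ_min)/2 and c = (λ_max − λ_min)/2, and each basis has its columns normalized to unit length (instead of applying the 1/τ_j(d/c) scaling) so the comparison is insensitive to column scaling.

    import numpy as np

    def leja_order(vals):
        """Greedy Leja ordering of candidate shift values."""
        vals = list(vals)
        out = [vals.pop(int(np.argmax(np.abs(vals))))]
        while vals:
            prods = [np.prod([abs(z - t) for t in out]) for z in vals]
            out.append(vals.pop(int(np.argmax(prods))))
        return np.array(out)

    def basis_cond(cols):
        """Condition number after normalizing each column."""
        return np.linalg.cond(cols / np.linalg.norm(cols, axis=0))

    n, s = 200, 8
    lam = np.linspace(1.0, 1.0e4, n)       # spectrum of the diagonal SPD test matrix
    A = np.diag(lam)
    v = np.ones(n) / np.sqrt(n)

    # Monomial basis: [v, Av, ..., A^s v]
    V_mono = np.empty((n, s + 1)); V_mono[:, 0] = v
    for j in range(s):
        V_mono[:, j + 1] = A @ V_mono[:, j]

    # Newton basis with Leja-ordered shifts
    theta = leja_order(np.linspace(lam.min(), lam.max(), s))
    V_newt = np.empty((n, s + 1)); V_newt[:, 0] = v
    for j in range(s):
        V_newt[:, j + 1] = A @ V_newt[:, j] - theta[j] * V_newt[:, j]

    # Chebyshev basis, scaled and shifted to the interval [lam.min(), lam.max()]
    d = (lam.max() + lam.min()) / 2.0
    c = (lam.max() - lam.min()) / 2.0
    V_cheb = np.empty((n, s + 1)); V_cheb[:, 0] = v
    V_cheb[:, 1] = (d * v - A @ v) / c
    for j in range(1, s):
        V_cheb[:, j + 1] = (2.0 / c) * (d * V_cheb[:, j] - A @ V_cheb[:, j]) - V_cheb[:, j - 1]

    for name, V in [("monomial", V_mono), ("Newton", V_newt), ("Chebyshev", V_cheb)]:
        print(f"{name:9s} basis condition number: {basis_cond(V):.2e}")

For a spectrum spanning four orders of magnitude, the monomial basis is dramatically more ill-conditioned than the Newton and Chebyshev bases, which is the mechanism behind the convergence differences shown next.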
[Figure: CA-CG convergence for s = 4, 8, and 16 with the monomial basis (true and updated residuals for CG and CA-CG), repeated for comparison. Model problem: 2D Poisson (5-point stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10^4.]

[Figure: CA-CG convergence for s = 4, 8, and 16, now also showing the Newton and Chebyshev bases (true and updated residuals for CG and for CA-CG with monomial, Newton, and Chebyshev bases). A better basis choice allows higher s values, but a loss of accuracy can still be seen. Same model problem.]

Explicit Orthogonalization
• Use communication-avoiding TSQR to orthogonalize the matrix powers output
• The matrix powers kernel outputs a basis V_k such that A V_k = V_k B_k
• Use TSQR to compute V_k = Q_k R_k (one reduction)
• Use the orthogonal basis Q_k instead of V_k: A Q_k = Q_k (R_k B_k R_k^{−1})
  – The Gram matrix computation is unnecessary, since Q_k^T Q_k = I
• Example: small model problem, 2D Poisson with n = 400, CA-CG with the monomial basis and s = 8
[Figure: monomial basis, s = 8. Explicit orthogonalization can improve the convergence rate of the updated residual, but can cause a loss of accuracy.]

Accuracy in Finite Precision
• In classical CG, the iterates are updated by
      x_i = x_{i−1} + q_{i−1}   and   r_i = r_{i−1} − A q_{i−1}   (with q_{i−1} = α_{i−1} p_{i−1})
• The formulas for x_i and r_i do not depend on each other, so rounding errors cause the true residual, b − A x_i, and the updated residual, r_i, to deviate
• The size of the deviation of the residuals, ‖b − A x_i − r_i‖, determines the maximum attainable accuracy
  – Write the true residual as b − A x_i = r_i + (b − A x_i − r_i)
  – Then the size of the true residual is bounded by ‖b − A x_i‖ ≤ ‖r_i‖ + ‖b − A x_i − r_i‖
  – When ‖r_i‖ ≫ ‖b − A x_i − r_i‖, r_i and b − A x_i have similar magnitude
  – But when r_i → 0, b − A x_i depends on ‖b − A x_i − r_i‖
• The first bound on the maximum attainable accuracy of classical KSMs was given by Greenbaum (1992)
• We have applied a similar analysis to upper bound the maximum attainable accuracy of CA-KSMs in finite precision

Sources of Roundoff in CA-KSMs
Computing the s-step Krylov basis:
      A V_k = V_k B_k + ΔV_k                                   (error in computing the s-step basis)
Updating the coordinate vectors in the inner loop:
      x'_{k,j} = x'_{k,j−1} + q'_{k,j−1} + ξ_{k,j}
      r'_{k,j} = r'_{k,j−1} − B_k q'_{k,j−1} + η_{k,j}          (error in updating the coefficient vectors)
Recovering the CG vectors for use in the next outer loop:
      x_{sk+j} = V_k x'_{k,j} + x_{sk} + φ_{sk+j}
      r_{sk+j} = V_k r'_{k,j} + ψ_{sk+j}                        (error in the basis change)
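Before these error terms are combined into a bound (next slide), the following self-contained NumPy run (our own demonstration, not from the slides) shows the underlying phenomenon in classical CG on the diag(linspace(1, 1e6, 100)) test problem used later in the deck: the updated residual r_i keeps shrinking, while the true residual b − A x_i can stagnate once it reaches the level of the deviation ‖b − A x_i − r_i‖.

    import numpy as np

    n = 100
    A = np.diag(np.linspace(1.0, 1.0e6, n))        # ill-conditioned SPD test matrix
    b = np.random.default_rng(0).standard_normal(n)

    x = np.zeros(n)
    r = b - A @ x                                  # updated residual (recurrence)
    p = r.copy()
    rr = r @ r
    for i in range(1, 1001):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p          # x and r are updated by independent recurrences,
        r -= alpha * Ap         # so rounding errors make them drift apart
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
        if i % 200 == 0:
            print(f"iter {i:4d}:  ||b - A x|| = {np.linalg.norm(b - A @ x):.2e}   "
                  f"||r|| = {np.linalg.norm(r):.2e}   "
                  f"deviation = {np.linalg.norm(b - A @ x - r):.2e}")

The deviation column is exactly the quantity that the error terms above accumulate into for CA-CG, where it generally grows faster with s.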
Bound on Maximum Attainable Accuracy
• We can write the deviation of the true and updated residuals in terms of these errors:
      b − A x_{sk+j} − r_{sk+j} = (b − A x_0 − r_0)
          − Σ_{ℓ=0}^{k−1} [ A φ_{sℓ+s} + ψ_{sℓ+s} + ΔV_ℓ x'_{ℓ,s} + V_ℓ Σ_{i=1}^{s} (B_ℓ ξ_{ℓ,i} + η_{ℓ,i}) ]
          − [ A φ_{sk+j} + ψ_{sk+j} + ΔV_k x'_{k,j} + V_k Σ_{i=1}^{j} (B_k ξ_{k,i} + η_{k,i}) ],
  i.e., the initial deviation plus the basis, coordinate-update, and basis-change errors accumulated over all completed outer loops and the current one
• This can easily be translated into a computable upper bound
• Next: using this bound to improve accuracy via a residual replacement strategy

Residual Replacement Strategy
• Based on Van der Vorst and Ye (1999) for classical KSMs
• Idea: replace the updated residual r_m with the true residual b − A x_m at certain iterations, to reduce the size of their deviation (thus improving the attainable accuracy)
  – Result: when the updated residual converges to level O(ε)‖A‖‖x‖, a small number of replacement steps can reduce the true residual to level O(ε)‖A‖‖x‖
• We devise an analogous strategy for CA-CG and CA-BICG based on the derived bound on the deviation of the residuals
  – Computing estimates for the quantities in the bound has negligible cost → the residual replacement strategy does not asymptotically increase communication or computation!
  – To appear in SIMAX

[Figure: CA-CG convergence for s = 4, 8, and 16 with the monomial, Newton, and Chebyshev bases (repeated), followed by the same tests with residual replacement (true and updated residuals for CG-RR and for CA-CG-RR with monomial, Newton, and Chebyshev bases). The maximum number of replacement steps (extra reductions) for any test was 8. Residual replacement can improve accuracy by orders of magnitude for negligible cost. Model problem: 2D Poisson (5-point stencil), n = 262K, nnz = 1.3M, cond(A) ≈ 10^4.]

Ongoing Work: Using Extended Precision
• The precision we choose affects both stability and convergence
  – Errors in G_k, B_k, α_j, and β_j affect the convergence rate (the Lanczos process)
  – Errors in computing r_j and x_j affect accuracy (the deviation of the residuals)
[Figure: true residual norm vs. iteration for CA-CG (s = 2, monomial basis) computed with 10, 15, and 20 decimal places of precision. Example: A = diag(linspace(1, 1e6, 100)), random right-hand side.]

Summary and Open Problems
• For high-performance iterative methods, both the time per iteration and the number of iterations required are important
• CA-KSMs offer asymptotic reductions in the time per iteration, but can have the effect of increasing the number of iterations required (or causing instability)
• We discussed two ways of reducing roundoff effects for linear system solvers: improving basis conditioning and residual replacement techniques
• For CA-KSMs to be practical, we must better understand their finite precision behavior (theoretically and empirically) and develop ways to improve the methods
• Ongoing efforts: finite precision convergence analysis for CA-KSMs, selective use of extended precision, development of CA preconditioners, reorthogonalization for CA eigenvalue solvers, etc.

Backup Slides

Related Work: s-step methods
Authors                            | KSM                    | Basis     | Precond?   | Mtx Pwrs? | TSQR?
Van Rosendale, 1983                | CG                     | Monomial  | Polynomial | No        | No
Leland, 1989                       | CG                     | Monomial  | Polynomial | No        | No
Walker, 1988                       | GMRES                  | Monomial  | None       | No        | No
Chronopoulos and Gear, 1989        | CG                     | Monomial  | None       | No        | No
Chronopoulos and Kim, 1990         | Orthomin, GMRES        | Monomial  | None       | No        | No
Chronopoulos, 1991                 | MINRES                 | Monomial  | None       | No        | No
Kim and Chronopoulos, 1991         | Symm. Lanczos, Arnoldi | Monomial  | None       | No        | No
de Sturler, 1991                   | GMRES                  | Chebyshev | None       | No        | No

Related Work: s-step methods (continued)
Authors                            | KSM                    | Basis     | Precond?   | Mtx Pwrs? | TSQR?
Joubert and Carey, 1992            | GMRES                  | Chebyshev | No         | Yes*      | No
Chronopoulos and Kim, 1992         | Nonsymm. Lanczos       | Monomial  | No         | No        | No
Bai, Hu, and Reichel, 1991         | GMRES                  | Newton    | No         | No        | No
Erhel, 1995                        | GMRES                  | Newton    | No         | No        | No
de Sturler and van der Vorst, 2005 | GMRES                  | Chebyshev | General    | No        | No
Toledo, 1995                       | CG                     | Monomial  | Polynomial | Yes*      | No
Chronopoulos and Swanson, 1990     | CGR, Orthomin          | Monomial  | No         | No        | No
Chronopoulos and Kincaid, 2001     | Orthodir               | Monomial  | No         | No        | No

Current Work: Convergence Analysis of Finite Precision CA-KSMs
• Based on the analysis of Tong and Ye (2000) for classical BICG
• Goal: bound the residual in terms of the minimum polynomial of a perturbed matrix, multiplied by an amplification factor
• The finite precision error analysis gives a Lanczos-type recurrence relation of the form
      A Z_{k,j} = Z_{k,j} H_{k,j} − (1/α_{sk+j}) z_{k,j+1} e^T_{sk+j+1} + Δ_{k,j},
  where Z_{k,j} collects the (scaled) residual vectors computed through inner iteration j of outer loop k and Δ_{k,j} collects the finite precision error terms
• Using this, we can obtain a bound of the form
      ‖r_{sk+j+1}‖ ≤ (1 + K_{k,j}) · min_{p ∈ P_{sk+j+1}, p(0)=1} ‖p(A + δA_{k,j}) r_0‖,
  where K_{k,j} is an amplification factor, δA_{k,j} = −Δ_{k,j} Z^{+}_{k,j}, and the minimum is the (sk+j+1)-st residual norm of exact GMRES applied to A + δA_{k,j}

Current: Convergence Analysis of CA-BICG in Finite Precision
• How can we use this bound?
  – Given the matrix A and the basis parameters, heuristically estimate the maximum s such that the convergence rate is not significantly different from that of the classical KSM
    • Use known bounds on polynomial condition numbers
  – Experiments to give insight on where extended precision should be used
    • Which quantities dominate the bound, and which are most important to compute accurately?
  – Explain convergence despite a rank-deficient basis computed by the matrix powers kernel (i.e., the columns of V_k are linearly dependent)
    • Tong and Ye show how to extend the analysis to the case where Z_{k,j} has linearly dependent columns, which explains convergence despite a rank-deficient Krylov subspace

Residual Replacement Strategy (classical CG background)
In finite precision classical CG, in iteration n, the computed residual and search direction vectors satisfy
      r_n = r_{n−1} − α_{n−1} A p_{n−1} + η_n
      p_n = r_n + β_n p_{n−1} + τ_n
Then, in matrix form,
      A Z_n = Z_n T_n − (1/α'_n) (r_{n+1}/‖r_0‖) e^T_{n+1} + F_n,   with   Z_n = [ r_0/‖r_0‖, ..., r_n/‖r_n‖ ],
where T_n is invertible and tridiagonal, α'_n = e^T_n T_n^{−1} e_1, and F_n = [f_0, ..., f_n] with
      f_n = A τ_n/‖r_n‖ + (1/α_n) η_{n+1}/‖r_n‖ − (β_n/α_{n−1}) η_n/‖r_n‖.
Using the term f_n derived for the CG recurrence with residual replacement, we know we can replace the residual without affecting convergence as long as η_n is not too large relative to ‖r_n‖ and ‖r_{n−1}‖ (Van der Vorst and Ye, 1999).

Residual Replacement Strategy (CA-CG)
In iteration n = sk + j + 1 of finite precision CA-CG, where m was the last residual replacement step, we have
      r_{sk+j+1} = r_{sk+j} − A q_{sk+j} + η^{CA}_{sk+j+1},
where ‖η^{CA}_{sk+j+1}‖ is bounded by an O(ε) expression involving ‖A‖, the norms of the iterates updated since step m, and, for each outer loop ℓ since the last replacement, the basis norms ‖V_ℓ‖ and ‖B_ℓ‖ together with the coordinate-vector norms ‖x''_{ℓ,i}‖ and ‖r''_{ℓ,i}‖. Assuming we have an estimate for ‖A‖, we can compute all quantities in the bound without asymptotically increasing communication or computation.

Residual Replacement Algorithm
Based on the perturbed Lanczos recurrence (see, e.g., Van der Vorst and Ye, 1999), we should replace the residual when
      d_{sk+j−1} ≤ ε̂ ‖r_{sk+j−1}‖   and   d_{sk+j} > ε̂ ‖r_{sk+j}‖   and   d_{sk+j} > 1.1 d_init,
where d is the running estimate of the deviation bound and ε̂ is a tolerance parameter, chosen as ε̂ = √ε.
Pseudocode for residual replacement with group update:
      if d_{sk+j−1} ≤ ε̂ ‖r_{sk+j−1}‖ and d_{sk+j} > ε̂ ‖r_{sk+j}‖ and d_{sk+j} > 1.1 d_init
          z = z + V_k x'_{k,j} + x_{sk}
          x_{sk+j} = 0
          r_{sk+j} = b − A z
          d_init = d_{sk+j} = ε ( (1 + 2 N_A) ‖A‖ ‖z‖ + ‖r_{sk+j}‖ )
          break from inner loop
      end
where we set d_init = ε ( N_A ‖A‖ ‖x_0‖ + ‖r_0‖ ) at the start of the algorithm.
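The following Python function is our own illustrative sketch (not from the slides) of how the replacement test and group update above might be wired into a CA-CG implementation. All variable names are hypothetical; in particular, d_prev and d stand for the computable deviation-bound estimates the method accumulates each iteration (their update is not shown), and n_A and norm_A stand for assumed estimates related to the sparsity and norm of A.

    import numpy as np

    EPS = np.finfo(float).eps
    EPS_HAT = np.sqrt(EPS)     # replacement tolerance, as in Van der Vorst & Ye (1999)

    def residual_replacement_step(A, b, z, x_sk, V_k, x_coord,
                                  d_prev, d, d_init, r_prev_norm, r_norm,
                                  n_A, norm_A):
        """Check the replacement condition; if it fires, perform the group update.

        Returns (replaced, z, r, d, d_init); r is None when no replacement occurred.
        On replacement, the caller also resets its local solution to zero and
        breaks out of the inner loop.
        """
        if d_prev <= EPS_HAT * r_prev_norm and d > EPS_HAT * r_norm and d > 1.1 * d_init:
            z = z + V_k @ x_coord + x_sk      # group update of the approximate solution
            r = b - A @ z                     # replace updated residual with true residual
            d = d_init = EPS * ((1 + 2 * n_A) * norm_A * np.linalg.norm(z)
                                + np.linalg.norm(r))
            return True, z, r, d, d_init
        return False, z, None, d, d_init

Because the test uses only norms that are already available (or computable from the Gram matrix) plus an estimate of ‖A‖, adding it to the inner loop does not change the communication or computation costs asymptotically.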
The Matrix Powers Kernel
Avoids communication:
• In serial, by exploiting temporal locality:
  – Reading A
  – Reading V := [x, Ax, A^2 x, ..., A^s x]
• In parallel, by doing only one "expand" phase (instead of s)
• Tridiagonal example (the kernel also works for general graphs):
[Figure: dependencies for computing [v, Av, A^2 v, A^3 v] with a tridiagonal A, shown for the sequential and parallel cases; black = local elements, red = 1-level dependencies, green = 2-level dependencies, blue = 3-level dependencies.]
(A small sketch of the parallel dependency computation appears at the end of these slides.)

Current: Multigrid Bottom-Solve Application
• AMR applications that use the geometric multigrid method
  – Issues with coarsening require a switch from MG to BICGSTAB at the bottom of the V-cycle
• At scale, MPI_AllReduce() can dominate the bottom-solve runtime; the number of required BICGSTAB iterations grows quickly, causing a bottleneck
• Proposed: replace BICGSTAB with CA-BICGSTAB to improve performance
• Collaboration with Nick Knight and Sam Williams (FTG at LBL)
• Currently ported to BoxLib; next: CHOMBO (both LBL AMR projects)
• BoxLib applications: combustion (LMC), astrophysics (supernovae / black hole formation), cosmology (dark matter simulations), etc.

Current: Multigrid Bottom-Solve Application (continued)
• Challenges
  – Need only ≈ 2 orders of magnitude reduction in the residual (is the monomial basis sufficient?)
  – How do we handle breakdown conditions in CA-KSMs?
  – What optimizations are appropriate for a matrix-free method?
  – Some solves are hard (≈ 100 iterations), some are easy (≈ 5 iterations)
    • Use a "telescoping" s: start with s = 1 in the first outer iteration and increase it up to the maximum over subsequent outer iterations
    • The cost of the extra point-to-point communication is amortized for the hard problems
• Preliminary results:
  – Hopper (1K^3 problem, 24K cores): 3.5x speedup in the bottom-solve (1.3x total)
  – Edison (512^3 problem, 4K cores): 1.6x speedup in the bottom-solve (1.1x total)
• Proposed:
  – Demonstrate CA-KSM speedups in 2+ other HPC applications
  – Want applications that demonstrate successful use of the strategies for stability, convergence, and performance derived in this thesis
[Figure: LMC-2D mac_project solve.]
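As a companion to the tridiagonal matrix powers example above, here is a small Python sketch (our own illustration, with hypothetical names) that computes, for a 1D 3-point stencil partitioned into contiguous blocks of rows, which remote rows one processor must receive in order to take s steps of the matrix powers kernel with no further communication. Fetching all of these rows in a single "expand" phase replaces the s rounds of neighbor communication that s separate SpMVs would require.

    def matrix_powers_ghost_rows(n, block, s):
        """For a 1D 3-point (tridiagonal) stencil on rows {0, ..., n-1}, return the
        remote rows the owner of `block` (a contiguous range of row indices) needs
        to compute [v, Av, ..., A^s v] locally: level-i dependencies reach i rows
        past each end of the block."""
        lo, hi = block[0], block[-1]
        needed = set()
        for level in range(1, s + 1):
            needed.update(range(max(0, lo - level), lo))              # left ghost rows
            needed.update(range(hi + 1, min(n, hi + level + 1)))      # right ghost rows
        return sorted(needed)

    # Example: 16 rows split into blocks of 4; the owner of rows 4..7 needs 3 rows
    # on each side to take s = 3 steps without further communication.
    print(matrix_powers_ghost_rows(16, range(4, 8), 3))   # -> [1, 2, 3, 8, 9, 10]

The same idea generalizes to arbitrary sparsity graphs by replacing "i rows past each end" with the set of vertices within graph distance i of the local partition, at the cost of redundant ghost-zone storage and computation that grows with s.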