Iterative Methods
March 30, 2015

1 Serial Iterative Methods
Iterative methods for solving a linear system Ax = b are used when methods such as Gaussian elimination require too much time or too much space. Methods such as Gaussian elimination, or LU decomposition followed by back substitution, that compute the exact solution after a finite number of steps (in the absence of roundoff) are called direct methods. In contrast to direct methods, iterative methods generally do not produce the exact answer after a finite number of steps, but instead decrease the norm of the residual ‖b − Ax‖, or some other measure of error, by some fraction after each step. Iteration ceases when the error is less than a user-supplied threshold. The final error depends on how many iterations one does, as well as on the properties of the method and the linear system. It also depends on the machine precision of the target architecture. For example, if you are computing using single-precision datatypes (float), it does not make much sense to set the tolerance lower than 10^{-7} (why?). Our overall goal is to develop methods that decrease the error by a large amount at each iteration and do as little work per iteration as possible.
Iterative methods are especially useful when the matrix A is sparse because, unlike direct methods, no fill occurs. Also, compared with direct methods, iterative methods are easier to parallelize. For extremely large-scale problems, an added advantage is that iterative methods only require the application of the matrix-vector product (matvec) and are therefore suitable for matrix-free implementations. We shall first quickly review Poisson's equation and its discretization using finite differences. We shall then review some popular iterative methods before looking at their parallel versions.
1.1 Poisson's Equation

1.1.1 Poisson's Equation in One Dimension
We begin with a one-dimensional version of Poisson’s equation,
-\frac{d^2 v(x)}{dx^2} = f(x), \qquad 0 < x < 1, \qquad (1)
where f(x) is a given function and v(x) is the unknown function that we want to compute. v(x) must also satisfy the boundary conditions¹ v(0) = v(1) = 0. We discretize the problem by trying to compute an approximate solution at N + 2 evenly spaced points x_i between 0 and 1: x_i = ih, where h = 1/(N + 1) and 0 ≤ i ≤ N + 1. We abbreviate v_i = v(x_i) and f_i = f(x_i).
¹ These are called Dirichlet boundary conditions. Other kinds of boundary conditions are also possible.
To convert the differential equation (1) into a linear equation for the unknowns v_1, …, v_N, we use finite differences to approximate
\frac{dv(x)}{dx}\bigg|_{x=(i-0.5)h} \approx \frac{v_i - v_{i-1}}{h},
\qquad
\frac{dv(x)}{dx}\bigg|_{x=(i+0.5)h} \approx \frac{v_{i+1} - v_i}{h}.
Subtracting these approximations and dividing by h yields the centered difference approximation

-\frac{d^2 v(x)}{dx^2}\bigg|_{x=x_i} = \frac{2v_i - v_{i-1} - v_{i+1}}{h^2} - \tau_i, \qquad (2)
where τ_i, the so-called truncation error, can be shown to be O(h^2 ‖d^4 v / dx^4‖_∞). We may now rewrite equation (1) at x = x_i as

-v_{i-1} + 2v_i - v_{i+1} = h^2 f_i + h^2 \tau_i,
where 0 < i < N + 1. Since the boundary conditions imply that v_0 = v_{N+1} = 0, we have N equations in N unknowns v_1, …, v_N:

A_N \cdot \begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix}
\equiv
\begin{bmatrix}
 2 & -1 &        &        \\
-1 &  2 & \ddots &        \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  2
\end{bmatrix}
\cdot
\begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix}
= h^2 \begin{bmatrix} f_1 \\ \vdots \\ f_N \end{bmatrix}
+ h^2 \begin{bmatrix} \tau_1 \\ \vdots \\ \tau_N \end{bmatrix} \qquad (3), (4)
or

A_N v = h^2 f + h^2 \bar{\tau}. \qquad (5)
To solve this equation, we will ignore τ̄, since it is small compared to f, to get

A_N v = h^2 f. \qquad (6)

This is equivalent to the archetypal system Ax = b, with x = v and b = h^2 f.
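As a quick sanity check of this discretization, the sketch below assembles the tridiagonal matrix A_N, solves system (6) for a manufactured right-hand side, and confirms the O(h^2) accuracy. The specific choices (N, the test function sin(πx)) are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Sketch: assemble A_N for the 1D Poisson problem and solve A_N v = h^2 f.
# Manufactured solution v(x) = sin(pi x) gives f(x) = -v''(x) = pi^2 sin(pi x).
N = 100
h = 1.0 / (N + 1)
x = np.linspace(h, 1.0 - h, N)            # interior points x_1, ..., x_N

f = np.pi**2 * np.sin(np.pi * x)

# Tridiagonal A_N: 2 on the diagonal, -1 on the sub- and super-diagonals
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

v = np.linalg.solve(A, h**2 * f)          # direct solve (a "direct method")
print(np.max(np.abs(v - np.sin(np.pi * x))))   # error is O(h^2)
```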
1.1.2 Poisson's Equation in Two Dimensions
Now we turn to Poisson’s equation in two dimensions:
-\frac{\partial^2 v(x, y)}{\partial x^2} - \frac{\partial^2 v(x, y)}{\partial y^2} = f(x, y) \qquad (7)
on the unit square {(x, y) : 0 < x, y < 1}, with boundary condition v = 0 on the boundary of the square. We discretize at the grid points in the square, which are at (x_i, y_j) with x_i = ih and y_j = jh, where h = 1/(N + 1). We abbreviate v_{ij} = v(ih, jh) and f_{ij} = f(ih, jh), as shown in the figure below for N = 3:
[Figure: the (N + 2) × (N + 2) grid of points (x_i, y_j) for N = 3, with i, j = 0, …, 4 and spacing h. The blue cross centered at an interior point marks the 5-point stencil.]
From equation (2), we know that we can approximate
-\frac{\partial^2 v(x, y)}{\partial x^2}\bigg|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i-1,j} - v_{i+1,j}}{h^2}
\quad\text{and}\quad
-\frac{\partial^2 v(x, y)}{\partial y^2}\bigg|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i,j-1} - v_{i,j+1}}{h^2}.
Adding these approximations lets us write
-\frac{\partial^2 v(x, y)}{\partial x^2}\bigg|_{x=x_i,\, y=y_j} - \frac{\partial^2 v(x, y)}{\partial y^2}\bigg|_{x=x_i,\, y=y_j}
= \frac{4v_{i,j} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1}}{h^2} - \tau_{ij} \qquad (8)
where τ_{ij} is again a truncation error bounded by O(h^2). The blue cross in the middle of the above figure is called the (5-point) stencil of this equation, because it connects all (5) values of v present in equation (8). From the boundary conditions we know v_{0,j} = v_{N+1,j} = v_{i,0} = v_{i,N+1} = 0, so that equation (8) defines a set of n = N^2 linear equations in the n unknowns v_{ij} for 1 ≤ i, j ≤ N:

4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1} = h^2 f_{ij}. \qquad (9)
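Equation (9) can be applied matrix-free: the matrix A is never formed, only its action on a vector. A minimal sketch (the helper name apply_poisson_2d and the sizes are assumptions made for illustration; zero Dirichlet boundaries are built in via padding):

```python
import numpy as np

# Sketch: apply the 2D Poisson operator of equation (9) matrix-free.
# v holds the N x N interior unknowns; the boundary values are zero.
def apply_poisson_2d(v):
    N = v.shape[0]
    vp = np.zeros((N + 2, N + 2))     # pad with the zero boundary values
    vp[1:-1, 1:-1] = v
    return (4.0 * vp[1:-1, 1:-1]
            - vp[:-2, 1:-1] - vp[2:, 1:-1]    # (i-1, j) and (i+1, j) neighbors
            - vp[1:-1, :-2] - vp[1:-1, 2:])   # (i, j-1) and (i, j+1) neighbors

# Example: residual r = h^2 f - A v
N = 64
h = 1.0 / (N + 1)
f = np.ones((N, N))
v = np.zeros((N, N))
r = h**2 * f - apply_poisson_2d(v)
print(np.linalg.norm(r))
```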
1.2 Summary of Methods for Solving Poisson's Equation
Table 1 lists the costs of various direct and iterative methods for solving the model problem on an N × N grid. The variable n = N^2 is the number of unknowns. Since direct methods provide the exact answer (in the absence of roundoff), whereas iterative methods provide only approximate answers, we must be careful when comparing their costs, since a low-accuracy answer can be computed more cheaply by an iterative method than a high-accuracy answer. Therefore, we compare costs assuming that the iterative methods iterate often enough to make the error at most some fixed small value² (say, 10^{-6}).

² Alternatively, we could iterate until the error is O(h^2) = O((N + 1)^{-2}), the size of the truncation error. One can show that this would increase the costs of the iterative methods in Table 1 by a factor of O(log n).
Method                        Serial time   Space     Type
Dense Cholesky                n^3           n^2       Direct
Explicit inverse              n^2           n^2       Direct
Band Cholesky                 n^2           n^{3/2}   Direct
Jacobi's                      n^2           n         Iterative
Gauß-Seidel                   n^2           n         Iterative
Sparse Cholesky               n^{3/2}       n log n   Direct
Conjugate Gradient            n^{3/2}       n         Iterative
SOR                           n^{3/2}       n         Iterative
SSOR with Chebyshev accel.    n^{5/4}       n         Iterative
Fast Fourier Transform        n log n       n         Direct
Block cyclic reduction        n log n       n         Direct
Multigrid                     n             n         Iterative
Lower bound                   n             n

Table 1: Order of complexity of solving Poisson's equation on an N × N grid (n = N^2).
2 Stationary Iterative Methods
The oldest and simplest iterations for solving linear systems are stationary iterations. These iterations have largely been supplanted by more sophisticated methods (such as Krylov subspace methods), but they are still a useful building block. Stationary iterations are so named because the solution to a linear system is expressed as finding the stationary point (fixed point) of some fixed-point iteration

x^{(k+1)} = F(x^{(k)}).

As is usually the case with fixed-point iterations, linear or nonlinear, the simplest way to establish convergence is generally to establish that the mapping is a contraction, i.e.

‖F(x) - F(y)‖ ≤ α ‖x - y‖, \qquad α < 1.
The constant α then establishes the rate of convergence. If we are solving a linear equation Ax = b, it generally makes sense to choose a fixed-point iteration where the mapping F is affine. We can write any such fixed-point iteration via a splitting of the matrix A, i.e. by writing A = M - K with M nonsingular. Then we can rewrite Ax = b in the form

M x = K x + b,

or, equivalently,

x = x + M^{-1}(b - Ax).

The fixed-point iteration is then

x^{(k+1)} = M^{-1}(K x^{(k)} + b) = x^{(k)} + M^{-1}(b - A x^{(k)}).

If we define R = M^{-1} K and c = M^{-1} b, we can write the iteration as

x^{(k+1)} = R x^{(k)} + c,
and the error e^{(k)} = x^{(k)} - x obeys the iteration

e^{(k+1)} = R e^{(k)}.

A sufficient condition for convergence is then that ‖R‖ < 1 in some operator norm. The actual necessary and sufficient condition is that ρ(R) < 1, where the spectral radius ρ(R) is defined as max |λ| over the eigenvalues λ of R.
2.1 Richardson Iteration
Perhaps the simplest stationary iteration is the Richardson iteration, in which M is chosen to be proportional to the identity:

x^{(k+1)} = x^{(k)} + ω (b - A x^{(k)}).

The iteration matrix in this case is simply R = I - ωA. As long as all the eigenvalues of A have positive real part, Richardson iteration with a small enough ω will eventually converge, but that convergence may take a very long time. In the case that A is symmetric and positive definite, the eigenvalues of R = I - ωA are 1 - ωλ, where λ are the eigenvalues of A. Since in this case R is symmetric, ‖R‖_2 is the largest singular value (largest eigenvalue magnitude):

‖R‖_2 = max(|1 - ω λ_max|, |1 - ω λ_min|).

The rate of convergence is optimal when

|1 - ω λ_max| = |1 - ω λ_min|,

which occurs when ω = 2/(λ_max + λ_min). In this case, the rate of convergence in the 2-norm is determined by

‖R‖_2 = 1 - \frac{2 λ_min}{λ_max + λ_min} = 1 - \frac{2}{κ(A) + 1}.
Thus, if A is ill-conditioned, the iteration may be painfully slow.
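A minimal sketch of Richardson iteration on the 1D model problem, using the optimal step ω = 2/(λ_max + λ_min); the problem size, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

# Sketch: Richardson iteration x <- x + omega * (b - A x) on the 1D Poisson matrix.
def richardson(A, b, omega, x0, tol=1e-8, maxit=20_000):
    x = x0.copy()
    for k in range(maxit):
        r = b - A @ x                       # residual
        if np.linalg.norm(r) < tol:
            return x, k
        x += omega * r
    return x, maxit

N = 50
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
b = np.ones(N) / (N + 1) ** 2
lam = np.linalg.eigvalsh(A)
omega = 2.0 / (lam[0] + lam[-1])            # optimal step for SPD A
x, iters = richardson(A, b, omega, np.zeros(N))
print(iters, np.linalg.norm(b - A @ x))     # many iterations: kappa(A) is large
```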
2.2 Jacobi Method
Jacobi iteration is usually introduced by talking about "sweeping" through the variables and updating each one based on the assumption that the other variables are correct. Component by component, we have

a_{ii} x_i^{(k+1)} + \sum_{j \ne i} a_{ij} x_j^{(k)} = b_i,

or

x_i^{(k+1)} = \Big( b_i - \sum_{j \ne i} a_{ij} x_j^{(k)} \Big) / a_{ii}.

Alternately, we can think of Jacobi's iteration as taking M = D to be the diagonal part of A. The iteration matrix in this case is

R = I - D^{-1} A.
If D, L, and U are the diagonal, strictly lower triangular, and strictly upper triangular portions of A, then the Jacobi method can be written as

x^{(k+1)} = D^{-1} \big( b - (L + U) x^{(k)} \big).

If A is strictly row diagonally dominant, then ‖R‖_∞ < 1, and the iteration converges. When we discuss multigrid, we will also use as a building block the damped Jacobi iteration, where M = ω^{-1} D for some ω < 1.
• The Jacobi method requires nonzero diagonal entries; this can usually be arranged by permuting rows and columns if it is not already true.
• The Jacobi method requires duplicate storage for x, since no component can be overwritten until all new values have been computed.
• Components of the new iterate do not depend on each other, so they can be computed simultaneously.
• The Jacobi method does not always converge, but it is guaranteed to converge under conditions that are often satisfied (e.g., if the matrix is strictly diagonally dominant), though the convergence rate may be very slow.
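A minimal dense-matrix sketch of the Jacobi update; note the separate arrays x and x_new, reflecting the duplicate-storage requirement above. The function name and defaults are illustrative assumptions.

```python
import numpy as np

# Sketch: Jacobi iteration for Ax = b.
def jacobi(A, b, x0, tol=1e-8, maxit=10_000):
    D = np.diag(A)                        # diagonal entries (must be nonzero)
    R = A - np.diagflat(D)                # off-diagonal part L + U
    x = x0.copy()
    for k in range(maxit):
        x_new = (b - R @ x) / D           # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii
        if np.linalg.norm(b - A @ x_new) < tol:
            return x_new, k
        x = x_new
    return x, maxit
```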
2.3 Gauß-Seidel Method
For the Jacobi iteration, we think of using equation j to update variable x_j under the assumption that the old values of all neighboring variables are correct. For the Gauß-Seidel iteration, we think of updating x_1, x_2, … in turn, at each step using the most recent values of all the other variables. That is, we update according to

\sum_{j \le i} a_{ij} x_j^{(k+1)} + \sum_{j > i} a_{ij} x_j^{(k)} = b_i.
If we write A = D - L̃ - Ũ = D(I - L - U), where -L̃ and -Ũ are the strictly lower and upper triangular parts of A, then Gauß-Seidel corresponds to using M = D(I - L), and the iteration operator is

R = (I - L)^{-1} U.

In the case of strict row diagonal dominance, ‖R‖_∞ < 1; in fact, if R_{GS} and R_J are the iteration operators for Gauß-Seidel and Jacobi, then for strictly row diagonally dominant A we have

‖R_{GS}‖_∞ ≤ ‖R_J‖_∞ < 1.
We can also show that Gauß-Seidel converges in the symmetric positive definite case. Because the
analysis technique will be relevant to some later discussions, we will take a moment to describe
the argument. If A is symmetric positive definite, then the solution of Ax = b is also the unique
minimum of the quadratic function
φ(x) = \frac{1}{2} x^T A x - x^T b.
Now suppose that x̂ is an approximate solution, and we want to get a better solution of the form x̂′ = x̂ + α e_i. Note that

φ(x̂ + α e_i) = φ(x̂) + \frac{1}{2} α^2 a_{ii} + α e_i^T (A x̂ - b),

which we can minimize by choosing

a_{ii} α = e_i^T (b - A x̂).

This exactly corresponds to the update

a_{ii} x_i^{(new)} + \sum_{j \ne i} a_{ij} x_j^{(prev)} = b_i.
Thus, Gauß-Seidel iteration can be seen as a coordinate-descent minimization algorithm with exact
line searches. Furthermore, note that if Ax = b and x̂ = x + e is an approximation, then
φ(x̂) = φ(x) + \frac{1}{2} e^T A e,

so

φ(x̂) - φ(x) = \frac{1}{2} ‖e‖_A^2,

where ‖e‖_A is the error measured in the "energy norm" induced by the positive definite matrix A. So in this
case, the Gauß-Seidel iteration is monotonically convergent in the norm associated with A.
In many practical cases, even those that are not strictly diagonally dominant or symmetric and
positive definite, Gauß-Seidel converges somewhat faster than Jacobi. But proving this requires
knowing something about the structure of the problem. Outside of strictly row-diagonally dominant
A, there are examples where Jacobi converges and Gauß-Seidel does not, and vice-versa.
In summary,
• Gauß-Seidel requires nonzero diagonal entries.
• Gauß-Seidel does not require duplicate storage for x, since component values can be overwritten as they are computed.
• But each component depends on the previous ones, so they must be computed successively.
• Gauß-Seidel does not always converge, but it is guaranteed to converge under conditions that are somewhat weaker than those for the Jacobi method (e.g., if the matrix is symmetric and positive definite).
• Gauß-Seidel converges about twice as fast as Jacobi, but may still be very slow.
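A minimal sketch of a Gauß-Seidel sweep; in contrast to the Jacobi sketch above, components are overwritten in place, so each update sees the newest values. The function name and defaults are illustrative assumptions.

```python
import numpy as np

# Sketch: Gauss-Seidel iteration for Ax = b.
def gauss_seidel(A, b, x0, tol=1e-8, maxit=10_000):
    n = len(b)
    x = x0.copy()
    for k in range(maxit):
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - sigma) / A[i, i]   # uses new x[:i], old x[i+1:]
        if np.linalg.norm(b - A @ x) < tol:
            return x, k
    return x, maxit
```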
2.4 Successive Over-relaxation (SOR)
If the Gauß-Seidel iteration gives us a good update, perhaps going even further in the Gauß-Seidel direction would give an even better update. This is the idea behind SOR (successive over-relaxation):

x^{(k+1)} = (1 - ω) x^{(k)} + ω x^{(k+1)}_{GS}.
The case ω < 1 is called under-relaxation; the case ω > 1 is over-relaxation. If A is symmetric, a forward SOR sweep followed by a backward sweep yields a symmetric iteration operator; this is the symmetric successive over-relaxation (SSOR) method. SOR diverges unless 0 < ω < 2, but choosing the optimal ω is difficult in general, except for special classes of matrices. With the optimal value of ω, the convergence rate of SOR can be an order of magnitude faster than that of Gauß-Seidel.
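A minimal sketch of SOR as a relaxed Gauß-Seidel sweep; ω = 1 recovers Gauß-Seidel, and the default ω = 1.5 below is just an illustrative choice, not the optimal value.

```python
import numpy as np

# Sketch: SOR iteration for Ax = b with relaxation parameter omega in (0, 2).
def sor(A, b, x0, omega=1.5, tol=1e-8, maxit=10_000):
    n = len(b)
    x = x0.copy()
    for k in range(maxit):
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x_gs = (b[i] - sigma) / A[i, i]             # Gauss-Seidel value
            x[i] = (1.0 - omega) * x[i] + omega * x_gs  # relaxed update
        if np.linalg.norm(b - A @ x) < tol:
            return x, k
    return x, maxit
```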
3 Krylov Subspace Methods
These methods are used both to solve Ax = b and to find eigenvalues of A. They assume that A is accessible only via a "black-box" subroutine that returns y = Az given any z (and perhaps y = A^T z if A is nonsymmetric). In other words, no direct access to or manipulation of matrix entries is used. This is a reasonable assumption for several reasons. First, the cheapest nontrivial operation that one can perform on a (sparse) matrix is to multiply it by a vector; if A has m nonzero entries, matrix-vector multiplication costs m multiplications and (at most) m additions. Second, A may not be represented explicitly as a matrix but may be available only as a subroutine for computing Ax.

A variety of different Krylov subspace methods exist. Some are suitable for nonsymmetric matrices, and others assume symmetry or positive definiteness. Some methods for nonsymmetric matrices assume that A^T z can be computed as well as Az; depending on how A is represented, A^T z may or may not be available. The most efficient and best understood method, the conjugate gradient method (CG), is suitable only for symmetric positive definite matrices, including the model problem.
3.1 Conjugate Gradient Method
We say that two non-zero vectors u and v are conjugate (with respect to A) if

u^T A v = 0.

Since A is symmetric and positive definite, the left-hand side defines an inner product

⟨u, v⟩_A := ⟨Au, v⟩ = ⟨u, A^T v⟩ = ⟨u, Av⟩ = u^T A v.

Two vectors are conjugate if and only if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if u is conjugate to v, then v is conjugate to u.
Suppose that P = {p_k : ⟨p_i, p_k⟩_A = 0 for all i ≠ k, with i, k ∈ [1, n]} is a set of n mutually conjugate directions. Then P is a basis of R^n, so within P we can expand the solution x_* of Ax = b:

x_* = \sum_{i=1}^{n} α_i p_i,

and we see that

b = A x_* = \sum_{i=1}^{n} α_i A p_i.

For any p_k ∈ P,

p_k^T b = p_k^T A x_* = \sum_{i=1}^{n} α_i p_k^T A p_i = α_k p_k^T A p_k
(because p_i and p_k are mutually conjugate for all i ≠ k), so

α_k = \frac{p_k^T b}{p_k^T A p_k} = \frac{⟨p_k, b⟩}{⟨p_k, p_k⟩_A} = \frac{⟨p_k, b⟩}{‖p_k‖_A^2}.
This gives the following method for solving the equation Ax = b: find a sequence of n conjugate directions, and then compute the coefficients α_k. If we choose the conjugate vectors p_k carefully, we may not need all of them to obtain a good approximation to the solution x_*.

We denote the initial guess for x_* by x_0. Starting from x_0 we search for the solution, and at each iteration we need a metric to tell us whether we are closer to the solution x_*. This metric comes from the fact that the solution x_* is also the unique minimizer of the following quadratic function:
f(x) = \frac{1}{2} x^T A x - x^T b, \qquad x ∈ R^n.

This suggests taking the first basis vector p_0 to be the negative of the gradient of f at x = x_0. The gradient of f equals Ax - b, so starting from x_0 we take p_0 = b - A x_0. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.
Let r_k be the residual at the k-th step:

r_k = b - A x_k.

Note that r_k is the negative gradient of f at x = x_k, so the gradient descent method would move in the direction r_k. Here, we instead insist that the directions p_k be conjugate to each other. We also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.
This gives the following expression:

p_k = r_k - \sum_{i < k} \frac{p_i^T A r_k}{p_i^T A p_i} \, p_i.
Following this direction, the next optimal location is given by

x_{k+1} = x_k + α_k p_k,

with

α_k = \frac{p_k^T b}{p_k^T A p_k} = \frac{p_k^T (r_{k-1} + A x_{k-1})}{p_k^T A p_k} = \frac{p_k^T r_{k-1}}{p_k^T A p_k},

where the last equality holds because p_k and x_{k-1} are conjugate.
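In the standard implementation the sums over all previous directions collapse into a short recurrence. A minimal sketch of the resulting algorithm (the function name and defaults are illustrative assumptions; only products with A are needed, so A could equally be a matrix-free operator):

```python
import numpy as np

# Sketch: conjugate gradient method for symmetric positive definite A.
def conjugate_gradient(A, b, x0, tol=1e-8, maxit=1000):
    x = x0.copy()
    r = b - A @ x                      # initial residual (negative gradient of f)
    p = r.copy()                       # first search direction p_0
    rs_old = r @ r
    for k in range(maxit):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, k
        p = r + (rs_new / rs_old) * p  # new direction, A-conjugate to the old ones
        rs_old = rs_new
    return x, maxit
```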
Key features of the conjugate gradient method:
• Short recurrence determines search directions that are A-orthogonal (conjugate)
• Error is minimal over space spanned by search directions generated so far
• Minimum error property implies that method produces exact solution after at most n steps
• In practice, rounding error causes loss of orthogonality that spoils finite termination property,
so method is used iteratively
• Error is reduced at each iteration by a factor of (√κ - 1)/(√κ + 1) on average, where κ = cond(A) = ‖A‖ · ‖A^{-1}‖ = λ_max(A)/λ_min(A)
• Thus, convergence tends to be rapid if matrix is well-conditioned, but can be arbitrarily slow
if matrix is ill-conditioned
• But convergence also depends on clustering of eigenvalues of A.
3.2 Preconditioning
In the previous section we saw that the convergence rate of CG depends on the condition number of A, or more generally on the distribution of A's eigenvalues. Other Krylov subspace methods have the same property. Preconditioning means replacing the system Ax = b with the system M^{-1} A x = M^{-1} b, where M is an approximation to A with the properties that

1. M is symmetric and positive definite,
2. M^{-1} A is well conditioned or has few extreme eigenvalues,
3. M x = b is easy to solve.

A careful, problem-dependent choice of M can often make the condition number of M^{-1} A much smaller than the condition number of A and thus accelerate convergence dramatically. Indeed, a good preconditioner is often necessary for an iterative method to converge at all, and much current research in iterative methods is directed at finding better preconditioners. We will consider multigrid as a preconditioner for CG later in the class. The combination results in a very powerful, optimal solver.
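A minimal sketch of preconditioned CG, using a simple Jacobi (diagonal) preconditioner M = diag(A) purely for illustration; a stronger, problem-dependent M (e.g. multigrid) would replace the apply_Minv step. All names and defaults here are assumptions, not part of the notes.

```python
import numpy as np

# Sketch: preconditioned conjugate gradient with M = diag(A).
def pcg(A, b, x0, tol=1e-8, maxit=1000):
    d = np.diag(A)
    apply_Minv = lambda r: r / d       # solve M z = r for the Jacobi preconditioner
    x = x0.copy()
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz_old = r @ z
    for k in range(maxit):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, k
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p
        rz_old = rz_new
    return x, maxit
```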
4 Parallel Implementation
Iterative methods for linear systems are composed of basic operations such as
• vector updates (daxpy)
• inner products
• matrix-vector multiplication (matvec)
• solution of triangular systems
In a parallel implementation, both data and operations are partitioned across multiple tasks. In addition to the communication required for these basic operations, the convergence test may require additional communication (e.g., a sum or max reduction). Iterative methods typically require several vectors, including the solution x, the right-hand side b, the residual r = b - Ax, and possibly others. These vectors are typically partitioned uniformly among the p tasks, with a given task holding the same set of component indices of each vector. Thus, vector updates require no communication, whereas inner products of vectors require reductions across tasks, at costs we have already seen.
4.1 Partitioning of Sparse Matrices and Matrix-free Implementations
A sparse matrix A can be partitioned among tasks by rows, by columns, or by submatrices. Partitioning into equal-sized submatrices may give an uneven distribution of nonzeros among tasks; indeed, some submatrices may contain no nonzeros at all. Adaptively determining the submatrices can give better results, but is itself a hard partitioning problem. Partitioning by rows or by columns tends to yield a more uniform distribution, because sparse matrices typically have about the same number of nonzeros in each row or column.

For large systems, it is generally preferable to use matrix-free implementations. In such methods, instead of storing and partitioning the matrix, we compute the matrix-vector product (matvec) directly. This means that the partitioning is done only on the vectors, say x and b. Each process owns a set of indices and computes the corresponding entries of Ax by using the analytical expressions for the matrix entries. For example, in the case of the 2D Poisson problem, we would use equation (9) to compute the matrix-vector product or the residual. Note that communication may be required if any of the indices (i ± 1, j), (i, j ± 1) are not owned by the same process. If the processes are arranged in a Cartesian topology and the domain is assigned accordingly, then these communications only need to be performed with immediate neighbors.
Note that it is possible to overlap communication with computation while performing the matvec. The steps are as follows (see the sketch after this list):

• post receives for the neighbors' boundary data (MPI_Irecv)
• send own process-boundary data to the appropriate neighbors (MPI_Isend)
• compute the process-interior part of the matvec
• wait for the communication to finish (MPI_Wait)
• compute the process-boundary part of the matvec
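The sketch below illustrates these steps with mpi4py for the 2D Poisson matvec of equation (9), using a simple 1D strip decomposition in which each rank owns a block of consecutive grid rows. The decomposition, array names, and sizes are assumptions made for the sake of the example; a 2D Cartesian decomposition would exchange halos in both directions.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
up   = rank - 1 if rank > 0 else MPI.PROC_NULL         # neighbor owning the rows above
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL  # neighbor owning the rows below

N = 64                                  # global N x N interior grid (assume size divides N)
rows = N // size                        # rows owned by this rank
# local block plus one halo row above/below and zero ghost columns left/right
v = np.zeros((rows + 2, N + 2))
v[1:-1, 1:-1] = np.random.rand(rows, N)

# 1. post receives for the halo rows, 2. send own boundary rows (non-blocking)
send_up, send_down = v[1, :].copy(), v[-2, :].copy()   # keep send buffers alive
reqs = [comm.Irecv(v[0, :],  source=up,   tag=0),
        comm.Irecv(v[-1, :], source=down, tag=1),
        comm.Isend(send_up,   dest=up,   tag=1),
        comm.Isend(send_down, dest=down, tag=0)]

def stencil_row(i):
    # 4*v[i,j] minus the four neighbors, for the N owned columns of local row i
    return (4.0 * v[i, 1:-1] - v[i - 1, 1:-1] - v[i + 1, 1:-1]
            - v[i, :-2] - v[i, 2:])

Av = np.empty((rows, N))

# 3. compute the rows that do not touch the halo while messages are in flight
for i in range(2, rows):
    Av[i - 1] = stencil_row(i)

# 4. wait for the halo exchange to finish
MPI.Request.Waitall(reqs)

# 5. compute the process-boundary rows, which need the halo data
Av[0]  = stencil_row(1)
Av[-1] = stencil_row(rows)
```

Run with, for example, mpiexec -n 4 python overlap_matvec.py (the file name is arbitrary).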
4.2 Termination of Iterations
Terminating an iterative method requires computing some form of error norm (ℓ_2, ℓ_∞, energy) to determine when the solution is close enough, allowing us to stop the iteration. These norms require communication and, more importantly, a reduction (typically a sum or a max). Since a reduction is a collective communication, it can be detrimental for large problems. For example, a single Jacobi iteration for our sample problem is likely to be cheaper than the norm computation (at least for large p). In such cases a better strategy is to iterate a few times without checking the termination condition, in order to amortize the cost of the norm computation. This strategy is less relevant for methods like CG, which already require the norm (inner products, and hence reductions) to be computed at every iteration.
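A minimal sketch of this amortization with mpi4py: the global residual norm (an Allreduce) is computed only every check_every sweeps. The kernels local_jacobi_sweep and local_residual are hypothetical placeholders for the user's matrix-free routines (e.g. the stencil code above).

```python
from mpi4py import MPI
import numpy as np

# Sketch: amortize the collective norm check over several Jacobi sweeps.
def iterate(local_jacobi_sweep, local_residual, x, comm,
            tol=1e-8, maxit=10_000, check_every=10):
    for k in range(maxit):
        x = local_jacobi_sweep(x)            # halo exchange only, no collectives
        if (k + 1) % check_every == 0:
            r = local_residual(x)
            local_sq  = np.array([r @ r])    # ||r||^2 contribution of this rank
            global_sq = np.empty(1)
            comm.Allreduce(local_sq, global_sq, op=MPI.SUM)   # one collective
            if np.sqrt(global_sq[0]) < tol:
                return x, k + 1
    return x, maxit
```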