Class Notes 11: THE CONJUGATE GRADIENT METHOD

Math 639d
Due Date: Nov. 13
(updated: November 6, 2014)
In this set of notes, A will be a symmetric and positive definite real
n × n matrix. Everything that is done here also works for positive definite
Hermitian complex n × n matrices.
The steepest descent method makes the error ei+1 A-orthogonal to ri. Unfortunately, if this step results in little change, then ri+1 lies in almost the same direction as ri, so the step to ei+2 is not very effective either, since ei+1 is already A-orthogonal to ri and almost A-orthogonal to ri+1. This is a shortcoming of the steepest descent method.
The fix is simple. We generalize the algorithm and let pi be the direction
which we use to compute ei+1 . We make ei+1 A-orthogonal to pi . The idea
is to preserve this orthogonality when going to ei+2. Since ei+1 is already A-orthogonal to pi, ei+2 will remain A-orthogonal to pi only if our new search
direction pi+1 is also A-orthogonal to pi . Thus, instead of using ri+1 as our
next search direction, we use the component of ri+1 which is A-orthogonal
to pi , i.e.,
pi+1 = ri+1 − βi pi
where βi is chosen so that
(pi+1 , pi )A = 0.
A simple computation gives
βi = (ri+1, pi)A / (pi, pi)A.
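For completeness, the computation is short; written out in LaTeX (using only the bilinearity of (·, ·)A and the defining condition above) it reads:

0 = (p_{i+1}, p_i)_A = (r_{i+1} - \beta_i p_i,\, p_i)_A
  = (r_{i+1}, p_i)_A - \beta_i (p_i, p_i)_A
\qquad\Longrightarrow\qquad
\beta_i = \frac{(r_{i+1}, p_i)_A}{(p_i, p_i)_A}.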
We continue by making ei+2 A-orthogonal to pi+1, i.e.,
xi+2 = xi+1 + αi+1 pi+1
with αi+1 satisfying
(ei+1 − αi+1 pi+1, pi+1)A = 0, or αi+1 = (ri+1, pi+1) / (Api+1, pi+1),
where we used (ei+1, pi+1)A = (Aei+1, pi+1) = (ri+1, pi+1).
As both ei+1 and pi+1 are A-orthogonal to pi and ei+2 = ei+1 − αi+1 pi+1 ,
ei+2 is A-orthogonal to both pi and pi+1 . The above discussion leads to the
following algorithm.
Algorithm 1. (Conjugate Gradient). Let A be a SPD n × n real matrix
and x0 ∈ Rn (the initial iterate) and b ∈ Rn (the right hand side) be given.
Start by setting p0 = r0 = b − Ax0 . Then for i = 0, 1, . . ., define
(11.1)
xi+1 = xi + αi pi,     where αi = (ri, pi) / (Api, pi),
ri+1 = ri − αi Api,
pi+1 = ri+1 − βi pi,   where βi = (ri+1, Api) / (Api, pi).
Remarks:
• The above algorithm works for complex valued Hermitian positive definite matrices A provided that one uses the sesquilinear inner product (u, v) := u · v̄ with v̄ denoting the complex conjugate of v (see the sketch following these remarks).
• Notice that we have moved the matrix-vector evaluation in the above
inner products so that it is clear that only one matrix-vector evaluation, namely Api , is required per iterative step after start up.
• The first step in the conjugate gradient method coincides with the
steepest descent algorithm. It follows that the CG algorithm is not
a linear iterative method.
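As a small illustration of the first remark, the following NumPy sketch defines such a sesquilinear inner product; the routine name ip is an illustrative choice, not part of the notes. With this inner product, the same CG loop applies to a Hermitian positive definite A.

import numpy as np

def ip(u, v):
    # Sesquilinear inner product (u, v) := u · conj(v) from the first remark.
    # np.vdot conjugates its *first* argument, so v is passed first.
    return np.vdot(v, u)

# For a Hermitian positive definite A and u != 0, ip(A @ u, u) is real and positive.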
We illustrate pseudo code for the conjugate gradient algorithm below. We have implicitly assumed that A(X) is a routine which returns the result of A applied to X and IP(·, ·) returns the result of the inner product. Here k is the number of iterations, X is x0 on input and X is xk on return.
FUNCTION CG(X, B, A, k, IP)
  R = P = B − A(X);
  FOR j = 1, 2, . . . , k DO {
    AP = A(P); al = IP(R, P)/IP(P, AP);
    X = X + al ∗ P; R = R − al ∗ AP;
    be = IP(R, AP)/IP(P, AP);
    P = R − be ∗ P;
  }
  RETURN
END
The above code is somewhat terse but was included to illustrate the fact
that one can implement CG with exactly 3 extra vectors, R, P , and AP .
An actual code would include extra checks for consistency and convergence.
For example, a tolerance might be passed and the residual might be tested
against it, causing the routine to return when the desired tolerance was achieved. Also, for consistency, one would check that (Ap, p) > 0, for when this is negative or zero, it is a sure sign that either the matrix is no good (not SPD) or that you have iterated to convergence (we shall see below that the solution has been reached when (Ap, p) = 0).
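To make this concrete, here is a minimal NumPy sketch along the lines of the pseudo code above, with the extra checks just described (a residual tolerance and a guard on (Ap, p)); the names cg, tol and maxit are illustrative choices, not part of the notes.

import numpy as np

def cg(A, b, x0, maxit=100, tol=1e-10):
    # Conjugate gradient following Algorithm 1; uses three extra vectors r, p, ap.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(maxit):
        ap = A @ p
        pap = np.dot(p, ap)
        if pap <= 0.0:
            # Either A is not SPD or we have already converged (see the remark above).
            break
        al = np.dot(r, p) / pap          # alpha_i = (r_i, p_i) / (A p_i, p_i)
        x = x + al * p
        r = r - al * ap
        if np.linalg.norm(r) <= tol:
            break
        be = np.dot(r, ap) / pap         # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
        p = r - be * p
    return x

For example, cg(np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([1.0, 2.0]), np.zeros(2)) returns the solution of the corresponding 2 × 2 system.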
Below, HW indicates that the result is part of the next homework assignment.
To analyze the conjugate gradient method, we start by introducing the
so-called Krylov subspace,
Km := Km(A, r0) := span{r0, Ar0, A^2 r0, . . . , A^{m−1} r0}.
It is clear that the dimension of Km is at most m. Sometimes, Km has dimension less than m; for example, if r0 is an eigenvector of A with eigenvalue λ, then dim(Km) = 1 for m = 1, 2, . . ., since A^j r0 = λ^j r0.
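The following small NumPy sketch illustrates this by building r0, Ar0, . . . , A^{m−1} r0 and reporting the dimension of their span via a rank computation; the diagonal test matrix is an arbitrary illustrative choice.

import numpy as np

def krylov_dim(A, r0, m):
    # Dimension of K_m(A, r0) = span{r0, A r0, ..., A^{m-1} r0}.
    vectors = [r0]
    for _ in range(m - 1):
        vectors.append(A @ vectors[-1])
    return np.linalg.matrix_rank(np.column_stack(vectors))

A = np.diag([1.0, 2.0, 3.0])
print(krylov_dim(A, np.array([1.0, 0.0, 0.0]), 3))  # 1: r0 is an eigenvector
print(krylov_dim(A, np.ones(3), 3))                 # 3 for this generic r0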
A simple argument by mathematical induction (HW) shows that the
search direction pi is in Ki+1 , for i = 0, 1, . . .. It follows that xi = x0 + θ for
some θ ∈ Ki and ei = e0 − θ (for the same θ).
We note that the dimension of the Krylov space keeps growing until A^l r0 ∈ Kl, in which case Kl = Kl+1 = Kl+2 = · · · (HW). Let l0 be the minimal such value, i.e., A^{l0} r0 ∈ K_{l0} but A^{l0−1} r0 ∉ K_{l0−1}. Then there are coefficients c0, . . . , c_{l0−1} satisfying
(11.2)
c0 r0 + c1 Ar0 + · · · + c_{l0−1} A^{l0−1} r0 = A^{l0} r0.
If c0 = 0, multiplying (11.2) by A^{−1} would imply that A^{l0−1} r0 ∈ K_{l0−1}. As we have assumed that this is not the case, we must have c0 ≠ 0 and so
e0 = A^{−1} r0 = c0^{−1} (A^{l0−1} r0 − c1 r0 − c2 Ar0 − · · · − c_{l0−1} A^{l0−2} r0) ∈ K_{l0}.
Similar arguments show that l0 is, in fact, the smallest index such that e0 ∈ K_{l0} (HW).
It is natural to consider the best approximation with respect to the A-norm to x over vectors of the form x̃i = x0 + θ for θ ∈ Ki, i.e.,
(11.3)
‖x − x̃i‖A := ‖x − (x0 + θ)‖A := min_{ζ∈Ki} ‖x − (x0 + ζ)‖A.
This can be rewritten as
(11.4)
‖ẽi‖A = min_{ζ∈Ki} ‖e0 − ζ‖A
with ẽi = x − x̃i. When i = l0, we may take ζ = e0 ∈ K_{l0} to conclude that ẽ_{l0} = 0 and hence x_{l0} = x, i.e., we have the solution at the l0'th step.
The minimization problem over a finite dimensional space with respect to a norm which comes from an inner product can always be reduced to a matrix problem. We illustrate this for the above minimization problem in the following lemma:
Lemma 1. There is a unique solution θ ∈ Ki solving the minimization problem (11.3). It is characterized as the unique element of Ki satisfying
(11.5)
(x − x0 − θ, ζ)A = (x − x̃i , ζ)A = 0 for all ζ ∈ Ki .
Proof. We first show that there is a unique function θ ∈ Ki for which (11.5)
holds. Given a basis {v1 , . . . , vl } for the Krylov space Ki , we expand
θ=
l
X
cj vj .
j=1
The condition (11.5) is equivalent to
(x − x0 − Σ_{j=1}^{l} cj vj, vm)A = 0 for m = 1, . . . , l
or, using the bilinearity of (·, ·)A ,
Σ_{j=1}^{l} cj (vj, vm)A = (b − Ax0, vm) for m = 1, . . . , l.
This is the same as the matrix problem N c = F with N ∈ R^{l×l} and F, c ∈ R^l and
Nm,j = (vj, vm)A and Fm = (b − Ax0, vm), for j, m = 1, . . . , l.
This problem has a unique solution if N is nonsingular. To check this, we
let d ∈ Rl be arbitrary and compute
(N d, d) = Σ_{j,k=1}^{l} Nj,k dk dj = Σ_{j,k=1}^{l} (vk, vj)A dk dj = (Σ_{k=1}^{l} dk vk, Σ_{j=1}^{l} dj vj)A = (w, w)A
where w = Σ_{j=1}^{l} dj vj. Since {v1, . . . , vl} is a basis, w is nonzero whenever
d is nonzero. It follows from the definiteness property of the inner product
that 0 ≠ (w, w)A = (N d, d) whenever d ≠ 0. Thus, 0 is the only vector for
which N d = 0, i.e., the matrix N is nonsingular and there is a unique θ ∈ Ki
satisfying (11.5).
We next show that the unique solution θ of (11.5) is a solution to the
minimization problem. Indeed, if ζ is in Ki then
‖x − x̃i‖²A = (x − x̃i, x − [(x0 + ζ) + (θ − ζ)])A = (x − x̃i, x − (x0 + ζ))A ≤ ‖x − x̃i‖A ‖x − (x0 + ζ)‖A,
where the second equality follows from (11.5) (note that θ − ζ ∈ Ki) and the inequality is Cauchy-Schwarz. It follows that
‖x − x̃i‖A ≤ ‖x − (x0 + ζ)‖A.
As this holds for all ζ in Ki, θ is a minimizer.
The proof will be complete once we show that anything which is a minimizer satisfies (11.5) and hence (11.5) gives rise to the unique minimizer.
Suppose that θ solves the minimization problem and x̃i = x0 + θ. Given
ζ ∈ Ki , we consider for t ∈ R,
f(t) = ‖x − x̃i + tζ‖²A = (x − x̃i + tζ, x − x̃i + tζ)A.
By using bilinearity of the inner product, we see that
f(t) = (x − x̃i, x − x̃i)A + 2t(x − x̃i, ζ)A + t²(ζ, ζ)A.
Since x̃i is the minimizer, f′(0) = 2(x − x̃i, ζ)A = 0, i.e., (11.5) holds. □
The proof of the above lemma shows one way of solving the minimization
problem, i.e., constructing the matrix N , solving N c = F and setting
x̃i = x0 + Σ_{j=1}^{l} cj vj.
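As a hypothetical illustration of this route, the NumPy sketch below builds a Krylov basis, assembles N and F as in the lemma, solves N c = F and forms x̃i; the helper name krylov_minimizer and the test matrix are illustrative only. At i = l0 the result should agree (up to round-off) with the exact solution, and the theorem below asserts that it agrees with the i-th CG iterate as well.

import numpy as np

def krylov_minimizer(A, b, x0, i):
    # Minimize ||x - (x0 + theta)||_A over theta in K_i via the matrix problem N c = F.
    r0 = b - A @ x0
    vectors = [r0]
    for _ in range(i - 1):
        vectors.append(A @ vectors[-1])
    V = np.column_stack(vectors)      # columns: a basis of K_i (assumed linearly independent)
    N = V.T @ (A @ V)                 # N[m, j] = (v_j, v_m)_A
    F = V.T @ r0                      # F[m] = (b - A x0, v_m)
    c = np.linalg.solve(N, F)
    return x0 + V @ c

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x0 = np.zeros(3)
print(np.allclose(krylov_minimizer(A, b, x0, 3), np.linalg.solve(A, b)))  # True: here l0 = 3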
The following theorem shows that the conjugate gradient algorithm provides
a much more elegant solution.
Theorem 1. (CG-equivalence) Let l0 be as above. Then for i = 1, 2, . . . , l0 ,
xi defined by the conjugate gradient (CG) algorithm coincides with the solution x̃i of the minimization problem (11.3).
The above theorem shows that the conjugate gradient algorithm provides
a very efficient implementation of the Krylov minimization problem (11.3).
As discussed above, at least mathematically, we get the exact solution on
the l0'th step (x_{l0} = x). At this point the CG algorithm breaks down as r_{l0} = p_{l0} = 0 and so the denominators are zero.
Historically, when the conjugate gradient method was discovered, it was
proposed as a direct solver. Since Kl ⊆ R^n, l0 can be at most n. The above theorem shows that we get convergence in at most l0 iterations. However, it was found that, because of round-off errors, implementations of the CG method sometimes failed to converge in n iterations. Nevertheless, CG is
very effective when used as an iterative method on problems with reasonable
condition numbers. The following theorem gives a bound for its convergence
rate.
Theorem 2. (CG-error) Let A be a SPD n×n matrix and ei be the sequence
of errors generated by the conjugate gradient algorithm. Then,
‖ei‖A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e0‖A.
Here K is the spectral condition number of A.
Proof. Consider Gi of the multi-parameter Theorem 2 of Class Notes 10 with
τ1 , . . . , τi coming from λ0 = λ1 and λ∞ = λn . Then,
Gi e0 = ∏_{j=1}^{i} (I − τj A) e0.
Expanding the right hand side and using Ae0 = r0 , we see that Gi e0 =
x − x0 − ζ for some ζ ∈ Ki . Since CG is an implementation of the Krylov
minimization (11.3),
‖ei‖A ≤ ‖Gi e0‖A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e0‖A. □
The above theorem guarantees an accelerated convergence rate (O(√K) iterations for a fixed error reduction) when compared to Gauss-Seidel or Jacobi iterations. In addition, the method does not require bounds on the spectrum.
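To get a feel for the bound in Theorem 2, the following sketch evaluates the reduction factor (√K − 1)/(√K + 1) for a few condition numbers and the number of iterations the bound needs for a fixed error reduction of 10^−6; the tolerance and condition numbers are arbitrary illustrative choices, and the √K growth of the iteration count is the point.

import math

for K in [10.0, 1.0e2, 1.0e4]:
    rho = (math.sqrt(K) - 1.0) / (math.sqrt(K) + 1.0)
    # smallest i with 2 * rho**i <= 1e-6, i.e. the bound guarantees a 1e-6 reduction
    i = math.ceil(math.log(1e-6 / 2.0) / math.log(rho))
    print(K, rho, i)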
Proof of Theorem (CG-equivalence). The proof is by induction. We shall
prove that for i = 1, . . . , l0 ,
(I.1) {p0 , p1 , . . . , pi−1 } forms an A-orthogonal basis for Ki .
(I.2) (ei , θ)A = 0 for all θ ∈ Ki .
By definition, p0 = r0, which forms a basis for K1. Moreover, (e1, r0)A = 0 by the definition of α0, so (I.1) and (I.2) hold for i = 1.
We inductively assume that (I.1) and (I.2) hold for i = k with k < l0 . By
the definition of βk−1 , pk is A-orthogonal to pk−1 . For j < k − 1,
(11.6)
(pk , pj )A = (rk − βk−1 pk−1 , pj )A = (ek , Apj )A − βk−1 (pk−1 , pj )A .
Of the two terms on the right, the first vanishes by Assumption (I.2) with k
since Apj ∈ Kk while the second vanishes by assumption (I.1) with k. Thus,
pk satisfies the desired orthogonality properties.
The validity of (I.1) for k + 1 will follow if we show that pk 6= 0. If pk = 0
then rk = βk−1 pk−1 ∈ Kk and, by (I.1) and (I.2) at k,
0 = (ek, pk−1)A = (rk, pk−1) = βk−1 (pk−1, pk−1).
This implies that βk−1 = 0, so rk = 0 and hence ek = 0. This means that e0 ∈ Kk
contradicting the assumption that k < l0 .
We next verify (I.2) at k+1. As observed earlier, pk and αk are constructed
so that ek+1 is A-orthogonal to both pk and pk−1 . To complete the proof,
we need only check its orthogonality to pj for j < k − 1. In this case,
(11.7)
(ek+1 , pj )A = (ek − αk pk , pj )A = (ek , pj )A − αk (pk , pj )A .
The first term is zero by the induction assumption (I.2) while the second
vanishes by (I.1) for k + 1 (which we proved above).
Thus, (I.1) and (I.2) hold for i = 1, 2, . . . , l0. We already observed that ei = e0 − θ for some θ ∈ Ki. Thus, (I.2) and Lemma 1 imply that xi = x̃i. □