12: CONJUGATE GRADIENT, PRECONDITIONED CONJUGATE GRADIENT

Math 639 (updated: January 2, 2012)

Below, HW indicates that the result is part of the next homework assignment.

We shall study the conjugate gradient algorithm applied to an SPD $n \times n$ matrix $A$ in this class. We start by introducing the so-called Krylov subspace,
\[
K_m = K_m(A, r_0) \equiv \operatorname{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{m-1} r_0\}.
\]
It is clear that the dimension of $K_m$ is at most $m$. Sometimes $K_m$ has dimension less than $m$; for example, if $r_0$ is an eigenvector, then $\dim(K_m) = 1$ for $m = 1, 2, \ldots$ (why?).

A simple argument by mathematical induction (HW) shows that the search direction $p_i$ is in $K_{i+1}$ for $i = 0, 1, \ldots$. It follows that $x_i = x_0 + \theta$ for some $\theta \in K_i$ and $e_i = e_0 - \theta$ (for the same $\theta$). We note that the dimension of the Krylov space keeps growing until $A^l r_0 \in K_l$, in which case $K_l = K_{l+1} = K_{l+2} = \cdots$ (HW). Let $l_0$ be the minimal such value, i.e., $A^{l_0} r_0 \in K_{l_0}$ but $A^{l_0-1} r_0 \notin K_{l_0-1}$. Then there are coefficients $c_0, \ldots, c_{l_0-1}$ satisfying
\[
c_0 r_0 + c_1 A r_0 + \cdots + c_{l_0-1} A^{l_0-1} r_0 = A^{l_0} r_0.
\]
Since $A^{l_0-1} r_0 \notin K_{l_0-1}$, $c_0 \neq 0$ (why?) and so
\[
e_0 = A^{-1} r_0 = c_0^{-1}\bigl(A^{l_0-1} r_0 - c_1 r_0 - c_2 A r_0 - \cdots - c_{l_0-1} A^{l_0-2} r_0\bigr) \in K_{l_0}.
\]
Similar arguments show that $l_0$ is, in fact, the smallest index such that $e_0 \in K_{l_0}$ (HW).

It is natural to consider the best approximation $\tilde x_i$ over vectors of the form $\tilde x_i = x_0 + \theta$ with $\theta \in K_i$ (best approximation with respect to the $A$-norm), i.e.,
\[
(12.1) \qquad \|x - \tilde x_i\|_A = \min_{\zeta \in K_i} \|x - (x_0 + \zeta)\|_A.
\]
This can be rewritten as
\[
(12.2) \qquad \|\tilde e_i\|_A = \min_{\zeta \in K_i} \|e_0 - \zeta\|_A
\]
with $\tilde e_i = x - \tilde x_i$. As $e_0$ is in $K_{l_0}$, $\tilde x_{l_0} = x$, i.e., we have the solution at the $l_0$'th step (why?).

A minimization problem over a finite dimensional space, with respect to a norm which comes from an inner product, can always be reduced to a matrix problem. We illustrate this for the above minimization problem in the following lemma.

Lemma 1. There is a unique solution $\tilde x_i = x_0 + \theta$ with $\theta \in K_i$ solving (12.1). It is characterized as the unique vector of this form satisfying
\[
(12.3) \qquad (x - \tilde x_i, \zeta)_A = 0 \quad \text{for all } \zeta \in K_i.
\]

Proof. We first show that there is a unique $\theta \in K_i$ for which (12.3) holds. Given a basis $\{v_1, \ldots, v_l\}$ for the Krylov space $K_i$, we expand
\[
\theta = \sum_{j=1}^{l} c_j v_j.
\]
The condition (12.3) is equivalent to
\[
\Bigl(x - x_0 - \sum_{j=1}^{l} c_j v_j,\; v_m\Bigr)_A = 0 \quad \text{for } m = 1, \ldots, l,
\]
or
\[
\sum_{j=1}^{l} c_j (v_j, v_m)_A = (b - A x_0, v_m) \quad \text{for } m = 1, \ldots, l.
\]
This is the same as the matrix problem $N c = F$ with $N_{j,m} = (v_m, v_j)_A$ and $F_m = (b - A x_0, v_m)$, $j, m = 1, \ldots, l$. This problem has a unique solution if $N$ is nonsingular. To check this, we let $d \in \mathbb{R}^l$ be arbitrary and compute
\[
(N d, d) = \sum_{j,k=1}^{l} N_{j,k}\, d_k d_j = \sum_{j,k=1}^{l} (v_k, v_j)_A\, d_k d_j = \Bigl(\sum_{k=1}^{l} d_k v_k, \sum_{j=1}^{l} d_j v_j\Bigr)_A = (w, w)_A,
\]
where $w = \sum_{j=1}^{l} d_j v_j$. Since $\{v_1, \ldots, v_l\}$ is a basis, $w$ is nonzero whenever $d$ is nonzero. It follows from the definiteness of the inner product that $0 \neq (w, w)_A = (N d, d)$ whenever $d \neq 0$. Thus, $0$ is the only vector for which $N d = 0$, i.e., the matrix $N$ is nonsingular and there is a unique $\theta \in K_i$ satisfying (12.3).

We next show that the unique solution of (12.3) solves the minimization problem. Indeed, if $\zeta$ is in $K_i$, then, since $\theta - \zeta \in K_i$, (12.3) gives
\[
\|x - \tilde x_i\|_A^2 = \bigl(x - \tilde x_i,\; x - [(x_0 + \zeta) + (\theta - \zeta)]\bigr)_A = \bigl(x - \tilde x_i,\; x - (x_0 + \zeta)\bigr)_A \le \|x - \tilde x_i\|_A \, \|x - (x_0 + \zeta)\|_A.
\]
It immediately follows that $\tilde x_i$ is the minimizer.

The proof will be complete once we show that any minimizer satisfies (12.3). Suppose that $\tilde x_i = x_0 + \theta$ solves the minimization problem. Given $\zeta \in K_i$, we consider, for $t \in \mathbb{R}$,
\[
f(t) = \|x - \tilde x_i + t\zeta\|_A^2 = (x - \tilde x_i + t\zeta,\; x - \tilde x_i + t\zeta)_A.
\]
By bilinearity of the inner product,
\[
f(t) = (x - \tilde x_i, x - \tilde x_i)_A + 2t\,(x - \tilde x_i, \zeta)_A + t^2 (\zeta, \zeta)_A.
\]
Since $\tilde x_i$ is the minimizer, $f'(0) = 2(x - \tilde x_i, \zeta)_A = 0$, i.e., (12.3) holds. $\square$
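To make the reduction to the matrix problem $Nc = F$ concrete, here is a minimal NumPy sketch (not part of the original notes) that computes $\tilde x_i$ directly from the Galerkin conditions of Lemma 1, using the monomial Krylov basis $\{r_0, Ar_0, \ldots, A^{i-1} r_0\}$. The function name and interface are illustrative choices, and the approach is only meant for small dense problems: the monomial basis becomes ill-conditioned quickly, and $N$ is singular once $i$ exceeds $l_0$.

```python
import numpy as np

def krylov_best_approximation(A, b, x0, i):
    """Solve (12.1) via the matrix problem N c = F from the proof of Lemma 1.

    Uses the (possibly ill-conditioned) basis v_j = A^{j-1} r_0 of K_i,
    assembles N_{j,m} = (v_m, v_j)_A and F_m = (b - A x_0, v_m), and
    returns x_tilde_i = x_0 + sum_j c_j v_j.
    """
    r0 = b - A @ x0
    V = np.empty((len(b), i))      # columns: r_0, A r_0, ..., A^{i-1} r_0
    v = r0.copy()
    for j in range(i):
        V[:, j] = v
        v = A @ v
    N = V.T @ A @ V                # Gram matrix in the A-inner product
    F = V.T @ r0                   # F_m = (b - A x_0, v_m)
    c = np.linalg.solve(N, F)
    return x0 + V @ c
```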
Theorem 1. (CG-equivalence) Let $l_0$ be as above. Then for $i = 1, 2, \ldots, l_0$, the iterate $x_i$ defined by the conjugate gradient (CG) algorithm coincides with the best approximation from the Krylov space, i.e., with $\tilde x_i$ satisfying (12.1).

The first step of the conjugate gradient method coincides with the steepest descent algorithm. It follows that the CG algorithm is not a linear iterative method (its coefficients depend nonlinearly on the current residual). The above theorem shows that the conjugate gradient algorithm provides a very efficient implementation of the Krylov minimization problem. As discussed above, at least mathematically, we get the exact solution at the $l_0$'th step ($x_{l_0} = x$). At this point the CG algorithm breaks down, since $r_{l_0} = p_{l_0} = 0$ and so the denominators are zero.

Historically, when the conjugate gradient method was discovered, it was proposed as a direct solver. Since $K_l \subseteq \mathbb{R}^n$, $l_0$ can be at most $n$, so the above theorem shows that we get convergence in at most $n$ iterations. However, it was found that, because of round-off errors, implementations of the CG method sometimes failed to converge in $n$ iterations. Nevertheless, CG is very effective when used as an iterative method on problems with reasonable condition numbers. The following theorem gives a bound for its convergence rate.

Theorem 2. (CG-error) Let $A$ be an SPD $n \times n$ matrix and $e_i$ be the sequence of errors generated by the conjugate gradient algorithm. Then
\[
\|e_i\|_A \le 2\left(\frac{\sqrt{K} - 1}{\sqrt{K} + 1}\right)^i \|e_0\|_A.
\]
Here $K$ is the spectral condition number of $A$.

Proof. Since CG is an implementation of the Krylov minimization, $\|e_i\|_A \le \|e_0 + \theta\|_A$ for any $\theta \in K_i$. In particular, the operator $G_i$ of the multi-parameter theorem of an earlier class (with $\tau_1, \ldots, \tau_i$ coming from $\lambda_0 = \lambda_1$ and $\lambda_\infty = \lambda_n$) satisfies
\[
\|G_i e_0\|_A = \Bigl\| \prod_{j=1}^{i} (I - \tau_j A)\, e_0 \Bigr\|_A = \|e_0 + \theta\|_A
\]
for some $\theta \in K_i$. The theorem follows immediately from the multi-parameter theorem. $\square$

Proof of Theorem 1 (CG-equivalence). The proof is by induction. We shall prove that for $i = 1, \ldots, l_0$:

(I.1) $\{p_0, p_1, \ldots, p_{i-1}\}$ forms an $A$-orthogonal basis for $K_i$.
(I.2) $(e_i, \theta)_A = 0$ for all $\theta \in K_i$.

By definition, $p_0 = r_0$ and forms a basis for $K_1$. Moreover, $(e_1, r_0)_A = 0$ by the definition of $\alpha_0$, so (I.1) and (I.2) hold for $i = 1$.

We inductively assume that (I.1) and (I.2) hold for $i = k$ with $k < l_0$. By the definition of $\beta_{k-1}$, $p_k$ is $A$-orthogonal to $p_{k-1}$. For $j < k - 1$,
\[
(12.4) \qquad (p_k, p_j)_A = (r_k - \beta_{k-1} p_{k-1}, p_j)_A = (e_k, A p_j)_A - \beta_{k-1} (p_{k-1}, p_j)_A.
\]
Of the two terms on the right, the first vanishes by assumption (I.2) at $k$ since $A p_j \in K_k$, while the second vanishes by assumption (I.1) at $k$. Thus, $p_k$ satisfies the desired orthogonality properties. The validity of (I.1) for $k + 1$ will follow if we show that $p_k \neq 0$. If $p_k = 0$ then $r_k = -\beta_{k-1} p_{k-1} \in K_k$ and, by (I.1) at $k$,
\[
0 = (e_k, p_{k-1})_A = (r_k, p_{k-1}) = -\beta_{k-1} (p_{k-1}, p_{k-1}).
\]
This implies that $r_k = 0$ and hence $e_k = 0$. This means that $e_0 \in K_k$, contradicting the assumption that $k < l_0$.

We next verify (I.2) at $k + 1$. As observed in the previous class, $p_k$ and $\alpha_k$ are constructed so that $e_{k+1}$ is $A$-orthogonal to both $p_k$ and $p_{k-1}$. To complete the proof, we need only check its orthogonality to $p_j$ for $j < k - 1$. In this case,
\[
(12.5) \qquad (e_{k+1}, p_j)_A = (e_k - \alpha_k p_k, p_j)_A = (e_k, p_j)_A - \alpha_k (p_k, p_j)_A.
\]
The first term is handled by the induction assumption (I.2), while the second vanishes by (I.1) for $k + 1$ (which we proved above). Thus, (I.1) and (I.2) hold for $i = 1, 2, \ldots, l_0$.

We already observed that $e_i = e_0 - \theta$ for some $\theta \in K_i$. Thus, (I.2) and Lemma 1 imply that $x_i = \tilde x_i$. $\square$
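For reference, here is a NumPy sketch of the CG iteration analyzed above. The recursion itself was stated in an earlier class; the version below is simply the $B = I$ special case of Algorithm 2 given further down, with the stopping rule, the function name `cg`, and the parameters `max_iter` and `tol` added for illustration.

```python
import numpy as np

def cg(A, b, x0, max_iter=None, tol=1e-10):
    """Conjugate gradient iteration for an SPD matrix A
    (the B = I case of Algorithm 2 below)."""
    x = x0.copy()
    r = b - A @ x                        # r_i = b - A x_i
    p = r.copy()                         # p_0 = r_0
    steps = max_iter if max_iter is not None else len(b)
    for _ in range(steps):
        Ap = A @ p
        alpha = (r @ p) / (Ap @ p)       # alpha_i = (r_i, p_i) / (A p_i, p_i)
        x = x + alpha * p
        r = r - alpha * Ap               # r_{i+1} = r_i - alpha_i A p_i
        if np.linalg.norm(r) < tol:      # in exact arithmetic r_{l_0} = 0
            break
        beta = (r @ Ap) / (Ap @ p)       # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
        p = r - beta * p                 # p_{i+1} = r_{i+1} - beta_i p_i
    return x
```

Stopping on a small residual avoids the breakdown at step $l_0$ noted above, where $r_{l_0} = p_{l_0} = 0$ would make the denominators vanish.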
Preconditioned conjugate gradient iteration.

In the development and analysis of the conjugate gradient algorithm, we see the interaction of the matrix $A$ and the inner product. In fact, if you look carefully, you see that the entire development goes through with the $\ell^2$ inner product $(\cdot, \cdot)$ replaced by any other inner product, provided that the matrix $A$ is positive definite and self-adjoint with respect to the new inner product. This observation immediately leads to a preconditioned conjugate gradient algorithm. Specifically, we assume that we are given two SPD matrices $A$ and $B$ (with $B$ a preconditioner). As usual, we consider the preconditioned system
\[
(12.6) \qquad B A x = B b
\]
and use the $B^{-1}$-inner product as our base inner product (which replaces the $\ell^2$ inner product). We now apply CG to the preconditioned system (12.6) with the $B^{-1}$-inner product to obtain:

Algorithm 1. (Preconditioned Conjugate Gradient, Version 1). Let $A$ and $B$ be SPD $n \times n$ matrices and let $x_0 \in \mathbb{R}^n$ (the initial iterate) and $b \in \mathbb{R}^n$ (the right hand side) be given. Start by setting $p_0 = r_0 = B b - B A x_0$. Then for $i = 0, 1, \ldots$, define
\[
(12.7) \qquad
\begin{aligned}
x_{i+1} &= x_i + \alpha_i p_i, && \text{where } \alpha_i = \frac{(r_i, p_i)_{B^{-1}}}{(B A p_i, p_i)_{B^{-1}}},\\
r_{i+1} &= r_i - \alpha_i B A p_i, &&\\
p_{i+1} &= r_{i+1} - \beta_i p_i, && \text{where } \beta_i = \frac{(r_{i+1}, B A p_i)_{B^{-1}}}{(B A p_i, p_i)_{B^{-1}}}.
\end{aligned}
\]

The above algorithm is not completely satisfactory. The reason is that the preconditioner may be defined by a fairly complicated process, so its inverse may not be computationally available. Except for the numerator in the definition of $\alpha_i$, each $B^{-1}$ cancels with an application of $B$. We can deal with this troublesome term by introducing an intermediate variable $\tilde r_i$ satisfying $\tilde r_i = b - A x_i$. Then $r_i = B \tilde r_i$ and the algorithm becomes:

Algorithm 2. (Preconditioned Conjugate Gradient). Let $A$ and $B$ be SPD $n \times n$ matrices and let $x_0 \in \mathbb{R}^n$ (the initial iterate) and $b \in \mathbb{R}^n$ (the right hand side) be given. Start by setting $\tilde r_0 = b - A x_0$, $p_0 = r_0 = B \tilde r_0$. Then for $i = 0, 1, \ldots$, define
\[
(12.8) \qquad
\begin{aligned}
x_{i+1} &= x_i + \alpha_i p_i, && \text{where } \alpha_i = \frac{(\tilde r_i, p_i)}{(A p_i, p_i)},\\
\tilde r_{i+1} &= \tilde r_i - \alpha_i A p_i, \qquad r_{i+1} = B \tilde r_{i+1}, &&\\
p_{i+1} &= r_{i+1} - \beta_i p_i, && \text{where } \beta_i = \frac{(r_{i+1}, A p_i)}{(A p_i, p_i)}.
\end{aligned}
\]

We note that in the above algorithm there is exactly one evaluation of $B$ and one evaluation of $A$ per iterative step after startup.
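The following NumPy sketch of Algorithm 2 passes the preconditioner as a routine `apply_B` returning $Bv$, reflecting the point above that only the action of $B$ (not of $B^{-1}$) is needed. The function name, the stopping rule based on $\|\tilde r_i\|$, and the parameters are illustrative choices, not part of the notes.

```python
import numpy as np

def pcg(A, apply_B, b, x0, max_iter=100, tol=1e-10):
    """Preconditioned conjugate gradient (a sketch of Algorithm 2).

    apply_B(v) should return B v for the SPD preconditioner B.  One
    application of A and one of B are used per step after startup.
    """
    x = x0.copy()
    rt = b - A @ x                       # r~_0 = b - A x_0
    r = apply_B(rt)                      # r_0 = B r~_0
    p = r.copy()                         # p_0 = r_0
    for _ in range(max_iter):
        Ap = A @ p
        alpha = (rt @ p) / (Ap @ p)      # alpha_i = (r~_i, p_i) / (A p_i, p_i)
        x = x + alpha * p
        rt = rt - alpha * Ap             # r~_{i+1} = r~_i - alpha_i A p_i
        if np.linalg.norm(rt) < tol:
            break
        r = apply_B(rt)                  # r_{i+1} = B r~_{i+1}
        beta = (r @ Ap) / (Ap @ p)       # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
        p = r - beta * p                 # p_{i+1} = r_{i+1} - beta_i p_i
    return x
```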
The original CG method minimized the error in the $A$-norm, which was defined using the base inner product $(\cdot, \cdot)$, i.e., $(\cdot, \cdot)_A = (A\cdot, \cdot)$. The preconditioned conjugate gradient method minimizes the error in the operator inner product defined in terms of the base inner product $(\cdot, \cdot)_{B^{-1}}$, which leads to
\[
((v, w))_{BA} = (B A v, w)_{B^{-1}} = (A v, w) = (v, w)_A.
\]
In fact, the use of the $B^{-1}$-inner product was motivated by this observation, i.e., we get the best approximation in the Krylov space in the usual $A$-inner product (see Remark 1 below).

The analysis of the preconditioned version is identical to that of the original CG algorithm, except that the minimization is over the preconditioned Krylov space, $K_i = K_i(BA, r_0)$. We again prove (I.1) and (I.2) by induction. The critical property which makes everything work is that $BA$ is self-adjoint with respect to the $A$-inner product, so (12.4) is replaced by
\[
(p_k, p_j)_A = (r_k - \beta_{k-1} p_{k-1}, p_j)_A = (e_k, B A p_j)_A - \beta_{k-1} (p_{k-1}, p_j)_A.
\]
This shows that for the preconditioned algorithm,
\[
\|x - x_i\|_A = \min_{\zeta \in K_i(BA, r_0)} \|x - (x_0 + \zeta)\|_A.
\]
Applying the multi-parameter preconditioned theorem of the last class then gives the following theorem.

Theorem 3. (PCG-error) Let $A$ and $B$ be SPD $n \times n$ matrices and $e_i$ be the sequence of errors generated by the preconditioned conjugate gradient algorithm. Then
\[
\|e_i\|_A \le 2\left(\frac{\sqrt{K} - 1}{\sqrt{K} + 1}\right)^i \|e_0\|_A.
\]
Here $K$ is the spectral condition number of $BA$.

Remark 1. Alternative preconditioned conjugate gradient algorithms can be defined. For example, one can use the $A$-inner product as the base. The Krylov space remains the same, i.e., $K_i(BA, r_0)$. The only difference is that the error minimization is with respect to the $ABA$-inner product, i.e.,
\[
\|x - x_i\|_{ABA} = \min_{\zeta \in K_i(BA, r_0)} \|x - (x_0 + \zeta)\|_{ABA}.
\]
This results in the corresponding error bound
\[
\|e_i\|_{ABA} \le 2\left(\frac{\sqrt{K} - 1}{\sqrt{K} + 1}\right)^i \|e_0\|_{ABA}.
\]
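As a small numerical illustration of Theorem 3 (not part of the original notes), the following script applies the `pcg` sketch above to a hypothetical test problem: a variably scaled 1D difference matrix with a Jacobi (diagonal) preconditioner. It compares the observed $A$-norm error reduction after $i$ steps with the bound $2\bigl((\sqrt{K}-1)/(\sqrt{K}+1)\bigr)^i$, where $K$ is the spectral condition number of $BA$. The choice of test problem and all names are assumptions made only for this demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical SPD test matrix: 1D difference matrix with rough diagonal
# scaling, so that the Jacobi preconditioner B = diag(A)^{-1} is useful.
d = rng.uniform(1.0, 100.0, n)
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D_half = np.diag(np.sqrt(d))
A = D_half @ T @ D_half

B_diag = 1.0 / np.diag(A)                      # Jacobi preconditioner
apply_B = lambda v: B_diag * v                 # only the action of B is needed

x_exact = rng.standard_normal(n)
b = A @ x_exact
x0 = np.zeros(n)

# Spectral condition number of BA (its eigenvalues are real and positive).
lam = np.sort(np.linalg.eigvals(np.diag(B_diag) @ A).real)
K = lam[-1] / lam[0]
rho = (np.sqrt(K) - 1.0) / (np.sqrt(K) + 1.0)

a_norm = lambda v: np.sqrt(v @ A @ v)
e0 = a_norm(x_exact - x0)
for i in (5, 10, 20, 40):
    xi = pcg(A, apply_B, b, x0, max_iter=i, tol=0.0)   # pcg from the sketch above
    print(i, a_norm(x_exact - xi) / e0, 2.0 * rho**i)  # observed vs. Theorem 3 bound
```

The observed ratios must stay below $2\rho^i$ by Theorem 3; the bound is often pessimistic, since it uses only the extreme eigenvalues of $BA$.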