Class Notes 11: THE CONJUGATE GRADIENT METHOD

Math 639d
Due Date: Nov. 13
(updated: November 6, 2014)
In this set of notes, A will be a symmetric and positive definite real
n × n matrix. Everything that is done here also works for positive definite
Hermitian complex n × n matrices.
The steepest descent method makes the error ei+1 A-orthogonal to ri. Unfortunately, if this step results in little change, then ri+1 lies in almost the same direction as ri, so the step to ei+2 is not very effective either, since ei+1 is already A-orthogonal to ri and almost A-orthogonal to ri+1. This is a shortcoming of the steepest descent method.
The fix is simple. We generalize the algorithm and let pi be the direction
which we use to compute ei+1 . We make ei+1 A-orthogonal to pi . The idea
is to preserve this orthogonality when going to ei+2. Since ei+1 is already A-orthogonal to pi, ei+2 will remain A-orthogonal to pi only if our new search
direction pi+1 is also A-orthogonal to pi . Thus, instead of using ri+1 as our
next search direction, we use the component of ri+1 which is A-orthogonal
to pi , i.e.,
pi+1 = ri+1 − βi pi
where βi is chosen so that
(pi+1 , pi )A = 0.
A simple computation gives
βi = (ri+1, pi)A / (pi, pi)A.
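For completeness, the computation is short; written out in LaTeX (using only the bilinearity of (·, ·)A and the defining condition above) it reads:

0 = (p_{i+1}, p_i)_A = (r_{i+1} - \beta_i p_i,\, p_i)_A
  = (r_{i+1}, p_i)_A - \beta_i (p_i, p_i)_A
\qquad\Longrightarrow\qquad
\beta_i = \frac{(r_{i+1}, p_i)_A}{(p_i, p_i)_A}.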
We continue by making ei+2 A-orthogonal to pi+1, i.e.,
xi+2 = xi+1 + αi+1 pi+1
with αi+1 satisfying
(ei+1 − αi+1 pi+1, pi+1)A = 0, or αi+1 = (ri+1, pi+1) / (Api+1, pi+1),
where we used (ei+1, pi+1)A = (Aei+1, pi+1) = (ri+1, pi+1).
As both ei+1 and pi+1 are A-orthogonal to pi and ei+2 = ei+1 − αi+1 pi+1 ,
ei+2 is A-orthogonal to both pi and pi+1 . The above discussion leads to the
following algorithm.
Algorithm 1. (Conjugate Gradient). Let A be a SPD n × n real matrix
and x0 ∈ Rn (the initial iterate) and b ∈ Rn (the right hand side) be given.
Start by setting p0 = r0 = b − Ax0 . Then for i = 0, 1, . . ., define
(11.1)
xi+1 = xi + αi pi,     where αi = (ri, pi) / (Api, pi),
ri+1 = ri − αi Api,
pi+1 = ri+1 − βi pi,   where βi = (ri+1, Api) / (Api, pi).
Remarks:
• The above algorithm works for complex valued Hermitian positive definite matrices A provided that one uses the sesquilinear inner product (u, v) := u · v̄ with v̄ denoting the complex conjugate of v (see the sketch following these remarks).
• Notice that we have moved the matrix-vector evaluation in the above
inner products so that it is clear that only one matrix-vector evaluation, namely Api , is required per iterative step after start up.
• The first step in the conjugate gradient method coincides with the
steepest descent algorithm. It follows that the CG algorithm is not
a linear iterative method.
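As a small illustration of the first remark, the following NumPy sketch defines such a sesquilinear inner product; the routine name ip is an illustrative choice, not part of the notes. With this inner product, the same CG loop applies to a Hermitian positive definite A.

import numpy as np

def ip(u, v):
    # Sesquilinear inner product (u, v) := u · conj(v) from the first remark.
    # np.vdot conjugates its *first* argument, so v is passed first.
    return np.vdot(v, u)

# For a Hermitian positive definite A and u != 0, ip(A @ u, u) is real and positive.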
We illustrate pseudo code for the conjugate gradient algorithm below. We have implicitly assumed that A(X) is a routine which returns the result of A applied to X and IP(·, ·) returns the result of the inner product. Here k is the number of iterations, X is x0 on input and X is xk on return.
FUNCTION CG(X, B, A, k, IP)
  R = P = B − A(X);
  FOR j = 1, 2, . . . , k DO {
    AP = A(P); al = IP(R, P)/IP(P, AP);
    X = X + al ∗ P; R = R − al ∗ AP;
    be = IP(R, AP)/IP(P, AP);
    P = R − be ∗ P;
  }
  RETURN
END
The above code is somewhat terse but was included to illustrate the fact
that one can implement CG with exactly 3 extra vectors, R, P , and AP .
An actual code would include extra checks for consistency and convergence.
For example, a tolerance might be passed and the residual might be tested
against it, causing the routine to return when the desired tolerance was achieved. Also, for consistency, one would check that (Ap, p) > 0, for when this is negative or zero, it is a sure sign that either the matrix is no good (not SPD) or that you have iterated to convergence (we shall see below that the solution has been reached when (Ap, p) = 0).
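To make this concrete, here is a minimal NumPy sketch along the lines of the pseudo code above, with the extra checks just described (a residual tolerance and a guard on (Ap, p)); the names cg, tol and maxit are illustrative choices, not part of the notes.

import numpy as np

def cg(A, b, x0, maxit=100, tol=1e-10):
    # Conjugate gradient following Algorithm 1; uses three extra vectors r, p, ap.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    for _ in range(maxit):
        ap = A @ p
        pap = np.dot(p, ap)
        if pap <= 0.0:
            # Either A is not SPD or we have already converged (see the remark above).
            break
        al = np.dot(r, p) / pap          # alpha_i = (r_i, p_i) / (A p_i, p_i)
        x = x + al * p
        r = r - al * ap
        if np.linalg.norm(r) <= tol:
            break
        be = np.dot(r, ap) / pap         # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
        p = r - be * p
    return x

For example, cg(np.array([[4.0, 1.0], [1.0, 3.0]]), np.array([1.0, 2.0]), np.zeros(2)) returns the solution of the corresponding 2 × 2 system.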
Below, HW indicates that the result is part of the next homework assignment.
To analyze the conjugate gradient method, we start by introducing the
so-called Krylov subspace,
Km := Km(A, r0) := span{r0, Ar0, A^2 r0, . . . , A^{m−1} r0}.
It is clear that the dimension of Km is at most m. Sometimes, Km has dimension less than m; for example, if r0 is an eigenvector of A with eigenvalue λ, then dim(Km) = 1 for m = 1, 2, . . ., since A^j r0 = λ^j r0.
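The following small NumPy sketch illustrates this by building r0, Ar0, . . . , A^{m−1} r0 and reporting the dimension of their span via a rank computation; the diagonal test matrix is an arbitrary illustrative choice.

import numpy as np

def krylov_dim(A, r0, m):
    # Dimension of K_m(A, r0) = span{r0, A r0, ..., A^{m-1} r0}.
    vectors = [r0]
    for _ in range(m - 1):
        vectors.append(A @ vectors[-1])
    return np.linalg.matrix_rank(np.column_stack(vectors))

A = np.diag([1.0, 2.0, 3.0])
print(krylov_dim(A, np.array([1.0, 0.0, 0.0]), 3))  # 1: r0 is an eigenvector
print(krylov_dim(A, np.ones(3), 3))                 # 3 for this generic r0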
A simple argument by mathematical induction (HW) shows that the
search direction pi is in Ki+1 , for i = 0, 1, . . .. It follows that xi = x0 + θ for
some θ ∈ Ki and ei = e0 − θ (for the same θ).
We note that the dimension of the Krylov space keeps growing until A^l r0 ∈ Kl, in which case Kl = Kl+1 = Kl+2 = · · · (HW). Let l0 be the minimal such value, i.e., A^{l0} r0 ∈ K_{l0} but A^{l0−1} r0 ∉ K_{l0−1}. Then there are coefficients c0, . . . , c_{l0−1} satisfying
(11.2)
c0 r0 + c1 Ar0 + · · · + c_{l0−1} A^{l0−1} r0 = A^{l0} r0.
If c0 = 0, multiplying (11.2) by A^{−1} would imply that A^{l0−1} r0 ∈ K_{l0−1}. As we have assumed that this is not the case, we must have c0 ≠ 0 and so
e0 = A^{−1} r0 = c0^{−1} (A^{l0−1} r0 − c1 r0 − c2 Ar0 − · · · − c_{l0−1} A^{l0−2} r0) ∈ K_{l0}.
Similar arguments show that l0 is, in fact, the smallest index such that e0 ∈ K_{l0} (HW).
It is natural to consider the best approximation with respect to the A-norm to x over vectors of the form x̃i = x0 + θ for θ ∈ Ki, i.e.,
(11.3)
‖x − x̃i‖A := ‖x − (x0 + θ)‖A := min_{ζ∈Ki} ‖x − (x0 + ζ)‖A.
This can be rewritten as
(11.4)
‖ẽi‖A = min_{ζ∈Ki} ‖e0 − ζ‖A
with ẽi = x − x̃i. When i = l0, we may take ζ = e0 ∈ K_{l0} to conclude that ẽ_{l0} = 0 and hence x_{l0} = x, i.e., we have the solution at the l0'th step.
The minimization problem over a finite dimensional space with respect to a norm which comes from an inner product can always be reduced to a matrix problem. We illustrate this for the above minimization problem in the following lemma:
Lemma 1. There is a unique solution θ ∈ Ki solving the minimization problem (11.3). It is characterized as the unique element of Ki satisfying
(11.5)
(x − x0 − θ, ζ)A = (x − x̃i , ζ)A = 0 for all ζ ∈ Ki .
Proof. We first show that there is a unique function θ ∈ Ki for which (11.5)
holds. Given a basis {v1 , . . . , vl } for the Krylov space Ki , we expand
θ=
l
X
cj vj .
j=1
The condition (11.5) is equivalent to
(x − x0 − Σ_{j=1}^{l} cj vj, vm)A = 0 for m = 1, . . . , l
or, using the bilinearity of (·, ·)A ,
Σ_{j=1}^{l} cj (vj, vm)A = (b − Ax0, vm) for m = 1, . . . , l.
This is the same as the matrix problem N c = F with N ∈ R^{l×l} and F, c ∈ R^l and
Nm,j = (vj, vm)A and Fm = (b − Ax0, vm), for j, m = 1, . . . , l.
This problem has a unique solution if N is nonsingular. To check this, we
let d ∈ Rl be arbitrary and compute
(N d, d) = Σ_{j,k=1}^{l} Nj,k dk dj = Σ_{j,k=1}^{l} (vk, vj)A dk dj = (Σ_{k=1}^{l} dk vk, Σ_{j=1}^{l} dj vj)A = (w, w)A
where w = Σ_{j=1}^{l} dj vj. Since {v1, . . . , vl} is a basis, w is nonzero whenever
d is nonzero. It follows from the definiteness property of the inner product
that 0 ≠ (w, w)A = (N d, d) whenever d ≠ 0. Thus, 0 is the only vector for
which N d = 0, i.e., the matrix N is nonsingular and there is a unique θ ∈ Ki
satisfying (11.5).
We next show that the unique solution θ of (11.5) is a solution to the
minimization problem. Indeed, if ζ is in Ki then
‖x − x̃i‖²A = (x − x̃i, x − [(x0 + ζ) + (θ − ζ)])A = (x − x̃i, x − (x0 + ζ))A ≤ ‖x − x̃i‖A ‖x − (x0 + ζ)‖A,
where the second equality follows from (11.5) (note that θ − ζ ∈ Ki) and the inequality is Cauchy-Schwarz. It follows that
‖x − x̃i‖A ≤ ‖x − (x0 + ζ)‖A.
As this holds for all ζ in Ki, θ is a minimizer.
The proof will be complete once we show that anything which is a minimizer satisfies (11.5) and hence (11.5) gives rise to the unique minimizer.
Suppose that θ solves the minimization problem and x̃i = x0 + θ. Given
ζ ∈ Ki , we consider for t ∈ R,
f(t) = ‖x − x̃i + tζ‖²A = (x − x̃i + tζ, x − x̃i + tζ)A.
By using bilinearity of the inner product, we see that
f(t) = (x − x̃i, x − x̃i)A + 2t(x − x̃i, ζ)A + t²(ζ, ζ)A.
Since x̃i is the minimizer, f′(0) = 2(x − x̃i, ζ)A = 0, i.e., (11.5) holds. □
The proof of the above lemma shows one way of solving the minimization
problem, i.e., constructing the matrix N , solving N c = F and setting
x̃i = x0 + Σ_{j=1}^{l} cj vj.
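As a hypothetical illustration of this route, the NumPy sketch below builds a Krylov basis, assembles N and F as in the lemma, solves N c = F and forms x̃i; the helper name krylov_minimizer and the test matrix are illustrative only. At i = l0 the result should agree (up to round-off) with the exact solution, and the theorem below asserts that it agrees with the i-th CG iterate as well.

import numpy as np

def krylov_minimizer(A, b, x0, i):
    # Minimize ||x - (x0 + theta)||_A over theta in K_i via the matrix problem N c = F.
    r0 = b - A @ x0
    vectors = [r0]
    for _ in range(i - 1):
        vectors.append(A @ vectors[-1])
    V = np.column_stack(vectors)      # columns: a basis of K_i (assumed linearly independent)
    N = V.T @ (A @ V)                 # N[m, j] = (v_j, v_m)_A
    F = V.T @ r0                      # F[m] = (b - A x0, v_m)
    c = np.linalg.solve(N, F)
    return x0 + V @ c

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x0 = np.zeros(3)
print(np.allclose(krylov_minimizer(A, b, x0, 3), np.linalg.solve(A, b)))  # True: here l0 = 3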
The following theorem shows that the conjugate gradient algorithm provides
a much more elegant solution.
Theorem 1. (CG-equivalence) Let l0 be as above. Then for i = 1, 2, . . . , l0 ,
xi defined by the conjugate gradient (CG) algorithm coincides with the solution x̃i of the minimization problem (11.3).
The above theorem shows that the conjugate gradient algorithm provides
a very efficient implementation of the Krylov minimization problem (11.3).
As discussed above, at least mathematically, we get the exact solution on
the l0'th step (x_{l0} = x). At this point the CG algorithm breaks down as r_{l0} = p_{l0} = 0 and so the denominators are zero.
Historically, when the conjugate gradient method was discovered, it was
proposed as a direct solver. Since Kl ⊆ R^n, l0 can be at most n. The above theorem shows that we get convergence in at most l0 iterations. However, it was found that, because of round-off errors, implementations of the CG method sometimes failed to converge in n iterations. Nevertheless, CG is
very effective when used as an iterative method on problems with reasonable
condition numbers. The following theorem gives a bound for its convergence
rate.
Theorem 2. (CG-error) Let A be a SPD n×n matrix and ei be the sequence
of errors generated by the conjugate gradient algorithm. Then,
‖ei‖A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e0‖A.
Here K is the spectral condition number of A.
Proof. Consider Gi of the multi-parameter Theorem 2 of Class Notes 10 with
τ1 , . . . , τi coming from λ0 = λ1 and λ∞ = λn . Then,
Gi e0 = ∏_{j=1}^{i} (I − τj A) e0.
Expanding the right hand side and using Ae0 = r0 , we see that Gi e0 =
x − x0 − ζ for some ζ ∈ Ki . Since CG is an implementation of the Krylov
minimization (11.3),
‖ei‖A ≤ ‖Gi e0‖A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e0‖A. □
The above theorem guarantees an accelerated convergence rate (O(√K) iterations for a fixed error reduction) when compared to Gauss-Seidel or Jacobi iterations. In addition, the method does not require bounds on the spectrum.
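To get a feel for the bound in Theorem 2, the following sketch evaluates the reduction factor (√K − 1)/(√K + 1) for a few condition numbers and the number of iterations the bound needs for a fixed error reduction of 10^−6; the tolerance and condition numbers are arbitrary illustrative choices, and the √K growth of the iteration count is the point.

import math

for K in [10.0, 1.0e2, 1.0e4]:
    rho = (math.sqrt(K) - 1.0) / (math.sqrt(K) + 1.0)
    # smallest i with 2 * rho**i <= 1e-6, i.e. the bound guarantees a 1e-6 reduction
    i = math.ceil(math.log(1e-6 / 2.0) / math.log(rho))
    print(K, rho, i)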
Proof of Theorem (CG-equivalence). The proof is by induction. We shall
prove that for i = 1, . . . , l0 ,
(I.1) {p0 , p1 , . . . , pi−1 } forms an A-orthogonal basis for Ki .
(I.2) (ei , θ)A = 0 for all θ ∈ Ki .
By definition, p0 = r0, which forms a basis for K1. Moreover, (e1, r0)A = 0 by the definition of α0, so (I.1) and (I.2) hold for i = 1.
We inductively assume that (I.1) and (I.2) hold for i = k with k < l0 . By
the definition of βk−1 , pk is A-orthogonal to pk−1 . For j < k − 1,
(11.6)
(pk , pj )A = (rk − βk−1 pk−1 , pj )A = (ek , Apj )A − βk−1 (pk−1 , pj )A .
Of the two terms on the right, the first vanishes by Assumption (I.2) with k
since Apj ∈ Kk while the second vanishes by assumption (I.1) with k. Thus,
pk satisfies the desired orthogonality properties.
The validity of (I.1) for k + 1 will follow if we show that pk 6= 0. If pk = 0
then rk = βk−1 pk−1 ∈ Kk and, by (I.1) and (I.2) at k,
0 = (ek, pk−1)A = (rk, pk−1) = βk−1 (pk−1, pk−1).
This implies that βk−1 = 0, so rk = 0 and hence ek = 0. This means that e0 ∈ Kk
contradicting the assumption that k < l0 .
We next verify (I.2) at k+1. As observed earlier, pk and αk are constructed
so that ek+1 is A-orthogonal to both pk and pk−1 . To complete the proof,
we need only check its orthogonality to pj for j < k − 1. In this case,
(11.7)
(ek+1 , pj )A = (ek − αk pk , pj )A = (ek , pj )A − αk (pk , pj )A .
The first term is zero by the induction assumption (I.2) while the second
vanishes by (I.1) for k + 1 (which we proved above).
Thus, (I.1) and (I.2) hold for i = 1, 2, . . . , l0. We already observed that ei = e0 − θ for some θ ∈ Ki. Thus, (I.2) and Lemma 1 imply that xi = x̃i. □