12: CONJUGATE GRADIENT, PRECONDITIONED CONJUGATE GRADIENT

Math 639
(updated: January 2, 2012)
Below, HW indicates that the result is part of the next homework assignment.
In this class we study the conjugate gradient algorithm applied to an SPD n × n matrix A. We start by introducing the so-called Krylov subspace,

K_m = K_m(A, r_0) ≡ span{r_0, Ar_0, A^2 r_0, . . . , A^{m−1} r_0}.

It is clear that the dimension of K_m is at most m. Sometimes K_m has dimension less than m; for example, if r_0 is an eigenvector, then dim(K_m) = 1 for m = 1, 2, . . . (why?).
A simple argument by mathematical induction (HW) shows that the search direction p_i is in K_{i+1} for i = 0, 1, . . . . It follows that x_i = x_0 + θ for some θ ∈ K_i and e_i = e_0 − θ (for the same θ).
We note that the dimension of the Krylov space keeps growing until A^l r_0 ∈ K_l, in which case K_l = K_{l+1} = K_{l+2} = · · · (HW). Let l_0 be the minimal such value, i.e., A^{l_0} r_0 ∈ K_{l_0} but A^{l_0−1} r_0 ∉ K_{l_0−1}. Then there are coefficients c_0, . . . , c_{l_0−1} satisfying

c_0 r_0 + c_1 Ar_0 + · · · + c_{l_0−1} A^{l_0−1} r_0 = A^{l_0} r_0.

Since A^{l_0−1} r_0 ∉ K_{l_0−1}, c_0 ≠ 0 (why?) and so

e_0 = A^{−1} r_0 = c_0^{−1} (A^{l_0−1} r_0 − c_1 r_0 − c_2 Ar_0 − · · · − c_{l_0−1} A^{l_0−2} r_0) ∈ K_{l_0}.
Similar arguments show that l0 is, in fact, the smallest index such that
e0 ∈ Kl0 (HW).
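As a concrete illustration, one can watch this stagnation numerically. Here is a minimal NumPy sketch (the SPD test matrix and the starting vector are arbitrary choices): it builds the vectors r_0, Ar_0, A^2 r_0, . . . and tracks dim K_m, which grows by one per step until it stalls at l_0.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)          # an SPD test matrix (arbitrary choice)
    r0 = rng.standard_normal(n)          # if r0 were an eigenvector, the dimension would stay 1

    vectors = [r0]
    for m in range(1, n + 2):
        dim = np.linalg.matrix_rank(np.column_stack(vectors))
        print(f"dim K_{m} = {dim}")      # grows by one per step until m reaches l0, then stalls
        vectors.append(A @ vectors[-1])  # append A^m r0, giving generators of K_{m+1}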
It is natural to consider the best approximation x̃i over vectors of the form
x̃i = x0 + θ for θ ∈ Ki (best approximation with respect to the A-norm),
i.e.,
(12.1)    ‖x − x̃_i‖_A = min_{ζ ∈ K_i} ‖x − (x_0 + ζ)‖_A.

This can be rewritten as

(12.2)    ‖ẽ_i‖_A = min_{ζ ∈ K_i} ‖e_0 − ζ‖_A

with ẽ_i = x − x̃_i. As e_0 is in K_{l_0}, x̃_{l_0} = x, i.e., we have the solution at the l_0'th step (why?).
A minimization problem over a finite dimensional space, with respect to a norm which comes from an inner product, can always be reduced to a matrix problem. We illustrate this for the above minimization problem in the following lemma:
Lemma 1. There is a unique solution x̃_i = x_0 + θ with θ ∈ K_i solving (12.1). It is characterized as the unique vector of this form satisfying

(12.3)    (x − x̃_i, ζ)_A = 0 for all ζ ∈ K_i.
Proof. We first show that there is a unique vector θ ∈ K_i for which (12.3) holds. Given a basis {v_1, . . . , v_l} for the Krylov space K_i, we expand

θ = ∑_{j=1}^{l} c_j v_j.
The condition (12.3) is equivalent to

(x − x_0 − ∑_{j=1}^{l} c_j v_j, v_m)_A = 0 for m = 1, . . . , l

or

∑_{j=1}^{l} c_j (v_j, v_m)_A = (b − Ax_0, v_m) for m = 1, . . . , l.

This is the same as the matrix problem N c = F with

N_{j,m} = (v_m, v_j)_A and F_m = (b − Ax_0, v_m),    j, m = 1, . . . , l.
This problem has a unique solution if N is nonsingular. To check this, we let d ∈ R^l be arbitrary and compute

(N d, d) = ∑_{j,k=1}^{l} (N_{j,k} d_k) d_j = ∑_{j,k=1}^{l} (v_k, v_j)_A d_k d_j = ( ∑_{k=1}^{l} d_k v_k , ∑_{j=1}^{l} d_j v_j )_A = (w, w)_A

where w = ∑_{j=1}^{l} d_j v_j. Since {v_1, . . . , v_l} is a basis, w is nonzero whenever d is nonzero. It follows from the definiteness property of the inner product that 0 ≠ (w, w)_A = (N d, d) whenever d ≠ 0. Thus, 0 is the only vector for which N d = 0, i.e., the matrix N is nonsingular and there is a unique θ ∈ K_i satisfying (12.3).
We next show that the unique solution of (12.3) solves the minimization
problem. Indeed, if ζ is in Ki then
‖x − x̃_i‖_A^2 = (x − x̃_i, x − [(x_0 + ζ) + (θ − ζ)])_A
= (x − x̃_i, x − (x_0 + ζ))_A ≤ ‖x − x̃_i‖_A ‖x − (x_0 + ζ)‖_A,

where the middle equality used (12.3) with the test vector θ − ζ ∈ K_i. It immediately follows that x̃_i is the minimizer.
The proof will be complete once we show that any minimizer satisfies (12.3). Suppose that x̃_i = x_0 + θ solves the minimization problem. Given ζ ∈ K_i, we consider, for t ∈ R,

f(t) = ‖x − x̃_i + tζ‖_A^2 = (x − x̃_i + tζ, x − x̃_i + tζ)_A.

By the bilinearity of the inner product,

f(t) = (x − x̃_i, x − x̃_i)_A + 2t(x − x̃_i, ζ)_A + t^2 (ζ, ζ)_A.

Since x̃_i is the minimizer, f′(0) = 2(x − x̃_i, ζ)_A = 0, i.e., (12.3) holds.
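A minimal numerical check of Lemma 1 (the matrix, right hand side, and Krylov index below are arbitrary choices): assemble N and F over a basis of K_i, solve N c = F, and verify that the error x − x̃_i is A-orthogonal to K_i as in (12.3).

    import numpy as np

    rng = np.random.default_rng(1)
    n, i = 8, 3
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # SPD test matrix
    b = rng.standard_normal(n)
    x = np.linalg.solve(A, b)                    # exact solution, used only for checking
    x0 = np.zeros(n)
    r0 = b - A @ x0

    # Columns span K_i = span{r0, A r0, ..., A^{i-1} r0}; column scaling does not change the span.
    V = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(i)])
    V /= np.linalg.norm(V, axis=0)

    # The matrix problem N c = F of the lemma, with N the Gram matrix in the A-inner product.
    N = V.T @ A @ V
    F = V.T @ (b - A @ x0)
    c = np.linalg.solve(N, F)
    x_tilde = x0 + V @ c

    print(V.T @ (A @ (x - x_tilde)))             # (12.3): should vanish up to round-off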
Theorem 1. (CG-equivalence) Let l0 be as above. Then for i = 1, 2, . . . , l0 ,
xi defined by the conjugate gradient (CG) algorithm coincides with the best
approximation from the Krylov space, x̃i satisfying (12.1).
The first step in the conjugate gradient method coincides with the steepest
descent algorithm. It follows that the CG algorithm is not a linear iterative
method.
The above theorem shows that the conjugate gradient algorithm provides
a very efficient implementation of the Krylov minimization problem. As
discussed above, at least mathematically, we get the exact solution on the
l_0'th step (x_{l_0} = x). At this point the CG algorithm breaks down, as r_{l_0} = p_{l_0} = 0 and so the denominators are zero.
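For concreteness, here is a minimal sketch of the CG recursion discussed above, written as the B = I case of Algorithm 2 below (the stopping tolerance and the early exits are practical additions, not part of the derivation):

    import numpy as np

    def cg(A, b, x0, max_iter=None, tol=1e-12):
        """Conjugate gradient for SPD A; the B = I case of Algorithm 2 below."""
        x = x0.copy()
        r = b - A @ x                      # residual r_i = b - A x_i
        p = r.copy()                       # p_0 = r_0
        for _ in range(max_iter or len(b)):
            Ap = A @ p
            denom = p @ Ap
            if denom == 0.0:               # r_i = p_i = 0: the exact solution was reached
                break
            alpha = (r @ p) / denom        # alpha_i = (r_i, p_i) / (A p_i, p_i)
            x = x + alpha * p
            r = r - alpha * Ap
            if np.linalg.norm(r) < tol:    # practical stopping test
                break
            beta = (r @ Ap) / denom        # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
            p = r - beta * p               # p_{i+1} = r_{i+1} - beta_i p_i
        return x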
Historically, when the conjugate gradient method was discovered, it was
proposed as a direct solver. Since K_l ⊆ R^n, l_0 can be at most n. The above
theorem shows that we get convergence in at most n iterations. However,
it was found that, because of round-off errors, implementations of the CG
method sometimes failed to converge in n iterations. Nevertheless, CG is
very effective when used as an iterative method on problems with reasonable
condition numbers. The following theorem gives a bound for its convergence
rate.
Theorem 2. (CG-error) Let A be an SPD n × n matrix and e_i be the sequence of errors generated by the conjugate gradient algorithm. Then

‖e_i‖_A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e_0‖_A.
Here K is the spectral condition number of A.
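For instance, if K = 100 then (√K − 1)/(√K + 1) = 9/11 ≈ 0.82, so the bound contracts by roughly a factor of 0.82 per iteration and falls below 10^{−6} ‖e_0‖_A after about 73 iterations; note that the rate depends on √K rather than on K itself.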
Proof. Since CG is an implementation of the Krylov minimization, ‖e_i‖_A ≤ ‖e_0 + θ‖_A for any θ ∈ K_i. In particular, G_i of the multi-parameter theorem of an earlier class (with τ_1, . . . , τ_i coming from λ_0 = λ_1 and λ_∞ = λ_n) satisfies

‖G_i e_0‖_A = ‖∏_{j=1}^{i} (I − τ_j A) e_0‖_A = ‖e_0 + θ‖_A
for some θ ∈ Ki . The theorem follows immediately from the multi-parameter
theorem.
Proof of Theorem (CG-equivalence). The proof is by induction. We shall
prove that for i = 1, . . . , l0 ,
(I.1) {p0 , p1 , . . . , pi−1 } forms an A-orthogonal basis for Ki .
(I.2) (ei , θ)A = 0 for all θ ∈ Ki .
By definition, p_0 = r_0, which forms a basis for K_1. Moreover, (e_1, r_0)_A = 0 by the definition of α_0, so (I.1) and (I.2) hold for i = 1.
We inductively assume that (I.1) and (I.2) hold for i = k with k < l0 . By
the definition of β_{k−1}, p_k is A-orthogonal to p_{k−1}. For j < k − 1,

(12.4)    (p_k, p_j)_A = (r_k − β_{k−1} p_{k−1}, p_j)_A = (e_k, Ap_j)_A − β_{k−1} (p_{k−1}, p_j)_A.
Of the two terms on the right, the first vanishes by Assumption (I.2) with k
since Apj ∈ Kk while the second vanishes by assumption (I.1) with k. Thus,
pk satisfies the desired orthogonality properties.
The validity of (I.1) for k + 1 will follow if we show that p_k ≠ 0. If p_k = 0 then r_k = β_{k−1} p_{k−1} ∈ K_k and by (I.1) at k,

0 = (e_k, p_{k−1})_A = (r_k, p_{k−1}) = β_{k−1} (p_{k−1}, p_{k−1}).

This implies that r_k = 0 and hence e_k = 0. This means that e_0 ∈ K_k, contradicting the assumption that k < l_0.
We next verify (I.2) at k + 1. As observed in the previous class, pk and
αk are constructed so that ek+1 is A-orthogonal to both pk and pk−1 . To
complete the proof, we need only check its orthogonality to pj for j < k − 1.
In this case,

(12.5)    (e_{k+1}, p_j)_A = (e_k − α_k p_k, p_j)_A = (e_k, p_j)_A − α_k (p_k, p_j)_A.

The first term vanishes by the induction assumption (I.2) since p_j ∈ K_k, while the second vanishes by (I.1) for k + 1 (which we proved above).
Thus, (I.1) and (I.2) hold for i = 1, 2, . . . , l_0. We already observed that e_i = e_0 − θ for some θ ∈ K_i. Thus, (I.2) and Lemma 1 imply that x_i = x̃_i.
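The theorem can also be checked numerically: using the cg sketch given earlier and the Gram matrix construction from the proof of Lemma 1, the i'th CG iterate and the best A-norm approximation from x_0 + K_i agree up to round-off (the matrix and the index i below are arbitrary choices).

    import numpy as np

    rng = np.random.default_rng(2)
    n, i = 10, 4
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # SPD test matrix
    b = rng.standard_normal(n)
    x0 = np.zeros(n)
    r0 = b - A @ x0

    # Best approximation from x0 + K_i via the matrix problem N c = F of Lemma 1.
    V = np.column_stack([np.linalg.matrix_power(A, j) @ r0 for j in range(i)])
    V /= np.linalg.norm(V, axis=0)
    x_best = x0 + V @ np.linalg.solve(V.T @ A @ V, V.T @ (b - A @ x0))

    # i steps of the cg sketch defined earlier.
    x_cg = cg(A, b, x0, max_iter=i)

    print(np.linalg.norm(x_cg - x_best))         # should be at round-off level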
Preconditioned conjugate gradient iteration. In the development
and analysis of the conjugate gradient algorithm, we see the interaction of
the matrix A and the inner product. In fact, if you look carefully, you see that the entire development goes through with the ℓ2 inner product (·, ·) replaced by any other inner product, provided that the matrix A is positive definite and self adjoint with respect to the new inner product. This observation
immediately leads to a preconditioned conjugate gradient algorithm. Specifically, we assume that we are given two SPD matrices A and B (with B a
preconditioner). As usual, we consider the preconditioned system
(12.6)    BAx = Bb

and use the B^{−1}-inner product as our base inner product (which replaces the ℓ2 inner product). We now apply CG to the preconditioned system (12.6) with the B^{−1}-inner product to obtain:
Algorithm 1. (Preconditioned Conjugate Gradient, Version 1). Let A and
B be SPD n × n matrices and x0 ∈ Rn (the initial iterate) and b ∈ Rn (the
right hand side) be given. Start by setting p0 = r0 = Bb − BAx0 . Then for
i = 0, 1, . . ., define
(12.7)
x_{i+1} = x_i + α_i p_i,        where α_i = (r_i, p_i)_{B^{−1}} / (BAp_i, p_i)_{B^{−1}},
r_{i+1} = r_i − α_i BAp_i,
p_{i+1} = r_{i+1} − β_i p_i,    where β_i = (r_{i+1}, BAp_i)_{B^{−1}} / (BAp_i, p_i)_{B^{−1}}.
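For later reference, note that by the symmetry of the B^{−1}-inner product, (BAp_i, p_i)_{B^{−1}} = (Ap_i, p_i) and (r_{i+1}, BAp_i)_{B^{−1}} = (r_{i+1}, Ap_i), so B^{−1} disappears from these terms; only the numerator (r_i, p_i)_{B^{−1}} = (B^{−1} r_i, p_i) genuinely involves B^{−1}.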
The above algorithm is not completely satisfactory. The reason is that the preconditioner may be defined by a fairly complicated process, so its inverse may not be computationally available. Except for the numerator in the definition of α_i, each B^{−1} cancels with an application of B. We can deal with this troublesome term by introducing an intermediate variable r̃_i satisfying r̃_i = b − Ax_i. Then r_i = Br̃_i and the algorithm becomes:
Algorithm 2. (Preconditioned Conjugate Gradient). Let A and B be SPD
n × n matrices and x0 ∈ Rn (the initial iterate) and b ∈ Rn (the right hand
side) be given. Start by setting r̃0 = b − Ax0 , p0 = r0 = Br̃0 . Then for
i = 0, 1, . . ., define
(12.8)
x_{i+1} = x_i + α_i p_i,        where α_i = (r̃_i, p_i) / (Ap_i, p_i),
r̃_{i+1} = r̃_i − α_i Ap_i,
r_{i+1} = Br̃_{i+1},
p_{i+1} = r_{i+1} − β_i p_i,    where β_i = (r_{i+1}, Ap_i) / (Ap_i, p_i).
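A minimal NumPy transcription of Algorithm 2 (illustration only: the preconditioner is passed as a function apply_B that applies B to a vector, and the stopping test is an extra practical safeguard):

    import numpy as np

    def pcg(A, apply_B, b, x0, max_iter=None, tol=1e-12):
        """Preconditioned CG (Algorithm 2) for SPD A and SPD preconditioner B."""
        x = x0.copy()
        rt = b - A @ x                     # r~_i = b - A x_i
        r = apply_B(rt)                    # r_i = B r~_i
        p = r.copy()                       # p_0 = r_0
        for _ in range(max_iter or len(b)):
            Ap = A @ p
            denom = p @ Ap
            if denom == 0.0:               # breakdown: the exact solution was reached
                break
            alpha = (rt @ p) / denom       # alpha_i = (r~_i, p_i) / (A p_i, p_i)
            x = x + alpha * p
            rt = rt - alpha * Ap           # r~_{i+1} = r~_i - alpha_i A p_i
            if np.linalg.norm(rt) < tol:   # practical stopping test
                break
            r = apply_B(rt)                # r_{i+1} = B r~_{i+1}
            beta = (r @ Ap) / denom        # beta_i = (r_{i+1}, A p_i) / (A p_i, p_i)
            p = r - beta * p               # p_{i+1} = r_{i+1} - beta_i p_i
        return x

With apply_B = lambda v: v (i.e., B = I), this reduces to the plain CG sketch given earlier.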
We note that in the above algorithm, there is exactly one evaluation of
B and one evaluation of A per iterative step after startup. The original
CG method minimized the error in the A-norm, which was defined using the base inner product (·, ·), i.e., (·, ·)_A = (A·, ·). The preconditioned conjugate gradient method minimizes the error in the operator inner product defined in terms of the base inner product (·, ·)_{B^{−1}}, which leads to

((v, w))_{BA} = (BAv, w)_{B^{−1}} = (Av, w) = (v, w)_A.
In fact, the use of the B^{−1}-inner product was motivated by this observation,
i.e., we get the best approximation in the Krylov space in the usual A-inner
product (See the remark below).
The analysis of the preconditioned version is identical to the original CG
algorithm except that the minimization is over the preconditioned Krylov space, K_i = K_i(BA, r_0).
We again prove (I.1) and (I.2) by induction. The critical property which
makes everything work is that BA is self adjoint with respect to the A inner
product so (12.4) is replaced by
(p_k, p_j)_A = (r_k − β_{k−1} p_{k−1}, p_j)_A = (e_k, BAp_j)_A − β_{k−1} (p_{k−1}, p_j)_A.
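Indeed, (BAv, w)_A = (ABAv, w) = (BAv, Aw) = (Av, BAw) = (v, BAw)_A, where the middle equalities use the symmetry of A and B, respectively.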
This shows that for the preconditioned algorithm,

‖x − x_i‖_A = min_{ζ ∈ K_i(BA, r_0)} ‖x − (x_0 − ζ)‖_A.
Applying the multi-parameter preconditioned theorem of last class then gives
the following theorem.
Theorem 3. (PCG-error) Let A and B be SPD n × n matrices and e_i be the sequence of errors generated by the preconditioned conjugate gradient algorithm. Then

‖e_i‖_A ≤ 2 ((√K − 1)/(√K + 1))^i ‖e_0‖_A.
Here K is the spectral condition number of BA.
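To see the effect of the preconditioner in this bound, one can compare K(A) with K(BA) for a simple diagonal (Jacobi) preconditioner; the badly scaled SPD test matrix below is an arbitrary illustrative choice.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 200
    M = rng.standard_normal((n, n))
    C = M @ M.T / n + np.eye(n)                        # well-conditioned SPD core
    d = np.logspace(0, 4, n)                           # badly scaled weights, 1 up to 1e4
    A = np.sqrt(d)[:, None] * C * np.sqrt(d)[None, :]  # SPD with a large condition number
    B = np.diag(1.0 / np.diag(A))                      # Jacobi (diagonal) preconditioner

    def spectral_condition(S):
        ev = np.linalg.eigvals(S).real                 # BA has real positive eigenvalues
        return ev.max() / ev.min()

    for name, S in [("A", A), ("BA", B @ A)]:
        K = spectral_condition(S)
        rho = (np.sqrt(K) - 1.0) / (np.sqrt(K) + 1.0)
        print(f"{name}: K = {K:.2e}, (sqrt(K)-1)/(sqrt(K)+1) = {rho:.4f}")

For this example K(BA) should be smaller than K(A) by several orders of magnitude, and the bound of the theorem then predicts a correspondingly faster reduction of the A-norm error.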
Remark 1. Alternative preconditioned conjugate gradient algorithms can
be defined. For example, one can use the A-inner product as a base. The
Krylov space remains the same, i.e., K_l(BA, r_0). The only difference is that the error minimization is with respect to the ABA-inner product, i.e.,

‖x − x_i‖_{ABA} = min_{ζ ∈ K_i(BA, r_0)} ‖x − (x_0 − ζ)‖_{ABA}.

This results in the corresponding error bound

‖e_i‖_{ABA} ≤ 2 ((√K − 1)/(√K + 1))^i ‖e_0‖_{ABA}.