11: STEEPEST DESCENT AND CONJUGATE GRADIENT

Math 639
(updated: January 2, 2012)
Suppose that A is a SPD n × n real matrix and, as usual, we consider iteratively solving $Ax = b$. By now, you should understand that the goal of any iterative method is to drive down (a norm of) the error $e_i = x - x_i$ as rapidly as possible. We consider a method of the following form:
$$x_{i+1} = x_i + \alpha_i p_i.$$
Here $p_i \in \mathbb{R}^n$ is a “search direction” while $\alpha_i$ is a real number which we are free to choose. It is immediate (subtract this equation from $x = x$) that
(11.1) $\quad e_{i+1} = e_i - \alpha_i p_i.$
We denote the “residual” by $r_i = b - Ax_i$. A simple manipulation shows that
$$r_i = Ae_i \quad\text{and}\quad r_{i+1} = r_i - \alpha_i A p_i.$$
The first method that we shall develop takes $p_i$ to be the residual $r_i$. The idea is then to try to find the best possible choice for $\alpha_i$. Ideally, we should choose $\alpha_i$ so that it results in the maximum error reduction, i.e., $\|e_{i+1}\|$ should be as small as possible. For arbitrary norms, this goal is not a viable one. The problem is that we cannot assume that we know $e_i$ at any step of the iteration. Indeed, since we always have $x_i$ available, knowing $e_i$ is tantamount to knowing the solution since $x = x_i + e_i$.
It is instructive to see what happens with the wrong choice of norm. Suppose that we attempt to choose $\alpha_i$ so that $\|e_{i+1}\|_{\ell^2}$ is minimal, i.e.,
$$\|e_{i+1}\|_{\ell^2} = \min_{\alpha \in \mathbb{R}} \|e_i - \alpha r_i\|_{\ell^2}.$$
The above problem can be solved geometrically and its solution is illustrated in Figure 1. Clearly, $\alpha_i$ should be chosen so that the error $e_{i+1}$ is orthogonal to $r_i$, i.e.,
$$(e_{i+1}, r_i) = 0.$$
Here $(v, w) \equiv v \cdot w$ denotes the dot inner product on $\mathbb{R}^n$. A simple algebraic manipulation using the properties of the inner product and (11.1) gives
(11.2) $\quad (e_i - \alpha_i r_i, r_i) = 0 \quad\text{or}\quad \alpha_i = \frac{(e_i, r_i)}{(r_i, r_i)}.$
Of course, this method is not computable as we do not know $e_i$, so the numerator in the definition of $\alpha_i$ in (11.2) is not available.
[Figure 1. Minimal error. The labeled vectors are $e_i$, $e_{i+1}$, and $\alpha_i r_i$.]
We can fix up the above method by introducing a different norm; actually, we introduce a different inner product. Recall from earlier classes that inner products not only provide norms but also give rise to a (different) notion of angle. We shall get a computable algorithm by replacing the dot inner product above with the A-inner product, i.e., we define
(11.3) $\quad \|e_{i+1}\|_A = \min_{\alpha \in \mathbb{R}} \|e_i - \alpha r_i\|_A.$
The solution of this problem is to make $e_{i+1}$ A-orthogonal to $r_i$, i.e.,
$$(e_{i+1}, r_i)_A = 0.$$
Repeating the above computations (but with the A-inner product) gives
$$(e_i - \alpha_i r_i, r_i)_A = 0 \quad\text{or}\quad \alpha_i = \frac{(e_i, r_i)_A}{(r_i, r_i)_A},$$
i.e.,
(11.4) $\quad \alpha_i = \frac{(Ae_i, r_i)}{(Ar_i, r_i)} = \frac{(r_i, r_i)}{(Ar_i, r_i)}.$
We have now obtained a computable method. Clearly, the residual $r_i = b - Ax_i$ and $\alpha_i$ are computable without explicitly knowing $x$ or $e_i$. We can easily check that this choice of $\alpha_i$ solves (11.3). Indeed, by A-orthogonality and the Schwarz inequality,
(11.5) $\quad \|e_{i+1}\|_A^2 = (e_{i+1}, e_{i+1})_A = (e_{i+1}, e_i - \alpha r_i + (\alpha - \alpha_i) r_i)_A = (e_{i+1}, e_i - \alpha r_i)_A \le \|e_{i+1}\|_A \, \|e_i - \alpha r_i\|_A$
holds for any $\alpha \in \mathbb{R}$. Clearly, if $\|e_{i+1}\|_A = 0$ then (11.3) holds. Otherwise, (11.3) follows by dividing (11.5) by $\|e_{i+1}\|_A$.
The algorithm which we have just derived is known as the steepest descent
method and is summarized in the following:
Algorithm 1. (Steepest Descent). Let A be a SPD n × n matrix. Given an initial iterate $x_0$, define for $i = 0, 1, \ldots$,
$$x_{i+1} = x_i + \alpha_i r_i, \qquad r_i = b - Ax_i,$$
and
(11.6) $\quad \alpha_i = \frac{(r_i, r_i)}{(Ar_i, r_i)}.$
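To make the algorithm concrete, here is a minimal NumPy sketch of Algorithm 1; the function name steepest_descent, the maximum iteration count, and the residual-based stopping test are illustrative choices and not part of the algorithm as stated above.

    import numpy as np

    def steepest_descent(A, b, x0, max_iter=100, tol=1e-10):
        # One possible realization of Algorithm 1 for a SPD matrix A:
        # x_{i+1} = x_i + alpha_i r_i with alpha_i = (r_i, r_i)/(A r_i, r_i).
        x = x0.astype(float).copy()
        for _ in range(max_iter):
            r = b - A @ x                 # residual r_i = b - A x_i
            rr = r @ r
            if np.sqrt(rr) < tol:         # illustrative stopping test on ||r_i||
                break
            alpha = rr / (r @ (A @ r))    # alpha_i from (11.6)
            x = x + alpha * r
        return x

    # Usage example with the 2 x 2 matrix considered later in these notes.
    A = np.array([[1.0, 0.0], [0.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(steepest_descent(A, b, np.zeros(2)))   # approximately (1, 0.5)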
Proposition 1. Let A be a SPD n × n matrix and $\{e_i\}$ be the sequence of errors corresponding to the steepest descent algorithm. Then
$$\|e_{i+1}\|_A \le \frac{K - 1}{K + 1} \|e_i\|_A$$
where K is the spectral condition number of A.
Proof. Since $e_{i+1}$ minimizes $\|e_i - \alpha r_i\|_A$ over $\alpha$ and $r_i = Ae_i$,
$$\|e_{i+1}\|_A \le \|(I - \tau A)e_i\|_A$$
for any real $\tau$. Taking $\tau = 2/(\lambda_1 + \lambda_n)$ as in the proposition of Class 7 and applying that proposition completes the proof.
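As a quick numerical illustration of the proposition (a sketch assuming NumPy is available; the random SPD test matrix below is an arbitrary choice), one can check the per-step contraction in the A-norm:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)              # an arbitrary SPD test matrix
    x = rng.standard_normal(n)               # exact solution
    b = A @ x

    K = np.linalg.cond(A)                    # spectral condition number for SPD A
    bound = (K - 1.0) / (K + 1.0)

    def A_norm(v):
        # the A-norm ||v||_A = sqrt((Av, v))
        return np.sqrt(v @ (A @ v))

    xi = np.zeros(n)
    for i in range(10):
        r = b - A @ xi
        alpha = (r @ r) / (r @ (A @ r))
        xnew = xi + alpha * r
        # check ||e_{i+1}||_A <= (K-1)/(K+1) ||e_i||_A
        assert A_norm(x - xnew) <= bound * A_norm(x - xi) * (1.0 + 1e-10)
        xi = xnew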
Remark 1. Note that $\lambda_1$ and $\lambda_n$ only appear in the analysis. We do not need any eigenvalue estimates for implementation of the steepest descent method.
Remark 2. As already mentioned, it is not practical to attempt to make the optimal choice with respect to other norms as the error is not explicitly known. Alternatively, at the cost of at least one application of A, this drawback can be eliminated; for example, one could design a method which minimizes
$$\|Ae_i\|_{\ell^2}.$$
One could also propose to minimize some other norm, e.g.,
$$\|Ae_i\|_{\ell^\infty}.$$
Although this is feasible, since the $\ell^\infty$ norm does not come from an inner product, the computation of the parameter $\alpha_i$ ends up being a difficult nonlinear problem.
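For instance, taking the first suggestion: since $Ae_{i+1} = r_{i+1} = r_i - \alpha Ar_i$ involves only computable quantities, minimizing $\|Ae_{i+1}\|_{\ell^2}$ over $\alpha$ is a one-dimensional least-squares problem, and the same orthogonality argument as above (now in the dot product) would lead to
$$\alpha_i = \frac{(r_i, Ar_i)}{(Ar_i, Ar_i)},$$
the choice made in so-called minimal residual iterations; this is only a sketch of the idea mentioned in the remark.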
It is interesting to note that the Steepest Descent Method is the first example (in this course) of an iterative method that is not linear. Note that $e_{i+1}$ can be expressed directly from $e_i$ (without knowing $x_i$ or $b$) since one simply substitutes $r_i = Ae_i$ to compute $\alpha_i$ and uses
$$e_{i+1} = e_i - \alpha_i r_i.$$
Thus, there is a mapping $e_i \to e_{i+1}$; however, it is NOT LINEAR. This can be illustrated by considering the 2 × 2 matrix
$$A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}.$$
For either $e_0^1 = (1, 0)^t$ or $e_0^2 = (0, 1)^t$, a direct computation gives $e_1^j = (0, 0)^t$ for $j = 1, 2$ (do it!). Here $e_1^j$ is the error after one step of steepest descent is applied to $e_0^j$. In contrast, for $e_0 = e_0^1 + e_0^2 = (1, 1)^t$, we find
$$Ae_0 = r_0 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad Ar_0 = \begin{pmatrix} 1 \\ 4 \end{pmatrix}, \qquad \alpha_0 = \frac{5}{9},$$
$$e_1 = \begin{pmatrix} 4/9 \\ -1/9 \end{pmatrix} \ne \begin{pmatrix} 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} = e_1^1 + e_1^2.$$
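The computation above is easy to reproduce; the following short NumPy check (an illustrative sketch, with the helper name sd_error_step chosen here for convenience) applies the error mapping $e_0 \to e_1$ to $e_0^1$, $e_0^2$, and $e_0^1 + e_0^2$.

    import numpy as np

    A = np.array([[1.0, 0.0], [0.0, 2.0]])

    def sd_error_step(e):
        # The (nonlinear) error mapping e_0 -> e_1 of one steepest descent step,
        # using r_0 = A e_0 and alpha_0 = (r_0, r_0)/(A r_0, r_0).
        r = A @ e
        alpha = (r @ r) / (r @ (A @ r))
        return e - alpha * r

    print(sd_error_step(np.array([1.0, 0.0])))   # [0. 0.]
    print(sd_error_step(np.array([0.0, 1.0])))   # [0. 0.]
    print(sd_error_step(np.array([1.0, 1.0])))   # [4/9, -1/9], not the sum of the two above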
The steepest descent method makes the error $e_{i+1}$ A-orthogonal to $r_i$. Unfortunately, if this step results in little change, then $r_{i+1}$ lies in almost the same direction as $r_i$, so the step to $e_{i+2}$ is not very effective either, since $e_{i+1}$ is already A-orthogonal to $r_i$ and almost A-orthogonal to $r_{i+1}$. This is a shortcoming of the steepest descent method.
The fix is simple. We generalize the algorithm and let $p_i$ be the direction which we use to compute $e_{i+1}$. As usual, we make $e_{i+1}$ A-orthogonal to $p_i$. The idea is to preserve this orthogonality when going to $e_{i+2}$. Since $e_{i+1}$ is already A-orthogonal to $p_i$, $e_{i+2}$ will remain A-orthogonal to $p_i$ only if our new search direction $p_{i+1}$ is also A-orthogonal to $p_i$. Thus, instead of using $r_{i+1}$ as our next search direction, we use the component of $r_{i+1}$ which is A-orthogonal to $p_i$, i.e.,
$$p_{i+1} = r_{i+1} - \beta_i p_i$$
where $\beta_i$ is chosen so that
$$(p_{i+1}, p_i)_A = 0.$$
A simple computation gives (do it!)
$$\beta_i = \frac{(r_{i+1}, p_i)_A}{(p_i, p_i)_A}.$$
We continue by making $e_{i+2}$ A-orthogonal to $p_{i+1}$, i.e.,
$$x_{i+2} = x_{i+1} + \alpha_{i+1} p_{i+1}$$
with $\alpha_{i+1}$ satisfying
$$(e_{i+1} - \alpha_{i+1} p_{i+1}, p_{i+1})_A = 0 \quad\text{or}\quad \alpha_{i+1} = \frac{(r_{i+1}, p_{i+1})}{(Ap_{i+1}, p_{i+1})}.$$
As both $e_{i+1}$ and $p_{i+1}$ are A-orthogonal to $p_i$ and $e_{i+2} = e_{i+1} - \alpha_{i+1} p_{i+1}$, $e_{i+2}$ is A-orthogonal to both $p_i$ and $p_{i+1}$. The above discussion leads to the following algorithm.
Algorithm 2. (Conjugate Gradient). Let A be a SPD n × n matrix and $x_0 \in \mathbb{R}^n$ (the initial iterate) and $b \in \mathbb{R}^n$ (the right hand side) be given. Start by setting $p_0 = r_0 = b - Ax_0$. Then for $i = 0, 1, \ldots$, define
(11.7)
$$x_{i+1} = x_i + \alpha_i p_i, \quad\text{where } \alpha_i = \frac{(r_i, p_i)}{(Ap_i, p_i)},$$
$$r_{i+1} = r_i - \alpha_i A p_i,$$
$$p_{i+1} = r_{i+1} - \beta_i p_i, \quad\text{where } \beta_i = \frac{(r_{i+1}, Ap_i)}{(Ap_i, p_i)}.$$
Notice that we have moved the matrix-vector evaluation in the above inner products so that it is clear that only one matrix-vector evaluation, namely $Ap_i$, is required per iterative step after startup.
We illustrate pseudo code for the conjugate gradient algorithm below. We have implicitly assumed that A(X) is a routine which returns the result of A applied to X and IP(·, ·) returns the result of the inner product. Here k is the number of iterations, X is $x_0$ on input and X is $x_k$ on return.

    FUNCTION CG(X, B, A, k, IP)
        R = P = B - A(X);
        FOR j = 1, 2, ..., k DO {
            AP = A(P);  al = IP(R, P)/IP(P, AP);
            X = X + al*P;  R = R - al*AP;
            be = IP(R, AP)/IP(P, AP);
            P = R - be*P;
        }
        RETURN
    END
The above code is somewhat terse but was included to illustrate the fact that one can implement CG with exactly 3 extra vectors, R, P, and AP. An actual code would include extra checks for consistency and convergence. For example, a tolerance might be passed and the residual might be tested against it, causing the routine to return when the desired tolerance was achieved. Also, for consistency, one would check that (Ap, p) > 0, for when this is negative or zero, it is a sure sign that the matrix is either no good (not SPD) or that you have iterated to convergence (if (Ap, p) = 0).
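As one hedged illustration of what such an actual code might look like, the NumPy sketch below follows Algorithm 2 and the pseudo code, adds a relative residual tolerance, and raises an error when $(Ap, p) \le 0$; the function name cg and the particular stopping criterion are implementation choices, not part of the algorithm above.

    import numpy as np

    def cg(A, b, x0, max_iter=1000, tol=1e-10):
        # Conjugate gradient for a SPD matrix A, following Algorithm 2, with
        # the convergence and consistency checks discussed above.
        x = x0.astype(float).copy()
        r = b - A @ x
        p = r.copy()
        r0_norm = np.linalg.norm(r)
        for j in range(max_iter):
            if np.linalg.norm(r) <= tol * r0_norm:     # relative residual test
                return x, j
            Ap = A @ p
            pAp = p @ Ap
            if pAp <= 0.0:                             # consistency check: (Ap, p) > 0
                raise ValueError("(Ap, p) <= 0: A is not SPD or the iteration has converged")
            alpha = (r @ p) / pAp                      # alpha_i = (r_i, p_i)/(A p_i, p_i)
            x = x + alpha * p
            r = r - alpha * Ap                         # r_{i+1} = r_i - alpha_i A p_i
            beta = (r @ Ap) / pAp                      # beta_i = (r_{i+1}, A p_i)/(A p_i, p_i)
            p = r - beta * p                           # p_{i+1} = r_{i+1} - beta_i p_i
        return x, max_iter

    # Usage example on a small SPD system.
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x, its = cg(A, b, np.zeros(2))
    print(x, its)   # approximately [1/11, 7/11] after a couple of iterations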