Iterative Methods
March 30, 2015

1 Serial Iterative Methods
Iterative methods for solving a linear system Ax = b are used when methods such as Gaussian elimination require too much time or too much space. Methods such as Gaussian elimination, or LU decomposition followed by back substitution, that compute the exact solution after a finite number of steps (in the absence of roundoff) are called direct methods. In contrast to direct methods, iterative methods generally do not produce the exact answer after a finite number of steps, but instead decrease the norm of the residual ‖b − Ax‖, or some other measure of error, by some fraction after each step. Iteration ceases when the error is less than a user-supplied threshold. The final error depends on how many iterations one does, as well as on the properties of the method and the linear system. It also depends on the machine precision of the target architecture. For example, if you are computing using single-precision datatypes (float), it does not make much sense to set the tolerance lower than 10^{-7} (why?). Our overall goal is to develop methods that decrease the error by a large amount at each iteration and do as little work per iteration as possible.
Iterative methods are especially useful when the matrix A is sparse because, unlike direct methods, no fill occurs. Also, compared with direct methods, iterative methods are easier to parallelize. For extremely large-scale problems, an added advantage is that iterative methods only require the application of the matrix-vector product (matvec) and are therefore suitable for matrix-free implementations. We shall first quickly review Poisson's equation and its discretization using finite differences. We shall then review some popular iterative methods before looking at their parallel versions.
1.1 Poisson's Equation

1.1.1 Poisson's Equation in One Dimension
We begin with a one-dimensional version of Poisson’s equation,
-\frac{d^2 v(x)}{dx^2} = f(x), \qquad 0 < x < 1, \qquad (1)
where f(x) is a given function and v(x) is the unknown function that we want to compute. v(x) must also satisfy the boundary conditions¹ v(0) = v(1) = 0. We discretize the problem by trying to compute an approximate solution at N + 2 evenly spaced points x_i between 0 and 1: x_i = ih, where h = 1/(N + 1) and 0 ≤ i ≤ N + 1. We abbreviate v_i = v(x_i) and f_i = f(x_i).
¹ These are called Dirichlet boundary conditions. Other kinds of boundary conditions are also possible.
To convert the differential equation (1) into a linear equation for the unknowns v_1, …, v_N, we use finite differences to approximate
\frac{dv(x)}{dx}\bigg|_{x=(i-0.5)h} \approx \frac{v_i - v_{i-1}}{h},
\qquad
\frac{dv(x)}{dx}\bigg|_{x=(i+0.5)h} \approx \frac{v_{i+1} - v_i}{h}.
Subtracting these approximations and dividing by h yields the centered difference approximation

-\frac{d^2 v(x)}{dx^2}\bigg|_{x=x_i} = \frac{2v_i - v_{i-1} - v_{i+1}}{h^2} - \tau_i, \qquad (2)
where τ_i, the so-called truncation error, can be shown to be O(h^2 ‖d^4 v / dx^4‖_∞). We may now rewrite equation (1) at x = x_i as

-v_{i-1} + 2v_i - v_{i+1} = h^2 f_i + h^2 \tau_i,
where 0 < i < N + 1. Since the boundary conditions imply that v_0 = v_{N+1} = 0, we have N equations in N unknowns v_1, …, v_N:

A_N \cdot \begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix}
\equiv
\begin{bmatrix}
 2 & -1 &        &        \\
-1 &  2 & \ddots &        \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  2
\end{bmatrix}
\cdot
\begin{bmatrix} v_1 \\ \vdots \\ v_N \end{bmatrix}
= h^2 \begin{bmatrix} f_1 \\ \vdots \\ f_N \end{bmatrix}
+ h^2 \begin{bmatrix} \tau_1 \\ \vdots \\ \tau_N \end{bmatrix} \qquad (3), (4)
or

A_N v = h^2 f + h^2 \bar{\tau}. \qquad (5)
To solve this equation, we will ignore τ̄, since it is small compared to f, to get

A_N v = h^2 f. \qquad (6)

This is equivalent to the archetypal system Ax = b, with x = v and b = h^2 f.
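As a quick sanity check of this discretization, the sketch below assembles the tridiagonal matrix A_N, solves system (6) for a manufactured right-hand side, and confirms the O(h^2) accuracy. The specific choices (N, the test function sin(πx)) are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Sketch: assemble A_N for the 1D Poisson problem and solve A_N v = h^2 f.
# Manufactured solution v(x) = sin(pi x) gives f(x) = -v''(x) = pi^2 sin(pi x).
N = 100
h = 1.0 / (N + 1)
x = np.linspace(h, 1.0 - h, N)            # interior points x_1, ..., x_N

f = np.pi**2 * np.sin(np.pi * x)

# Tridiagonal A_N: 2 on the diagonal, -1 on the sub- and super-diagonals
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)

v = np.linalg.solve(A, h**2 * f)          # direct solve (a "direct method")
print(np.max(np.abs(v - np.sin(np.pi * x))))   # error is O(h^2)
```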
1.1.2 Poisson's Equation in Two Dimensions
Now we turn to Poisson’s equation in two dimensions:
-\frac{\partial^2 v(x, y)}{\partial x^2} - \frac{\partial^2 v(x, y)}{\partial y^2} = f(x, y) \qquad (7)
on the unit square {(x, y) : 0 < x, y < 1}, with boundary condition v = 0 on the boundary of the square. We discretize at the grid points in the square, which are at (x_i, y_j) with x_i = ih and y_j = jh, where h = 1/(N + 1). We abbreviate v_{ij} = v(ih, jh) and f_{ij} = f(ih, jh), as shown in the figure below for N = 3:
[Figure: the (N + 2) × (N + 2) grid of points (x_i, y_j) for N = 3, with i, j = 0, …, 4 and spacing h. The blue cross centered at an interior point marks the 5-point stencil.]
From equation (2), we know that we can approximate
-\frac{\partial^2 v(x, y)}{\partial x^2}\bigg|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i-1,j} - v_{i+1,j}}{h^2}
\quad\text{and}\quad
-\frac{\partial^2 v(x, y)}{\partial y^2}\bigg|_{x=x_i,\, y=y_j} \approx \frac{2v_{i,j} - v_{i,j-1} - v_{i,j+1}}{h^2}.
Adding these approximations lets us write
-\frac{\partial^2 v(x, y)}{\partial x^2}\bigg|_{x=x_i,\, y=y_j} - \frac{\partial^2 v(x, y)}{\partial y^2}\bigg|_{x=x_i,\, y=y_j}
= \frac{4v_{i,j} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1}}{h^2} - \tau_{ij} \qquad (8)
where τ_{ij} is again a truncation error bounded by O(h^2). The blue cross in the middle of the above figure is called the (5-point) stencil of this equation, because it connects all (5) values of v present in equation (8). From the boundary conditions we know v_{0,j} = v_{N+1,j} = v_{i,0} = v_{i,N+1} = 0, so that equation (8) defines a set of n = N^2 linear equations in the n unknowns v_{ij} for 1 ≤ i, j ≤ N:

4v_{ij} - v_{i-1,j} - v_{i+1,j} - v_{i,j-1} - v_{i,j+1} = h^2 f_{ij}. \qquad (9)
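Equation (9) can be applied matrix-free: the matrix A is never formed, only its action on a vector. A minimal sketch (the helper name apply_poisson_2d and the sizes are assumptions made for illustration; zero Dirichlet boundaries are built in via padding):

```python
import numpy as np

# Sketch: apply the 2D Poisson operator of equation (9) matrix-free.
# v holds the N x N interior unknowns; the boundary values are zero.
def apply_poisson_2d(v):
    N = v.shape[0]
    vp = np.zeros((N + 2, N + 2))     # pad with the zero boundary values
    vp[1:-1, 1:-1] = v
    return (4.0 * vp[1:-1, 1:-1]
            - vp[:-2, 1:-1] - vp[2:, 1:-1]    # (i-1, j) and (i+1, j) neighbors
            - vp[1:-1, :-2] - vp[1:-1, 2:])   # (i, j-1) and (i, j+1) neighbors

# Example: residual r = h^2 f - A v
N = 64
h = 1.0 / (N + 1)
f = np.ones((N, N))
v = np.zeros((N, N))
r = h**2 * f - apply_poisson_2d(v)
print(np.linalg.norm(r))
```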
1.2 Summary of Methods for Solving Poisson's Equation
Table 1 lists the costs of various direct and iterative methods for solving the model problem on an N × N grid. The variable n = N^2 is the number of unknowns. Since direct methods provide the exact answer (in the absence of roundoff), whereas iterative methods provide only approximate answers, we must be careful when comparing their costs, since a low-accuracy answer can be computed more cheaply by an iterative method than a high-accuracy answer. Therefore, we compare costs assuming that the iterative methods iterate often enough to make the error at most some fixed small value² (say, 10^{-6}).

² Alternatively, we could iterate until the error is O(h^2) = O((N + 1)^{-2}), the size of the truncation error. One can show that this would increase the costs of the iterative methods in Table 1 by a factor of O(log n).
Method                        Serial time   Space     Type
Dense Cholesky                n^3           n^2       Direct
Explicit inverse              n^2           n^2       Direct
Band Cholesky                 n^2           n^{3/2}   Direct
Jacobi's                      n^2           n         Iterative
Gauß-Seidel                   n^2           n         Iterative
Sparse Cholesky               n^{3/2}       n log n   Direct
Conjugate Gradient            n^{3/2}       n         Iterative
SOR                           n^{3/2}       n         Iterative
SSOR with Chebyshev accel.    n^{5/4}       n         Iterative
Fast Fourier Transform        n log n       n         Direct
Block cyclic reduction        n log n       n         Direct
Multigrid                     n             n         Iterative
Lower bound                   n             n

Table 1: Order of complexity of solving Poisson's equation on an N × N grid (n = N^2).
2 Stationary Iterative Methods
The oldest and simplest iterations for solving linear systems are stationary iterations. These iterations have largely been supplanted by more sophisticated methods (such as Krylov subspace methods), but they are still a useful building block. Stationary iterations are so named because the solution to a linear system is expressed as finding the stationary point (fixed point) of some fixed-point iteration

x^{(k+1)} = F(x^{(k)}).

As is usually the case with fixed-point iterations, linear or nonlinear, the simplest way to establish convergence is generally to establish that the mapping is a contraction, i.e.

‖F(x) - F(y)‖ ≤ α ‖x - y‖, \qquad α < 1.
The constant α then establishes the rate of convergence. If we are solving a linear equation Ax = b, it generally makes sense to choose a fixed-point iteration where the mapping F is affine. We can write any such fixed-point iteration via a splitting of the matrix A, i.e. by writing A = M - K with M nonsingular. Then we can rewrite Ax = b in the form

M x = K x + b,

or, equivalently,

x = x + M^{-1}(b - Ax).

The fixed-point iteration is then

x^{(k+1)} = M^{-1}(K x^{(k)} + b) = x^{(k)} + M^{-1}(b - A x^{(k)}).

If we define R = M^{-1} K and c = M^{-1} b, we can write the iteration as

x^{(k+1)} = R x^{(k)} + c,
and the error e^{(k)} = x^{(k)} - x obeys the iteration

e^{(k+1)} = R e^{(k)}.

A sufficient condition for convergence is then that ‖R‖ < 1 in some operator norm. The actual necessary and sufficient condition is that ρ(R) < 1, where the spectral radius ρ(R) is defined as max |λ| over the eigenvalues λ of R.
2.1 Richardson Iteration
Perhaps the simplest stationary iteration is the Richardson iteration, in which M is chosen to be proportional to the identity:

x^{(k+1)} = x^{(k)} + ω (b - A x^{(k)}).

The iteration matrix in this case is simply R = I - ωA. As long as all the eigenvalues of A have positive real part, Richardson iteration with a small enough ω will eventually converge, but that convergence may take a very long time. In the case that A is symmetric and positive definite, the eigenvalues of R = I - ωA are 1 - ωλ, where λ are the eigenvalues of A. Since in this case R is symmetric, ‖R‖_2 is the largest singular value (largest eigenvalue magnitude):

‖R‖_2 = max(|1 - ω λ_max|, |1 - ω λ_min|).

The rate of convergence is optimal when

|1 - ω λ_max| = |1 - ω λ_min|,

which occurs when ω = 2/(λ_max + λ_min). In this case, the rate of convergence in the 2-norm is determined by

‖R‖_2 = 1 - \frac{2 λ_min}{λ_max + λ_min} = 1 - \frac{2}{κ(A) + 1}.
Thus, if A is ill-conditioned, the iteration may be painfully slow.
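A minimal sketch of Richardson iteration on the 1D model problem, using the optimal step ω = 2/(λ_max + λ_min); the problem size, tolerance, and iteration cap are illustrative assumptions:

```python
import numpy as np

# Sketch: Richardson iteration x <- x + omega * (b - A x) on the 1D Poisson matrix.
def richardson(A, b, omega, x0, tol=1e-8, maxit=20_000):
    x = x0.copy()
    for k in range(maxit):
        r = b - A @ x                       # residual
        if np.linalg.norm(r) < tol:
            return x, k
        x += omega * r
    return x, maxit

N = 50
A = 2.0 * np.eye(N) - np.eye(N, k=1) - np.eye(N, k=-1)
b = np.ones(N) / (N + 1) ** 2
lam = np.linalg.eigvalsh(A)
omega = 2.0 / (lam[0] + lam[-1])            # optimal step for SPD A
x, iters = richardson(A, b, omega, np.zeros(N))
print(iters, np.linalg.norm(b - A @ x))     # many iterations: kappa(A) is large
```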
2.2 Jacobi Method
Jacobi iteration is usually introduced by talking about "sweeping" through the variables and updating each one based on the assumption that the other variables are correct. Component by component, we have

a_{ii} x_i^{(k+1)} + \sum_{j \ne i} a_{ij} x_j^{(k)} = b_i,

or

x_i^{(k+1)} = \Big( b_i - \sum_{j \ne i} a_{ij} x_j^{(k)} \Big) / a_{ii}.

Alternately, we can think of Jacobi's iteration as taking M = D to be the diagonal part of A. The iteration matrix in this case is

R = I - D^{-1} A.
If D, L, and U are the diagonal, strictly lower triangular, and strictly upper triangular portions of A, then the Jacobi method can be written as

x^{(k+1)} = D^{-1} \big( b - (L + U) x^{(k)} \big).

If A is strictly row diagonally dominant, then ‖R‖_∞ < 1, and the iteration converges. When we discuss multigrid, we will also use as a building block the damped Jacobi iteration, where M = ω^{-1} D for some ω < 1.
• The Jacobi method requires nonzero diagonal entries; this can usually be arranged by permuting rows and columns if it is not already true.
• The Jacobi method requires duplicate storage for x, since no component can be overwritten until all new values have been computed.
• Components of the new iterate do not depend on each other, so they can be computed simultaneously.
• The Jacobi method does not always converge, but it is guaranteed to converge under conditions that are often satisfied (e.g., if the matrix is strictly diagonally dominant), though the convergence rate may be very slow.
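A minimal dense-matrix sketch of the Jacobi update; note the separate arrays x and x_new, reflecting the duplicate-storage requirement above. The function name and defaults are illustrative assumptions.

```python
import numpy as np

# Sketch: Jacobi iteration for Ax = b.
def jacobi(A, b, x0, tol=1e-8, maxit=10_000):
    D = np.diag(A)                        # diagonal entries (must be nonzero)
    R = A - np.diagflat(D)                # off-diagonal part L + U
    x = x0.copy()
    for k in range(maxit):
        x_new = (b - R @ x) / D           # x_i <- (b_i - sum_{j != i} a_ij x_j) / a_ii
        if np.linalg.norm(b - A @ x_new) < tol:
            return x_new, k
        x = x_new
    return x, maxit
```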
2.3 Gauß-Seidel Method
For the Jacobi iteration, we think of using equation j to update variable x_j under the assumption that the old values of all neighboring variables are correct. For the Gauß-Seidel iteration, we think of updating x_1, x_2, … in turn, at each step using the most recent values of all the other variables. That is, we update according to

\sum_{j \le i} a_{ij} x_j^{(k+1)} + \sum_{j > i} a_{ij} x_j^{(k)} = b_i.
If we write A = D - L̃ - Ũ = D(I - L - U), where -L̃ and -Ũ are the strictly lower and upper triangular parts of A, then Gauß-Seidel corresponds to using M = D(I - L), and the iteration operator is

R = (I - L)^{-1} U.

In the case of strict row diagonal dominance, ‖R‖_∞ < 1; in fact, if R_{GS} and R_J are the iteration operators for Gauß-Seidel and Jacobi, then for strictly row diagonally dominant A we have

‖R_{GS}‖_∞ ≤ ‖R_J‖_∞ < 1.
We can also show that Gauß-Seidel converges in the symmetric positive definite case. Because the
analysis technique will be relevant to some later discussions, we will take a moment to describe
the argument. If A is symmetric positive definite, then the solution of Ax = b is also the unique
minimum of the quadratic function
φ(x) = \frac{1}{2} x^T A x - x^T b.
Now suppose that x̂ is an approximate solution, and we want to get a better solution of the form x̂′ = x̂ + α e_i. Note that

φ(x̂ + α e_i) = φ(x̂) + \frac{1}{2} α^2 a_{ii} + α e_i^T (A x̂ - b),

which we can minimize by choosing

a_{ii} α = e_i^T (b - A x̂).

This exactly corresponds to the update

a_{ii} x_i^{(new)} + \sum_{j \ne i} a_{ij} x_j^{(prev)} = b_i.
Thus, Gauß-Seidel iteration can be seen as a coordinate-descent minimization algorithm with exact
line searches. Furthermore, note that if Ax = b and x̂ = x + e is an approximation, then
φ(x̂) = φ(x) + \frac{1}{2} e^T A e,

so

φ(x̂) - φ(x) = \frac{1}{2} ‖e‖_A^2,

where ‖e‖_A is the error measured in the "energy norm" induced by the positive definite matrix A. So in this
case, the Gauß-Seidel iteration is monotonically convergent in the norm associated with A.
In many practical cases, even those that are not strictly diagonally dominant or symmetric and
positive definite, Gauß-Seidel converges somewhat faster than Jacobi. But proving this requires
knowing something about the structure of the problem. Outside of strictly row-diagonally dominant
A, there are examples where Jacobi converges and Gauß-Seidel does not, and vice-versa.
In summary,
• Gauß-Seidel requires nonzero diagonal entries.
• Gauß-Seidel does not require duplicate storage for x, since component values can be overwritten as they are computed.
• But each component depends on the previous ones, so they must be computed successively.
• Gauß-Seidel does not always converge, but it is guaranteed to converge under conditions that are somewhat weaker than those for the Jacobi method (e.g., if the matrix is symmetric and positive definite).
• Gauß-Seidel converges about twice as fast as Jacobi, but may still be very slow.
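A minimal sketch of a Gauß-Seidel sweep; in contrast to the Jacobi sketch above, components are overwritten in place, so each update sees the newest values. The function name and defaults are illustrative assumptions.

```python
import numpy as np

# Sketch: Gauss-Seidel iteration for Ax = b.
def gauss_seidel(A, b, x0, tol=1e-8, maxit=10_000):
    n = len(b)
    x = x0.copy()
    for k in range(maxit):
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x[i] = (b[i] - sigma) / A[i, i]   # uses new x[:i], old x[i+1:]
        if np.linalg.norm(b - A @ x) < tol:
            return x, k
    return x, maxit
```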
2.4 Successive Over-relaxation (SOR)
If the Gauß-Seidel iteration gives us a good update, perhaps going even further in the Gauß-Seidel direction would give an even better update. This is the idea behind SOR (successive over-relaxation):

x^{(k+1)} = (1 - ω) x^{(k)} + ω x^{(k+1)}_{GS}.
The case ω < 1 is called under-relaxation; the case ω > 1 is over-relaxation. If A is symmetric, a forward SOR sweep followed by a backward sweep yields a symmetric iteration operator; this is the symmetric successive over-relaxation (SSOR) method. SOR diverges unless 0 < ω < 2, but choosing the optimal ω is difficult in general, except for special classes of matrices. With the optimal value of ω, the convergence rate of SOR can be an order of magnitude faster than that of Gauß-Seidel.
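A minimal sketch of SOR as a relaxed Gauß-Seidel sweep; ω = 1 recovers Gauß-Seidel, and the default ω = 1.5 below is just an illustrative choice, not the optimal value.

```python
import numpy as np

# Sketch: SOR iteration for Ax = b with relaxation parameter omega in (0, 2).
def sor(A, b, x0, omega=1.5, tol=1e-8, maxit=10_000):
    n = len(b)
    x = x0.copy()
    for k in range(maxit):
        for i in range(n):
            sigma = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
            x_gs = (b[i] - sigma) / A[i, i]             # Gauss-Seidel value
            x[i] = (1.0 - omega) * x[i] + omega * x_gs  # relaxed update
        if np.linalg.norm(b - A @ x) < tol:
            return x, k
    return x, maxit
```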
3 Krylov Subspace Methods
These methods are used both to solve Ax = b and to find eigenvalues of A. They assume that A is accessible only via a "black-box" subroutine that returns y = Az given any z (and perhaps y = A^T z if A is nonsymmetric). In other words, no direct access to or manipulation of matrix entries is used. This is a reasonable assumption for several reasons. First, the cheapest nontrivial operation that one can perform on a (sparse) matrix is to multiply it by a vector; if A has m nonzero entries, matrix-vector multiplication costs m multiplications and (at most) m additions. Second, A may not be represented explicitly as a matrix but may be available only as a subroutine for computing Ax.

A variety of different Krylov subspace methods exist. Some are suitable for nonsymmetric matrices, and others assume symmetry or positive definiteness. Some methods for nonsymmetric matrices assume that A^T z can be computed as well as Az; depending on how A is represented, A^T z may or may not be available. The most efficient and best understood method, the conjugate gradient method (CG), is suitable only for symmetric positive definite matrices, including the model problem.
3.1 Conjugate Gradient Method
We say that two non-zero vectors u and v are conjugate (with respect to A) if

u^T A v = 0.

Since A is symmetric and positive definite, the left-hand side defines an inner product

⟨u, v⟩_A := ⟨Au, v⟩ = ⟨u, A^T v⟩ = ⟨u, Av⟩ = u^T A v.

Two vectors are conjugate if and only if they are orthogonal with respect to this inner product. Being conjugate is a symmetric relation: if u is conjugate to v, then v is conjugate to u.
Suppose that P = {p_k : ⟨p_i, p_k⟩_A = 0 for all i ≠ k, with i, k ∈ [1, n]} is a set of n mutually conjugate directions. Then P is a basis of R^n, so within P we can expand the solution x_* of Ax = b:

x_* = \sum_{i=1}^{n} α_i p_i,

and we see that

b = A x_* = \sum_{i=1}^{n} α_i A p_i.

For any p_k ∈ P,

p_k^T b = p_k^T A x_* = \sum_{i=1}^{n} α_i p_k^T A p_i = α_k p_k^T A p_k
(because p_i and p_k are mutually conjugate for all i ≠ k), so

α_k = \frac{p_k^T b}{p_k^T A p_k} = \frac{⟨p_k, b⟩}{⟨p_k, p_k⟩_A} = \frac{⟨p_k, b⟩}{‖p_k‖_A^2}.
This gives the following method for solving the equation Ax = b: find a sequence of n conjugate directions, and then compute the coefficients α_k. If we choose the conjugate vectors p_k carefully, we may not need all of them to obtain a good approximation to the solution x_*.

We denote the initial guess for x_* by x_0. Starting from x_0 we search for the solution, and at each iteration we need a metric to tell us whether we are closer to the solution x_*. This metric comes from the fact that the solution x_* is also the unique minimizer of the following quadratic function:
f(x) = \frac{1}{2} x^T A x - x^T b, \qquad x ∈ R^n.

This suggests taking the first basis vector p_0 to be the negative of the gradient of f at x = x_0. The gradient of f equals Ax - b, so starting from x_0 we take p_0 = b - A x_0. The other vectors in the basis will be conjugate to the gradient, hence the name conjugate gradient method.
Let r_k be the residual at the k-th step:

r_k = b - A x_k.

Note that r_k is the negative gradient of f at x = x_k, so the gradient descent method would move in the direction r_k. Here, we instead insist that the directions p_k be conjugate to each other. We also require that the next search direction be built out of the current residual and all previous search directions, which is reasonable enough in practice.
This gives the following expression:

p_k = r_k - \sum_{i < k} \frac{p_i^T A r_k}{p_i^T A p_i} \, p_i.
Following this direction, the next optimal location is given by

x_{k+1} = x_k + α_k p_k,

with

α_k = \frac{p_k^T b}{p_k^T A p_k} = \frac{p_k^T (r_{k-1} + A x_{k-1})}{p_k^T A p_k} = \frac{p_k^T r_{k-1}}{p_k^T A p_k},

where the last equality holds because p_k and x_{k-1} are conjugate.
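In the standard implementation the sums over all previous directions collapse into a short recurrence. A minimal sketch of the resulting algorithm (the function name and defaults are illustrative assumptions; only products with A are needed, so A could equally be a matrix-free operator):

```python
import numpy as np

# Sketch: conjugate gradient method for symmetric positive definite A.
def conjugate_gradient(A, b, x0, tol=1e-8, maxit=1000):
    x = x0.copy()
    r = b - A @ x                      # initial residual (negative gradient of f)
    p = r.copy()                       # first search direction p_0
    rs_old = r @ r
    for k in range(maxit):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # exact line search along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return x, k
        p = r + (rs_new / rs_old) * p  # new direction, A-conjugate to the old ones
        rs_old = rs_new
    return x, maxit
```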
Key features of the conjugate gradient method:
• Short recurrence determines search directions that are A-orthogonal (conjugate)
• Error is minimal over space spanned by search directions generated so far
• Minimum error property implies that method produces exact solution after at most n steps
• In practice, rounding error causes loss of orthogonality that spoils finite termination property,
so method is used iteratively
• Error is reduced at each iteration by a factor of (√κ - 1)/(√κ + 1) on average, where κ = cond(A) = ‖A‖ · ‖A^{-1}‖ = λ_max(A)/λ_min(A)
• Thus, convergence tends to be rapid if matrix is well-conditioned, but can be arbitrarily slow
if matrix is ill-conditioned
• But convergence also depends on clustering of eigenvalues of A.
3.2 Preconditioning
In the previous section we saw that the convergence rate of CG depends on the condition number of A, or more generally on the distribution of A's eigenvalues. Other Krylov subspace methods have the same property. Preconditioning means replacing the system Ax = b with the system M^{-1} A x = M^{-1} b, where M is an approximation to A with the properties that

1. M is symmetric and positive definite,
2. M^{-1} A is well conditioned or has few extreme eigenvalues,
3. M x = b is easy to solve.

A careful, problem-dependent choice of M can often make the condition number of M^{-1} A much smaller than the condition number of A and thus accelerate convergence dramatically. Indeed, a good preconditioner is often necessary for an iterative method to converge at all, and much current research in iterative methods is directed at finding better preconditioners. We will consider multigrid as a preconditioner for CG later in the class. The combination results in a very powerful, optimal solver.
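A minimal sketch of preconditioned CG, using a simple Jacobi (diagonal) preconditioner M = diag(A) purely for illustration; a stronger, problem-dependent M (e.g. multigrid) would replace the apply_Minv step. All names and defaults here are assumptions, not part of the notes.

```python
import numpy as np

# Sketch: preconditioned conjugate gradient with M = diag(A).
def pcg(A, b, x0, tol=1e-8, maxit=1000):
    d = np.diag(A)
    apply_Minv = lambda r: r / d       # solve M z = r for the Jacobi preconditioner
    x = x0.copy()
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz_old = r @ z
    for k in range(maxit):
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            return x, k
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz_old) * p
        rz_old = rz_new
    return x, maxit
```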
4 Parallel Implementation
Iterative methods for linear systems are composed of basic operations such as
• vector updates (daxpy)
• inner products
• matrix-vector multiplication (matvec)
• solution of triangular systems
In a parallel implementation, both data and operations are partitioned across multiple tasks. In addition to the communication required for these basic operations, the convergence test may require additional communication (e.g., a sum or max reduction). Iterative methods typically require several vectors, including the solution x, the right-hand side b, the residual r = b - Ax, and possibly others. These vectors are typically partitioned uniformly among the p tasks, with a given task holding the same set of component indices of each vector. Thus, vector updates require no communication, whereas inner products of vectors require reductions across tasks, at costs we have already seen.
4.1 Partitioning of Sparse Matrices and Matrix-free Implementations
A sparse matrix A can be partitioned among tasks by rows, by columns, or by submatrices. Partitioning into equal-sized submatrices may give an uneven distribution of nonzeros among tasks; indeed, some submatrices may contain no nonzeros at all. Adaptively determining the submatrices can give better results, but is itself a hard partitioning problem. Partitioning by rows or by columns tends to yield a more uniform distribution, because sparse matrices typically have about the same number of nonzeros in each row or column.

For large systems, it is generally preferable to use matrix-free implementations. In such methods, instead of storing and partitioning the matrix, we compute the matrix-vector product (matvec) directly. This means that the partitioning is done only on the vectors, say x and b. Each process owns a set of indices and computes the corresponding entries of Ax by using the analytical expressions for the matrix entries. For example, in the case of the 2D Poisson problem, we would use equation (9) to compute the matrix-vector product or the residual. Note that communication may be required if any of the indices (i ± 1, j), (i, j ± 1) are not owned by the same process. If the processes are arranged in a Cartesian topology and the domain is assigned accordingly, then these communications only need to be performed with immediate neighbors.
Note that it is possible to overlap communication with computation while performing the matvec. The steps are as follows (see the sketch after this list):

• post receives for the neighbors' boundary data (MPI_Irecv)
• send own process-boundary data to the appropriate neighbors (MPI_Isend)
• compute the process-interior part of the matvec
• wait for the communication to finish (MPI_Wait)
• compute the process-boundary part of the matvec
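The sketch below illustrates these steps with mpi4py for the 2D Poisson matvec of equation (9), using a simple 1D strip decomposition in which each rank owns a block of consecutive grid rows. The decomposition, array names, and sizes are assumptions made for the sake of the example; a 2D Cartesian decomposition would exchange halos in both directions.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
up   = rank - 1 if rank > 0 else MPI.PROC_NULL         # neighbor owning the rows above
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL  # neighbor owning the rows below

N = 64                                  # global N x N interior grid (assume size divides N)
rows = N // size                        # rows owned by this rank
# local block plus one halo row above/below and zero ghost columns left/right
v = np.zeros((rows + 2, N + 2))
v[1:-1, 1:-1] = np.random.rand(rows, N)

# 1. post receives for the halo rows, 2. send own boundary rows (non-blocking)
send_up, send_down = v[1, :].copy(), v[-2, :].copy()   # keep send buffers alive
reqs = [comm.Irecv(v[0, :],  source=up,   tag=0),
        comm.Irecv(v[-1, :], source=down, tag=1),
        comm.Isend(send_up,   dest=up,   tag=1),
        comm.Isend(send_down, dest=down, tag=0)]

def stencil_row(i):
    # 4*v[i,j] minus the four neighbors, for the N owned columns of local row i
    return (4.0 * v[i, 1:-1] - v[i - 1, 1:-1] - v[i + 1, 1:-1]
            - v[i, :-2] - v[i, 2:])

Av = np.empty((rows, N))

# 3. compute the rows that do not touch the halo while messages are in flight
for i in range(2, rows):
    Av[i - 1] = stencil_row(i)

# 4. wait for the halo exchange to finish
MPI.Request.Waitall(reqs)

# 5. compute the process-boundary rows, which need the halo data
Av[0]  = stencil_row(1)
Av[-1] = stencil_row(rows)
```

Run with, for example, mpiexec -n 4 python overlap_matvec.py (the file name is arbitrary).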
4.2 Termination of Iterations
Terminating an iterative method requires computing some form of error norm (ℓ_2, ℓ_∞, energy) to determine when the solution is close enough, allowing us to stop the iteration. These norms require communication and, more importantly, a reduction (typically a sum or a max). Since a reduction is a collective communication, it can be detrimental for large problems. For example, a single Jacobi iteration for our sample problem is likely to be cheaper than the norm computation (at least for large p). In such cases a better strategy is to iterate a few times without checking the termination condition, in order to amortize the cost of the norm computation. This strategy is less relevant for methods like CG, which already require the norm (inner products, and hence reductions) to be computed at every iteration.
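A minimal sketch of this amortization with mpi4py: the global residual norm (an Allreduce) is computed only every check_every sweeps. The kernels local_jacobi_sweep and local_residual are hypothetical placeholders for the user's matrix-free routines (e.g. the stencil code above).

```python
from mpi4py import MPI
import numpy as np

# Sketch: amortize the collective norm check over several Jacobi sweeps.
def iterate(local_jacobi_sweep, local_residual, x, comm,
            tol=1e-8, maxit=10_000, check_every=10):
    for k in range(maxit):
        x = local_jacobi_sweep(x)            # halo exchange only, no collectives
        if (k + 1) % check_every == 0:
            r = local_residual(x)
            local_sq  = np.array([r @ r])    # ||r||^2 contribution of this rank
            global_sq = np.empty(1)
            comm.Allreduce(local_sq, global_sq, op=MPI.SUM)   # one collective
            if np.sqrt(global_sq[0]) < tol:
                return x, k + 1
    return x, maxit
```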