Math 164: Optimization
Krylov subspace, nonlinear CG, and preconditioning
Instructor: Wotao Yin
Department of Mathematics, UCLA
Spring 2015
material taken from the textbook Chong-Zak, 4th Ed., and the CG paper
by Shewchuk
online discussions on piazza.com
Krylov subspace
• Definition:
Kk := span{b, Ab, . . . , A^{k−1} b}
Any point x ∈ Kk can be written as
x = a0 b + a1 Ab + · · · + a_{k−1} A^{k−1} b = (a0 I + a1 A + · · · + a_{k−1} A^{k−1}) b,
where the factor in parentheses is a polynomial p(A) of degree at most k−1. Therefore,
Kk = {p(A)b : deg(p) < k}.
• A Krylov subspace method generates
xk = arg min_{x∈Kk} F(x)
for some objective function F(x).
It has two components: the search space Kk and the objective F(x).
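To make the two components concrete, here is a minimal NumPy sketch (illustrative only; the matrix A, vector b, and sizes are made up) that builds an orthonormal basis of Kk and, as a concrete choice of objective, minimizes the quadratic F(x) = ½ x^T Ax − b^T x over Kk via a small reduced system.

```python
import numpy as np

def krylov_basis(A, b, k):
    """Orthonormal basis of K_k = span{b, Ab, ..., A^{k-1} b}.
    Assumes K_k has full dimension k (no breakdown)."""
    V = np.zeros((len(b), k))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(1, k):
        w = A @ V[:, j - 1]
        w -= V[:, :j] @ (V[:, :j].T @ w)      # Gram-Schmidt against previous columns
        V[:, j] = w / np.linalg.norm(w)
    return V

def krylov_argmin(A, b, k):
    """x_k = arg min_{x in K_k} F(x) with F(x) = 0.5 x^T A x - b^T x.
    Writing x = V c, the minimizer solves the reduced system (V^T A V) c = V^T b."""
    V = krylov_basis(A, b, k)
    c = np.linalg.solve(V.T @ A @ V, V.T @ b)
    return V @ c

# made-up SPD test problem
rng = np.random.default_rng(0)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
b = rng.standard_normal(8)
xstar = np.linalg.solve(A, b)
for k in range(1, 9):
    print(k, np.linalg.norm(krylov_argmin(A, b, k) - xstar))   # error decreases; ~0 at k = n
```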
CG and Krylov subspace
• CG generates, for k = 0, 1, . . .,
xk+1 ∈ span{r0, r1, . . . , rk} = span{d0, d1, . . . , dk}
     = span{d0, Ad0, . . . , A^k d0} = span{r0, Ar0, . . . , A^k r0}
• Recall CG sets d0 = r0 = b (i.e., x0 = 0) =⇒
span{d0, Ad0, . . . , A^k d0} = span{b, Ab, . . . , A^k b} = K_{k+1}
• Therefore,
xk ∈ Kk .
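A sketch of the standard CG recursion (with x0 = 0, d0 = r0 = b), together with a numerical check of the claim above that each iterate xk lies in Kk. Everything here is hypothetical illustration code, not from the course materials; the membership test projects xk onto an orthonormal basis of Kk built from b, Ab, A²b, . . .

```python
import numpy as np

def cg_iterates(A, b, num_iters):
    """Plain conjugate gradient for SPD A, started at x0 = 0; returns all iterates."""
    x = np.zeros_like(b)
    r = b.copy()                                  # r0 = b - A x0 = b
    d = r.copy()                                  # d0 = r0
    iterates = [x.copy()]
    for _ in range(num_iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
        iterates.append(x.copy())
    return iterates

rng = np.random.default_rng(1)
n = 10
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                       # made-up SPD matrix
b = rng.standard_normal(n)
xs = cg_iterates(A, b, n)

K = np.empty((n, 0))                              # orthonormal basis of K_k, grown column by column
v = b.copy()
for k in range(1, n + 1):
    q = v - K @ (K.T @ v)
    q -= K @ (K.T @ q)                            # re-orthogonalize for numerical safety
    K = np.column_stack([K, q / np.linalg.norm(q)])
    v = A @ v
    v /= np.linalg.norm(v)                        # rescaling does not change the span
    xk = xs[k]
    print(k, np.linalg.norm(xk - K @ (K.T @ xk))) # ~0 up to round-off: xk lies in K_k
```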
CG and Krylov subspace
• Since ek = xk − x∗ is A-orthogonal to span{d0, d1, . . . , dk−1},
‖xk − x∗‖²_A = min{ ‖x − x∗‖²_A : x ∈ span{d0, d1, . . . , dk−1} }
• CG generates a Krylov sequence
xk = arg min_{x∈Kk} F(x), where F(x) := ½ ‖x − x∗‖²_A.
• Recall: f(x) = ½ x^T Ax − b^T x. The solution x∗ obeys Ax∗ = b and gives f∗ = f(x∗).
½ ‖x − x∗‖²_A = ½ (x − x∗)^T A(x − x∗)
             = ½ x^T Ax − b^T x − ½ x∗^T Ax∗ + b^T x∗
             = f(x) − f∗.
Hence,
xk = arg min_{x∈Kk} f(x)    or    xk = arg min_{x∈Kk} ( f(x) − f∗ ).
Recall Kk = {p(A)b : deg(p) < k}. Hence, we can also view CG as
pk = arg min_{p : deg(p)<k} f(p(A)b),
and xk is recovered by xk = pk(A)b.
To analyze its convergence, we shall analyze f(p(A)b). We will simplify it to
polynomials evaluated at the eigenvalues of A.
Spectral representation
• Spectral decomposition: A = QΛQ^T, where QQ^T = Q^T Q = I and
Λ = diag(λ1, . . . , λn); λ1, . . . , λn are the eigenvalues of A.
• Spectral representations:
• x −→ y = Q^T x
• x∗ −→ y∗ = Q^T x∗
• b −→ b̄ = Q^T b
• f(x) −→ f̄(y) = ½ y^T Λy − b̄^T y
• f∗ = f(x∗) −→ f̄∗ = f̄(y∗)
• Kk = span{b, Ab, . . . , A^{k−1} b} −→ K̄k = span{b̄, Λb̄, . . . , Λ^{k−1} b̄}
• xk = arg min_{x∈Kk} f(x) −→ yk = arg min_{y∈K̄k} f̄(y)
• Error
½ ‖xk − x∗‖²_A = f(xk) − f∗
              = f̄(yk) − f∗
              = min_{p : deg(p)<k} ½ Σ_{i=1}^n (λi yi∗²)( λi p(λi) − 1 )²
              = min_{q : deg(q)≤k, q(0)=1} ½ Σ_{i=1}^n (λi yi∗²) q²(λi),
where q(λ) := 1 − λ p(λ), so that deg(q) ≤ k, q(0) = 1, and q²(λi) = (λi p(λi) − 1)².
• Relative error
τ(x) := (f(x) − f∗)/(f(0) − f∗) = ‖x − x∗‖²_A / ‖x∗‖²_A = (f̄(y) − f∗)/(f̄(0) − f∗)
• Relative error at iteration k
τ(xk) = min_{q : deg(q)≤k, q(0)=1} [ ½ Σ_{i=1}^n (λi yi∗²) q²(λi) ] / [ ½ Σ_{i=1}^n (λi yi∗²) ]
      ≤ min_{q : deg(q)≤k, q(0)=1} max_{i=1,...,n} q²(λi)
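A numerical sketch of the identity for τ(xk) above (everything below is a made-up example): the relative error of the k-th Krylov minimizer equals a weighted least-squares minimum over polynomials q with deg(q) ≤ k and q(0) = 1. Writing q(λ) = 1 − λ s(λ) with deg(s) < k turns the constraint q(0) = 1 into an unconstrained least-squares problem in the coefficients of s.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 7
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)                        # made-up SPD matrix
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)
f = lambda x: 0.5 * x @ A @ x - b @ x

lam, Q = np.linalg.eigh(A)
ystar = Q.T @ xstar
w = lam * ystar**2                                 # weights lambda_i * (y_i^*)^2

for k in range(1, n + 1):
    # x_k = arg min over K_k of f (this coincides with the CG iterate)
    V, _ = np.linalg.qr(np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)]))
    xk = V @ np.linalg.solve(V.T @ A @ V, V.T @ b)
    tau_cg = (f(xk) - f(xstar)) / (f(np.zeros(n)) - f(xstar))

    # min over q with deg(q) <= k, q(0) = 1 of sum_i w_i q(lambda_i)^2, divided by sum_i w_i;
    # parametrize q(lambda) = 1 - sum_j c_j lambda^{j+1} and solve a weighted least squares in c
    S = np.column_stack([lam**(j + 1) for j in range(k)])
    c, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * S, np.sqrt(w), rcond=None)
    tau_poly = np.sum(w * (1 - S @ c)**2) / np.sum(w)

    print(k, tau_cg, tau_poly)                     # the two columns agree up to round-off
```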
Fit a polynomial to λ1 , . . . , λn
• Choice of polynomial at iteration k: deg(q) ≤ k and q(0) = 1
• If q is small on the whole spectrum of A, then ‖xk − x∗‖²_A is small.
• If λ1, . . . , λn are clustered at k points, then xk = x∗.
• If λ1, . . . , λn are clustered in k groups, then xk is a good approximation of x∗.
• Since y∗ = Q^T x∗ = Q^{−1} x∗, if x∗ is a linear combination of just k
eigenvectors of A, then y∗ has k nonzeros and xk = x∗.
• If x∗ can be well approximated by a linear combination of just k
eigenvectors of A, then xk is a good approximation of x∗.
• Worst-case error bound
‖ek‖_A ≤ 2 ( (√κ − 1)/(√κ + 1) )^k ‖e0‖_A,    where κ = λmax/λmin,
is obtained by taking q to be the (shifted and scaled) Chebyshev polynomial of degree k,
which is small on the interval [λmin, λmax].
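A sketch comparing the actual A-norm error of CG with this worst-case bound on a made-up SPD matrix with a prescribed spectrum; the bound is often loose when the eigenvalues cluster.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
lam = np.sort(rng.uniform(1.0, 100.0, n))          # made-up spectrum in [1, 100]
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)
kappa = lam[-1] / lam[0]
a_norm = lambda e: np.sqrt(e @ A @ e)

x = np.zeros(n); r = b.copy(); d = r.copy()        # CG from x0 = 0
e0 = a_norm(x - xstar)
rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
for k in range(1, 21):
    Ad = A @ d
    alpha = (r @ r) / (d @ Ad)
    x = x + alpha * d
    r_new = r - alpha * Ad
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new
    print(k, a_norm(x - xstar), 2 * rho**k * e0)   # actual ||e_k||_A vs. Chebyshev bound
```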
Example for Stanford EE364b
A ∈ R^{7×7}, eigenvalues shown as black dots
[figures omitted: the eigenvalues of A plotted on the interval [0, 10], overlaid with the
fitted polynomials of increasing degree, one slide per case]
• no polynomial fitting
• p1 fitting
• p1, p2 fitting
• p1, p2, p3 fitting
• p1, p2, p3, p4 fitting
• p1, p2, p3, p4, p7 fitting
Convergence of relative error
[figure omitted: relative error τ(xk) versus iteration k = 0, 1, . . . , 7]
Larger example (from S. Boyd)
• Analysis of a resistor circuit. Solve Gv = i, where
• vector v holds the node voltages
• vector i holds the source currents
• matrix G holds the circuit conductances:
• Gii = total conductance incident on node i
• Gij = −(conductance between nodes i and j)
• the resistor circuit has 10^5 nodes; the average node degree is 10
• around 10^6 nonzeros in G
• sparse Cholesky factorization of G requires too much memory
• plot ηk := ‖rk‖2/‖r0‖2, which is not necessarily monotonic
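A tiny, made-up version of how such a G is assembled: each resistor between internal nodes i and j with conductance g adds g to Gii and Gjj and −g to Gij and Gji; a resistor from node i to ground only adds to Gii, which is what makes G positive definite. The real example would be assembled directly in a sparse format with about 10^5 nodes; the circuit below is fictional.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

n = 5                                              # tiny fictional circuit
# (node_i, node_j, conductance); node index -1 denotes ground
branches = [(0, 1, 2.0), (1, 2, 1.0), (2, 3, 0.5), (3, 4, 1.5), (0, 4, 1.0),
            (1, -1, 0.2), (3, -1, 0.3)]

G = np.zeros((n, n))
for i, j, g in branches:
    G[i, i] += g                                   # total conductance incident on node i
    if j >= 0:                                     # branch between two internal nodes
        G[j, j] += g
        G[i, j] -= g                               # -(conductance between nodes i and j)
        G[j, i] -= g
G = csr_matrix(G)                                  # store sparsely, as in the real problem

i_src = np.zeros(n)
i_src[0] = 1.0                                     # inject 1 A at node 0, returned via ground
v, info = cg(G, i_src)                             # solve G v = i with (unpreconditioned) CG
print(info, np.linalg.norm(G @ v - i_src))         # info == 0 means CG reported convergence
```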
Residual convergence in ℓ2
[figure omitted: ηk = ‖rk‖2/‖r0‖2 versus iteration k, on a log scale from about 10^4
down to 10^{−8} over roughly 60 iterations]
Nonzero initial solution
• Suppose we start from x0 ≠ 0
• Solving Ax = b ⇐⇒ solving Az = b − Ax0 and recovering x∗ = z∗ + x0
• Two choices of CG
• initialize CG with b̄ = b − Ax0 and proceed as normal
• initialize CG with b and proceed with
xk ← arg min_{x ∈ x0+Kk} f(x)
• They are equivalent
• Usage: warm start. Seed CG with a good approximate solution (e.g., one obtained
from a simpler, approximate system).
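A sketch of the first choice on a made-up system: shift the right-hand side to b − Ax0, run plain CG from zero, and add x0 back. The warm start x0 below is fabricated from a nearby, easier system just to have something reasonable to seed with.

```python
import numpy as np

def cg(A, b, num_iters):
    """Plain CG for SPD A, started at the zero vector."""
    x = np.zeros_like(b); r = b.copy(); d = r.copy()
    for _ in range(num_iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return x

rng = np.random.default_rng(4)
n = 30
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)

x0 = np.linalg.solve(A + 0.5 * np.eye(n), b)       # made-up approximate solution (warm start)
x_warm = x0 + cg(A, b - A @ x0, 10)                # solve A z = b - A x0, then x = x0 + z
x_cold = cg(A, b, 10)                              # cold start, for comparison
print(np.linalg.norm(x_warm - xstar), np.linalg.norm(x_cold - xstar))
```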
Precondition CG
• Idea: rotate/stretch so that the eigenvalues are more clustered
• Take a nonsingular matrix P
• use CG to solve (P^T AP) y = P^T b
• recover x∗ = P y∗
• One can form Ā = P^T AP and apply CG; or alternatively, re-arrange CG so
that each iteration requires multiplying by A and by M = PP^T once each.
• No need to form P explicitly or to recover x∗ = P y∗. Keep M and update x.
• What is a good preconditioner M?
• Naive case: M = A^{−1}, not practical
• A good preconditioner maintains a good trade-off among
• CG convergence speed
• storage
• ease of computing M r
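A sketch of the re-arranged iteration hinted at above: one multiply by A and one application of M ≈ A^{−1} per iteration, with M passed as a function. This is the standard preconditioned-CG recursion; the badly scaled test matrix and the choice of 25 iterations are made up for illustration.

```python
import numpy as np

def pcg(A, b, apply_M, num_iters):
    """Preconditioned CG from x0 = 0; apply_M(r) should approximate A^{-1} r."""
    x = np.zeros_like(b)
    r = b.copy()
    z = apply_M(r)
    d = z.copy()
    for _ in range(num_iters):
        Ad = A @ d                                 # one multiply by A per iteration
        alpha = (r @ z) / (d @ Ad)
        x = x + alpha * d
        r_new = r - alpha * Ad
        z_new = apply_M(r_new)                     # one application of M per iteration
        beta = (r_new @ z_new) / (r @ z)
        d = z_new + beta * d
        r, z = r_new, z_new
    return x

# made-up SPD system with widely varying diagonal scales
rng = np.random.default_rng(5)
n = 200
B = rng.standard_normal((n, n))
A = np.diag(10.0 ** rng.uniform(0, 4, n)) + 0.1 * (B + B.T) + n * np.eye(n)
b = rng.standard_normal(n)
xstar = np.linalg.solve(A, b)

x_plain = pcg(A, b, lambda r: r, 25)               # M = I: plain CG
x_jac   = pcg(A, b, lambda r: r / np.diag(A), 25)  # M = diag(1/A_11, ..., 1/A_nn)
print(np.linalg.norm(x_plain - xstar), np.linalg.norm(x_jac - xstar))
```

On this kind of diagonally ill-scaled system, the Jacobi-preconditioned run typically reaches a much smaller error in the same number of iterations.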
Common preconditioners
• Diagonal preconditioner:
M = diag(1/A11, . . . , 1/Ann)
(recall A ≻ 0, so Aii > 0 for all i)
• Banded preconditioner M
• Approximate Cholesky factorization A ≈ L̂L̂^T, where L̂ is cheap to compute,
easy to store, or both. Let M = L̂^{−T} L̂^{−1}.
At each PCG iteration, M r is computed with one forward and one backward solve.
• SSOR (symmetric successive over-relaxation). Suppose A = D + L + L^T,
where D is the diagonal and L is the strictly lower-triangular part. Let
M = (D + L^T)^{−1} D (D + L)^{−1}, i.e., M^{−1} = (D + L) D^{−1} (D + L^T).
• Fourier preconditioner: P = F^∗ (complex-valued)
• convolution theorem: F{f ∗ g} = F{f} · F{g}, where ∗ denotes convolution
• the theorem holds in both the continuous and the discrete settings
• If Ax = c ∗ x for some vector c (i.e., A is circulant), then Ax = b gives
F{Ax} = F{c ∗ x} = F{c} · F{x} = F{b},
which is equivalent to
(P^∗ AP)(P^∗ x) = P^∗ b,    where P^∗ AP is a diagonal matrix.
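A sketch of how the diagonal and SSOR preconditioners can be applied as functions r ↦ Mr (matching the M ≈ A^{−1} convention used here); they would be plugged into a PCG routine like the sketch on the previous slide. The test matrix is made up, and the demo only verifies that each M acts as a symmetric positive-definite operator.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(6)
n = 100
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)                        # made-up SPD test matrix

# Diagonal (Jacobi): M r = diag(1/A_11, ..., 1/A_nn) r
d = np.diag(A)
apply_jacobi = lambda r: r / d

# SSOR (omega = 1): with A = D + L + L^T, M^{-1} = (D + L) D^{-1} (D + L^T),
# so applying M costs one forward and one backward triangular solve.
DL = np.diag(d) + np.tril(A, k=-1)                 # D + L
apply_ssor = lambda r: solve_triangular(
    DL.T, d * solve_triangular(DL, r, lower=True), lower=False)

r1, r2 = rng.standard_normal(n), rng.standard_normal(n)
print(abs(r1 @ apply_ssor(r2) - r2 @ apply_ssor(r1)))      # ~0: the SSOR apply is symmetric
print(r1 @ apply_ssor(r1) > 0, r1 @ apply_jacobi(r1) > 0)  # positivity on a sample vector
```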
Example of diagonal preconditioning (from S. Boyd)
[figure omitted: residual convergence on a log scale, from about 10^4 down to 10^{−8},
over roughly 60 iterations]
Nonlinear CG
Three changes to CG going from linear to nonlinear
• the residual is no longer updated from the previous one; instead it is computed
directly: rk = −∇f(xk)
• the stepsize α is usually determined by a line search
• the different choices of β are no longer equivalent
Fletcher-Reeves:
β^FR_{k+1} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
Polak-Ribière:
β^PR_{k+1} = (r_{k+1}^T (r_{k+1} − r_k)) / (r_k^T r_k)
• If f is a strongly convex quadratic and α is the exact line-search minimizer, then
nonlinear CG reduces to linear CG.
Comparison between the two choices of β
• FR has better properties
• the search direction is a descent direction, at least when the line search is
exact or the strong Wolfe conditions are met
• PR has better performance
• it tends to be more robust and efficient
• but the strong Wolfe conditions do not guarantee a descent direction
• fix:
β_{k+1} = max{β^PR_{k+1}, 0},
which is equivalent to restarting CG whenever β^PR_{k+1} < 0.
• Restarting nonlinear CG frequently is not a bad idea.
• “conjugacy” comes from the Krylov subspace structure, which holds when f is quadratic
• the less similar f is to a quadratic, the more quickly “conjugacy” gets lost
• restarting CG lets it re-adapt to the local quadratic approximation
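A sketch of nonlinear CG with the three changes above, on a made-up smooth (non-quadratic) test function: the residual is the negative gradient, the stepsize comes from a simple backtracking line search (a stand-in for the Wolfe line searches used in practice), and β is either Fletcher-Reeves or Polak-Ribière with the max{·, 0} restart fix.

```python
import numpy as np

def nonlinear_cg(f, grad, x0, beta_rule="PR+", max_iters=50):
    """Nonlinear CG with a backtracking (Armijo) line search; a sketch, not a production solver."""
    x = x0.copy()
    r = -grad(x)                                   # residual = negative gradient, computed directly
    d = r.copy()
    for _ in range(max_iters):
        if r @ r < 1e-20:                          # gradient essentially zero: done
            break
        slope = r @ d
        if slope <= 0:                             # not a descent direction: restart with -grad
            d, slope = r.copy(), r @ r
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx - 1e-4 * alpha * slope:
            alpha *= 0.5                           # backtrack until the Armijo condition holds
        x = x + alpha * d
        r_new = -grad(x)
        if beta_rule == "FR":
            beta = (r_new @ r_new) / (r @ r)
        else:                                      # "PR+": Polak-Ribiere with the max{., 0} fix
            beta = max((r_new @ (r_new - r)) / (r @ r), 0.0)
        d = r_new + beta * d
        r = r_new
    return x

# made-up smooth test function (quadratic plus a mild non-quadratic term)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x + 0.1 * np.log(1.0 + x @ x)
grad = lambda x: A @ x + 0.2 * x / (1.0 + x @ x)
for rule in ["FR", "PR+"]:
    x = nonlinear_cg(f, grad, np.array([4.0, -3.0]), beta_rule=rule)
    print(rule, x, np.linalg.norm(grad(x)))        # both should drive the gradient toward zero
```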
Example (from J. Shewchuk)
Fletcher-Reeves (left) vs. Polak-Ribière (right)
[contour plots omitted]
Restart CG every other iteration (FR = PR)
[contour plot omitted]
Polak-Ribière with diagonal preconditioner
[contour plot omitted]