Math 164: Optimization
Krylov subspace, nonlinear CG, and preconditioning
Instructor: Wotao Yin, Department of Mathematics, UCLA, Spring 2015
Material taken from the textbook Chong-Zak, 4th Ed., and the CG paper by Shewchuk; online discussions on piazza.com

Krylov subspace
• Definition: K_k := span{b, Ab, ..., A^{k-1} b}.
  Any point x ∈ K_k can be written as
      x = a_0 b + a_1 Ab + ... + a_{k-1} A^{k-1} b = (a_0 I + a_1 A + ... + a_{k-1} A^{k-1}) b,
  where the matrix in parentheses is a polynomial p(A) of degree up to k-1.
  Therefore, K_k = {p(A)b : deg(p) < k}.
• A Krylov subspace method generates
      x_k = arg min_{x ∈ K_k} F(x)
  for some objective function F(x). It has two components: the subspace constraint x ∈ K_k and the objective F(x).

CG and Krylov subspace
• CG generates, for k = 0, 1, ...,
      x_{k+1} ∈ span{r_0, r_1, ..., r_k} = span{d_0, d_1, ..., d_k} = span{d_0, Ad_0, ..., A^k d_0} = span{r_0, Ar_0, ..., A^k r_0}.
• Recall CG sets d_0 = r_0 = b, so
      span{d_0, Ad_0, ..., A^k d_0} = span{b, Ab, ..., A^k b} = K_{k+1}.
• Therefore, x_{k+1} ∈ K_{k+1}, i.e., x_k ∈ K_k.

CG and Krylov subspace
• Since e_k = x_k - x* is A-orthogonal to span{d_0, d_1, ..., d_{k-1}},
      ||x_k - x*||_A^2 = min { ||x - x*||_A^2 : x ∈ span{d_0, d_1, ..., d_{k-1}} }.
• Hence CG generates a Krylov sequence
      x_k = arg min_{x ∈ K_k} F(x), where F(x) := (1/2)||x - x*||_A^2.
• Recall f(x) = (1/2) x^T A x - b^T x. The solution x* attains f* and obeys Ax* = b. Then
      (1/2)||x - x*||_A^2 = (1/2)(x - x*)^T A (x - x*)
                          = (1/2) x^T A x - b^T x - (1/2) x*^T A x* + b^T x*
                          = f(x) - f*.
  Hence, x_k = arg min_{x ∈ K_k} f(x), or equivalently x_k = arg min_{x ∈ K_k} (f(x) - f*).
• Recall K_k = {p(A)b : deg(p) < k}. Hence, we can also view CG as
      p_k = arg min_{p : deg(p) < k} f(p(A)b),
  with x_k recovered by x_k = p_k(A)b.
• To analyze its convergence, we shall analyze f(p(A)b). We will reduce it to polynomials evaluated at the eigenvalues of A.

Spectral representation
• Spectral decomposition: A = QΛQ^T, where QQ^T = Q^T Q = I and Λ = diag(λ_1, ..., λ_n); λ_1, ..., λ_n are the eigenvalues of A.
• Spectral representations:
  • x → y = Q^T x
  • x* → y* = Q^T x*
  • b → b̄ = Q^T b = Λ y*
  • f(x) → f̄(y) = (1/2) y^T Λ y - b̄^T y
  • f* = f(x*) → f̄(y*)
  • K_k = span{b, Ab, ..., A^{k-1} b} → K̄_k = span{b̄, Λb̄, ..., Λ^{k-1} b̄}
  • x_k = arg min_{x ∈ K_k} f(x) → y_k = arg min_{y ∈ K̄_k} f̄(y)
• Error:
      (1/2)||x_k - x*||_A^2 = f(x_k) - f* = f̄(y_k) - f̄(y*)
                            = min_{p : deg(p) < k} (1/2) Σ_{i=1}^n (λ_i y_i*^2)(λ_i p(λ_i) - 1)^2,
  where λ_i p(λ_i) - 1 defines (up to a sign flip) a polynomial q(λ_i) with deg(q) ≤ k and q(0) = 1, so
      (1/2)||x_k - x*||_A^2 = min_{q : deg(q) ≤ k, q(0)=1} (1/2) Σ_{i=1}^n (λ_i y_i*^2) q^2(λ_i).
• Relative error:
      τ(x) := (f(x) - f*) / (f(0) - f*) = ||x - x*||_A^2 / ||x*||_A^2 = (f̄(y) - f̄(y*)) / (f̄(0) - f̄(y*)).
• Relative error at iteration k:
      τ(x_k) = min_{q : deg(q) ≤ k, q(0)=1} [ Σ_{i=1}^n (λ_i y_i*^2) q^2(λ_i) ] / [ Σ_{i=1}^n (λ_i y_i*^2) ]
             ≤ min_{q : deg(q) ≤ k, q(0)=1} max_{i=1,...,n} q^2(λ_i).

Fit a polynomial to λ_1, ..., λ_n
• Choice of polynomial at iteration k: deg(q) ≤ k and q(0) = 1.
• If q is small on all eigenvalues of A, then ||x_k - x*||_A^2 is small.
• If λ_1, ..., λ_n take only k distinct values (are clustered at k points), then x_k = x*.
• If λ_1, ..., λ_n are clustered into k groups, then x_k is a good approximation to x*.
• Since y* = Q^T x* = Q^{-1} x*, if x* is a linear combination of just k eigenvectors of A, then y* has k nonzeros and x_k = x*.
• If x* can be well approximated by a linear combination of just k eigenvectors of A, then x_k is a good approximation to x*.
• Worst-case error bound:
      ||e_k||_A ≤ 2 ( (√κ - 1) / (√κ + 1) )^k ||e_0||_A,
  obtained by taking q as the Chebyshev polynomial of degree k that is small on the interval [λ_min, λ_max]. (A numerical check appears in the sketch below.)

Example for Stanford EE364b
A ∈ R^{7×7}, eigenvalues shown as black dots.
[Figures: the seven eigenvalues plotted on [0, 10], first with no polynomial fitting and then with successive fits p_1; p_1, p_2; p_1, p_2, p_3; p_1, ..., p_4; and p_1, ..., p_4, p_7; followed by the relative error decreasing from 1 toward 0 over iterations 0 to 7.]
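As a concrete companion to the derivation above, here is a minimal NumPy sketch (my own, not from the slides) of linear CG started from x_0 = 0, so that every iterate x_k lies in K_k; it prints the A-norm error alongside the worst-case bound 2((√κ - 1)/(√κ + 1))^k ||e_0||_A. The test matrix, its eigenvalue range, and the iteration count are illustrative choices.

# A minimal sketch of linear CG for A x = b with A symmetric positive definite,
# starting from x_0 = 0 so that x_k lies in K_k = span{b, Ab, ..., A^{k-1} b}.
# It compares ||e_k||_A with the Chebyshev worst-case bound from the slides.
import numpy as np

def cg(A, b, num_iters):
    """Conjugate gradient with x_0 = 0; returns the iterates x_1, ..., x_k."""
    x = np.zeros_like(b)
    r = b.copy()          # r_0 = b - A x_0 = b
    d = r.copy()          # d_0 = r_0
    iterates = []
    for _ in range(num_iters):
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)   # exact line search along d
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d         # next A-conjugate direction
        r = r_new
        iterates.append(x.copy())
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    lam = np.linspace(1.0, 100.0, n)          # chosen eigenvalues, so kappa = 100
    A = Q @ np.diag(lam) @ Q.T
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)

    kappa = lam.max() / lam.min()
    rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
    e0_A = np.sqrt(x_star @ A @ x_star)       # ||e_0||_A since x_0 = 0

    for k, xk in enumerate(cg(A, b, 20), start=1):
        ek_A = np.sqrt((xk - x_star) @ A @ (xk - x_star))
        print(f"k={k:2d}  ||e_k||_A={ek_A:.3e}  bound={2 * rho**k * e0_A:.3e}")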
Larger example (from S. Boyd)
• Analysis of a resistor circuit. Solve Gv = i, where
  • vector v has the node voltages
  • vector i has the source currents
  • matrix G has the circuit conductances:
    • G_ii = total conductance incident on node i
    • G_ij = -(conductance between nodes i and j)
• the resistor circuit has 10^5 nodes; the average node degree is 10
• around 10^6 nonzeros in G
• a sparse Cholesky factorization of G requires too much memory
• plot η_k := ||r_k||_2 / ||r_0||_2, which is not necessarily monotonic

Residual convergence in ℓ_2
[Figure: η_k := ||r_k||_2 / ||r_0||_2 on a log scale, from 10^4 down to 10^{-8}, over iterations 0 to 60.]

Nonzero initial solution
• Suppose we start from x_0 ≠ 0.
• Solving Ax = b ⟺ solving Az = b - Ax_0 and recovering x* = z* + x_0.
• Two choices of CG:
  • initialize CG with b̄ = b - Ax_0 and proceed as normal
  • initialize CG with b and proceed with x_k ← arg min_{x ∈ x_0 + K_k} f(x)
• They are equivalent.
• Usage: warm start. Seed CG with a good approximate solution (e.g., one obtained from a simpler, approximate system).

Preconditioned CG
• Idea: rotate/stretch the problem so that the eigenvalues are more clustered.
• Take a nonsingular matrix P:
  • use CG to solve (P^T A P) y = P^T b
  • recover x* = P y*
• One can form Ā = P^T A P and apply CG; or, alternatively, rearrange CG so that each iteration requires one multiplication by A and one by M = PP^T.
• There is no need to form P explicitly or to recover x* = P y*: keep M and update x directly.
• What is a good preconditioner M?
  • Naive case: M = A^{-1}, not practical.
  • A good preconditioner maintains a good trade-off among
    • CG convergence speed
    • storage
    • ease of computing Mr

Common preconditioners
• Diagonal preconditioner: M = diag(1/A_11, ..., 1/A_nn) (recall A ≻ 0, so A_ii > 0 for all i); see the sketch after the example below.
• Banded preconditioner M.
• Approximate Cholesky factorization A ≈ L̂L̂^T, where L̂ is cheap to compute, easy to store, or both. Let M = L̂^{-T} L̂^{-1}. At each PCG iteration, Mr is computed by a forward solve and a backward solve.
• SSOR (symmetric successive over-relaxation). Suppose A = D + L + L^T, where D is the diagonal part and L is the strictly lower-triangular part. Let M = (D - L) D^{-1} (D - L^T).
• Fourier preconditioner P = F* (complex-valued):
  • convolution theorem: F{f ∗ g} = F{f} · F{g}, where ∗ denotes convolution
  • the theorem holds both continuously and discretely
  • if Ax = c ∗ x for some vector c (i.e., A is circulant), then FAx = F{c ∗ x} = F{c} · F{x} = F{b}, which is equivalent to (P* A P)(P* x) = P* b, where P* A P is a diagonal matrix.

Example of diagonal preconditioning (from S. Boyd)
[Figure: residual convergence on a log scale, from 10^4 down to 10^{-8}, over iterations 0 to 60.]
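To make the "keep M and update x" remark concrete, here is a sketch (not from the slides) of PCG in the convention used above: M approximates A^{-1}, and each iteration multiplies by A once and applies M once. The preconditioner shown is the diagonal (Jacobi) choice M = diag(1/A_11, ..., 1/A_nn), and the optional x0 argument gives the warm start discussed earlier. The function names (pcg, apply_M), the test matrix, and the tolerance are my own illustrative choices.

# A sketch of preconditioned CG in NumPy.  apply_M applies M (an approximation
# of A^{-1}) to a vector; a nonzero x0 warm-starts the iteration.
import numpy as np

def pcg(A, b, apply_M, x0=None, tol=1e-6, max_iters=2000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                 # residual of the original system
    z = apply_M(r)                # preconditioned residual M r
    d = z.copy()
    rz = r @ z
    for k in range(max_iters):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        Ad = A @ d
        alpha = rz / (d @ Ad)
        x += alpha * d
        r -= alpha * Ad
        z = apply_M(r)
        rz_new = r @ z
        beta = rz_new / rz        # standard PCG update for the next direction
        d = z + beta * d
        rz = rz_new
    return x, k

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 200
    # SPD test matrix with a widely varying diagonal, where diagonal scaling
    # is expected to help (an illustrative construction, not the circuit example)
    D = np.diag(10.0 ** rng.uniform(0, 4, n))
    B = rng.standard_normal((n, n))
    A = D + 0.1 * (B @ B.T)
    b = rng.standard_normal(n)

    jacobi = 1.0 / np.diag(A)                      # M = diag(1/A_11, ..., 1/A_nn)
    x_plain, k_plain = pcg(A, b, lambda r: r)      # M = I (no preconditioning)
    x_prec,  k_prec  = pcg(A, b, lambda r: jacobi * r)
    print("iterations without / with diagonal preconditioner:", k_plain, k_prec)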
Nonlinear CG
Three changes to CG when going from linear to nonlinear:
• the residual is no longer updated from the previous one; instead it is computed directly as r_k = -∇f(x_k)
• the stepsize α is often chosen by a line search
• the different choices of β are no longer equivalent:
      Fletcher-Reeves:  β_{k+1}^{FR} = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
      Polak-Ribière:    β_{k+1}^{PR} = (r_{k+1}^T (r_{k+1} - r_k)) / (r_k^T r_k)
• If f is a strongly convex quadratic and α is the exact minimizer, then nonlinear CG reduces to linear CG.

Comparison between the two choices of β
• FR has better properties:
  • the search direction is a descent direction, at least when the last line search is exact or the strong Wolfe conditions are met
• PR has better performance:
  • it tends to be more robust and efficient
  • but the strong Wolfe conditions do not guarantee a descent direction
  • fix: β_{k+1} = max{β_{k+1}^{PR}, 0}, which is equivalent to restarting CG whenever β_{k+1}^{PR} < 0 (see the sketch at the end of this section)
• Restarting nonlinear CG frequently is not a bad idea:
  • "conjugacy" is a consequence of the Krylov structure, which holds when f is quadratic
  • the less similar f is to a quadratic, the more quickly "conjugacy" gets lost
  • restarting CG lets it re-adapt to the local quadratic approximation

Example (from J. Shewchuk)
[Figures: contour plots comparing Fletcher-Reeves (left) vs. Polak-Ribière (right); CG restarted every other iteration (where FR = PR); and Polak-Ribière with a diagonal preconditioner.]
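Below is a sketch (not from the slides) of nonlinear CG with the PR+ rule β = max{β^{PR}, 0}, which restarts with a steepest-descent step whenever β^{PR} < 0. The line search is a plain Armijo backtracking search, a simplification of the strong Wolfe conditions mentioned above; the Rosenbrock test function and all parameter values are illustrative choices, not Shewchuk's example.

# A sketch of nonlinear CG with the Polak-Ribiere "plus" rule and a simple
# Armijo backtracking line search (a stand-in for a strong Wolfe search).
import numpy as np

def backtracking(f, x, d, g, alpha0=1.0, c=1e-4, shrink=0.5, max_steps=60):
    """Armijo backtracking along direction d, with g = gradient at x."""
    alpha = alpha0
    fx = f(x)
    gd = g @ d                    # directional derivative (negative for descent)
    for _ in range(max_steps):
        if f(x + alpha * d) <= fx + c * alpha * gd:
            break
        alpha *= shrink
    return alpha

def nonlinear_cg_pr(f, grad, x0, max_iters=2000, tol=1e-6):
    x = x0.astype(float).copy()
    r = -grad(x)                  # "residual" = negative gradient
    d = r.copy()
    for k in range(max_iters):
        if np.linalg.norm(r) <= tol:
            break
        alpha = backtracking(f, x, d, -r)
        x = x + alpha * d
        r_new = -grad(x)
        beta_pr = (r_new @ (r_new - r)) / (r @ r)
        beta = max(beta_pr, 0.0)  # PR+: restart when beta_PR < 0
        d = r_new + beta * d
        if r_new @ d <= 0:        # safeguard: fall back to steepest descent
            d = r_new.copy()
        r = r_new
    return x, k

if __name__ == "__main__":
    # Rosenbrock test function (illustrative choice); minimizer at (1, 1)
    f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    grad = lambda x: np.array([
        -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
        200 * (x[1] - x[0]**2),
    ])
    x, k = nonlinear_cg_pr(f, grad, np.array([-1.2, 1.0]))
    print(f"iterations: {k},  solution: {x}")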