ECE 530 – Analysis Techniques for Large-Scale Electrical Systems
Lecture 25: Krylov Subspace Methods
Prof. Hao Zhu
Dept. of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
haozhu@illinois.edu
12/2/2014

Announcements
• No class on Thursday Dec 4
• Homework 8 posted, due on Thursday Dec 11

Krylov Subspace Outline
• Review of fields and vector spaces
• Eigensystem basics
• Definition of Krylov subspaces and annihilating polynomial
• Generic Krylov subspace solver
• Steepest descent
• Conjugate gradient

Krylov Subspace
• Iterative methods to solve $\mathbf{A}\mathbf{x}=\mathbf{b}$ build on the idea that, with $a_0 + a_1\lambda + \cdots + a_m\lambda^m$ the annihilating (minimal) polynomial of $\mathbf{A}$,
  $\mathbf{x} = \mathbf{A}^{-1}\mathbf{b} = -\frac{1}{a_0}\sum_{j=0}^{m-1} a_{j+1}\,\mathbf{A}^{j}\mathbf{b}$
• Given a matrix $\mathbf{A}$ and a vector $\mathbf{v}$, the $i$th-order Krylov subspace is defined as
  $\mathbf{K}_i(\mathbf{v},\mathbf{A}) = \operatorname{span}\{\mathbf{v}, \mathbf{A}\mathbf{v}, \mathbf{A}^2\mathbf{v}, \ldots, \mathbf{A}^{i-1}\mathbf{v}\}$
• For a specified matrix $\mathbf{A}$ and vector $\mathbf{v}$, the largest value of $i$ for which the subspace keeps growing is bounded (by the degree of the annihilating polynomial)

Generic Krylov Subspace Solver
• The following is a generic Krylov subspace method for solving $\mathbf{A}\mathbf{x}=\mathbf{b}$ using only matrix-vector multiplies
• Step 1: Start with an initial guess $\mathbf{x}^{(0)}$ and some predefined error tolerance $\epsilon > 0$; compute the residual $\mathbf{r}^{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(0)}$; set $i = 0$
• Step 2: While $\|\mathbf{r}^{(i)}\| \ge \epsilon$ Do
  (a) $i := i + 1$
  (b) get $\mathbf{K}_i(\mathbf{r}^{(0)},\mathbf{A})$
  (c) find $\mathbf{x}^{(i)}$ in $\{\mathbf{x}^{(0)} + \mathbf{K}_i(\mathbf{r}^{(0)},\mathbf{A})\}$ to minimize $\|\mathbf{r}^{(i)}\|$
  Stop

Krylov Subspace Solver
• Note that no calculations are performed in Step 2 once $i$ becomes greater than its largest value
• The Krylov subspace methods differ from each other in
  – the construction scheme for the Krylov subspace in Step 2(b)
  – the residual minimization criterion used in Step 2(c)
• A common initial guess is $\mathbf{x}^{(0)} = \mathbf{0}$, giving $\mathbf{r}^{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(0)} = \mathbf{b}$
• Every solver involves the $\mathbf{A}$ matrix only in matrix-vector products: $\mathbf{A}^i\mathbf{r}^{(0)}$, $i = 1, 2, \ldots$

Iterative Optimization Methods
• Directly constructing the Krylov subspace for an arbitrary $\mathbf{A}$ and $\mathbf{r}^{(0)}$ would be computationally expensive
• We will instead introduce iterative optimization methods for solving $\mathbf{A}\mathbf{x}=\mathbf{b}$, which turn out to be special cases of Krylov subspace methods
• Without loss of generality, consider the system $\mathbf{A}\mathbf{x}=\mathbf{b}$ where $\mathbf{A}$ is symmetric (i.e., $\mathbf{A} = \mathbf{A}^T$) and positive definite (i.e., $\mathbf{A} \succ 0$, all eigenvalues positive)
• Any $\mathbf{A}\mathbf{x}=\mathbf{b}$ with nonsingular $\mathbf{A}$ is equivalent to $\mathbf{A}^T\mathbf{A}\mathbf{x} = \mathbf{A}^T\mathbf{b}$, where $\mathbf{A}^T\mathbf{A}$ is symmetric and positive definite

Optimization Problem
• Consider the convex problem
  $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$
• The optimal $\mathbf{x}^*$ that minimizes $f(\mathbf{x})$ is given by the solution of
  $\nabla_{\mathbf{x}} f = \mathbf{A}\mathbf{x} - \mathbf{b} = \mathbf{0}$,
  which is exactly the solution to $\mathbf{A}\mathbf{x}=\mathbf{b}$
• The classical method for convex optimization entails the application of the steepest descent scheme

Steepest Descent Algorithm
• Iteratively update $\mathbf{x}$ along the direction $-\nabla f(\mathbf{x}) = \mathbf{b} - \mathbf{A}\mathbf{x}$
• The stepsize is selected to minimize $f(\mathbf{x})$ along $-\nabla f(\mathbf{x})$
• Set $i = 0$, $\epsilon > 0$, $\mathbf{x}^{(0)} = \mathbf{0}$, so $\mathbf{r}^{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(0)} = \mathbf{b}$
• While $\|\mathbf{r}^{(i)}\| \ge \epsilon$ Do
  (a) calculate $\alpha^{(i)} = \dfrac{[\mathbf{r}^{(i)}]^T\mathbf{r}^{(i)}}{[\mathbf{r}^{(i)}]^T\mathbf{A}\mathbf{r}^{(i)}}$
  (b) $\mathbf{x}^{(i+1)} = \mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{r}^{(i)}$
  (c) $\mathbf{r}^{(i+1)} = \mathbf{r}^{(i)} - \alpha^{(i)}\mathbf{A}\mathbf{r}^{(i)}$
  (d) $i := i + 1$
  End While
• Note there is only one matrix-vector multiply per iteration (a code sketch of this loop follows the convergence discussion below)

Steepest Descent Convergence
• We define the A-norm of $\mathbf{x}$ by $\|\mathbf{x}\|_{\mathbf{A}}^2 = \mathbf{x}^T\mathbf{A}\mathbf{x}$
• We can show exponential convergence, that is,
  $\|\mathbf{x}^{(i)} - \mathbf{x}^*\|_{\mathbf{A}} \le \left(\dfrac{\kappa-1}{\kappa+1}\right)^{i} \|\mathbf{x}^{(0)} - \mathbf{x}^*\|_{\mathbf{A}}$
  where $\kappa = \lambda_{\max}(\mathbf{A})/\lambda_{\min}(\mathbf{A})$ is the condition number of $\mathbf{A}$

Steepest Descent Convergence
• Because $(\kappa-1)/(\kappa+1) < 1$, the error decreases with each steepest descent iteration, albeit potentially quite slowly for large $\kappa$
• The function value decreases more quickly, as per
  $\dfrac{f(\mathbf{x}^{(i)}) - f(\mathbf{x}^*)}{f(\mathbf{x}^{(0)}) - f(\mathbf{x}^*)} \le \left(\dfrac{\kappa-1}{\kappa+1}\right)^{2i}$,
  but this can still be quite slow if $\kappa$ is large
• The issue is that steepest descent often finds itself taking steps along the same direction as its earlier steps
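The steepest descent loop above translates directly into a few lines of NumPy. The following is a minimal sketch, assuming A is a symmetric positive definite NumPy array and b a 1-D array; the function name, tolerance, and iteration cap are illustrative choices, not part of the lecture.

```python
import numpy as np

def steepest_descent(A, b, tol=1e-8, max_iter=10000):
    """Solve A x = b for symmetric positive definite A by steepest descent."""
    x = np.zeros_like(b, dtype=float)   # x(0) = 0, so r(0) = b
    r = b - A @ x
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:     # stop once the residual is small enough
            break
        Ar = A @ r                      # the single matrix-vector product per iteration
        alpha = (r @ r) / (r @ Ar)      # exact line-search stepsize along r = -grad f
        x = x + alpha * r
        r = r - alpha * Ar              # update residual without recomputing b - A x
    return x
```

For ill-conditioned A this loop can need many iterations, which motivates the conjugate direction methods that follow.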
Conjugate Direction Methods
• An improvement over steepest descent is to take the exact number of steps using a set of search directions and obtain the solution after $n$ such steps; this is the basic idea in the conjugate direction methods
• [Figure: comparison of steepest descent with a conjugate direction approach]
  Image source: http://en.wikipedia.org/wiki/File:Conjugate_gradient_illustration.svg

Conjugate Direction Methods
• The basic idea is that the $n$ search directions, denoted by $\mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \ldots, \mathbf{d}^{(n-1)}$, need to be A-orthogonal, that is,
  $[\mathbf{d}^{(i)}]^T\mathbf{A}\mathbf{d}^{(j)} = 0, \quad i \ne j, \quad i, j = 0, 1, \ldots, n-1$
• At the $i$th iteration, we update
  $\mathbf{x}^{(i+1)} = \mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}, \quad i = 0, 1, \ldots, n-1$

Stepsize Selection
• The stepsize $\alpha^{(i)}$ is chosen such that
  $f(\mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}) = \min_{\alpha} f(\mathbf{x}^{(i)} + \alpha\,\mathbf{d}^{(i)})$
• Setting the derivative to zero,
  $0 = [\mathbf{d}^{(i)}]^T\nabla f(\mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}) = [\mathbf{d}^{(i)}]^T\big(\mathbf{A}(\mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}) - \mathbf{b}\big)$,
  which gives $\alpha^{(i)} = \dfrac{[\mathbf{d}^{(i)}]^T\mathbf{r}^{(i)}}{[\mathbf{d}^{(i)}]^T\mathbf{A}\mathbf{d}^{(i)}}$ with $\mathbf{r}^{(i)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(i)}$

Convergence Proof
• To prove the convergence of the conjugate direction method, we can show that
  $\mathbf{x}^{(i+1)} = \arg\min_{\mathbf{x}\in M_i} f(\mathbf{x})$, where $M_i = \{\mathbf{x}^{(0)} + \operatorname{span}\{\mathbf{d}^{(0)}, \ldots, \mathbf{d}^{(i)}\}\}$
• This is exactly due to the A-orthogonality of the $\mathbf{d}^{(i)}$'s
• Suppose all of $\mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \ldots, \mathbf{d}^{(n-1)}$ are linearly independent (l.i.); then
  $\operatorname{span}\{\mathbf{d}^{(0)}, \ldots, \mathbf{d}^{(n-1)}\} = \mathbb{R}^n$, so $M_{n-1} = \mathbb{R}^n$
• Therefore, $\mathbf{x}^{(n)} = \arg\min_{\mathbf{x}} f(\mathbf{x}) = \mathbf{x}^*$ is the optimum

Linearly Independent Directions
• Proposition: If $\mathbf{A}$ is positive definite and the nonzero vectors $\mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \ldots, \mathbf{d}^{(n-1)}$ are A-orthogonal, then these vectors are linearly independent (l.i.)
• Proof: Suppose there are constants $a_i$, $i = 0, 1, \ldots, n-1$, such that
  $a_0\mathbf{d}^{(0)} + a_1\mathbf{d}^{(1)} + \cdots + a_{n-1}\mathbf{d}^{(n-1)} = \mathbf{0}$
  Recall the vectors are l.i. only if this forces all $a_i = 0$. Multiplying by $\mathbf{A}$ and then taking the scalar product with $\mathbf{d}^{(i)}$ gives
  $a_i\,[\mathbf{d}^{(i)}]^T\mathbf{A}\mathbf{d}^{(i)} = 0$
  Since $\mathbf{A}$ is positive definite, it follows that $a_i = 0$. Hence, the vectors are l.i.

Conjugate Direction Method
• Given the search direction $\mathbf{d}^{(i)}$, the $i$th iteration is
  $\mathbf{r}^{(i)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(i)}$
  $\alpha^{(i)} = \dfrac{[\mathbf{d}^{(i)}]^T\mathbf{r}^{(i)}}{[\mathbf{d}^{(i)}]^T\mathbf{A}\mathbf{d}^{(i)}}$
  $\mathbf{x}^{(i+1)} = \mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}$
  $\mathbf{r}^{(i+1)} = \mathbf{r}^{(i)} - \alpha^{(i)}\mathbf{A}\mathbf{d}^{(i)}$
• What we have not yet covered is how to get the $n$ search directions. We'll cover that shortly, but the next slide presents an algorithm, followed by an example.

Orthogonalization
• To quickly generate A-orthogonal search directions, one can use the Gram-Schmidt orthogonalization procedure
• Suppose we are given a l.i. set of $n$ vectors $\{\mathbf{u}_0, \mathbf{u}_1, \ldots, \mathbf{u}_{n-1}\}$; successively construct $\mathbf{d}^{(j)}$, $j = 0, 1, \ldots, n-1$, by removing from $\mathbf{u}_j$ all the components along the directions $\mathbf{d}^{(j-1)}, \mathbf{d}^{(j-2)}, \ldots, \mathbf{d}^{(0)}$
• The trick is to use the gradient directions, i.e., $\mathbf{u}_i = \mathbf{r}^{(i)}$ for all $i = 0, 1, \ldots, n-1$, which yields the very popular conjugate gradient method

Conjugate Gradient Method
• Set $i = 0$, $\epsilon > 0$, $\mathbf{x}^{(0)} = \mathbf{0}$, so $\mathbf{r}^{(0)} = \mathbf{b} - \mathbf{A}\mathbf{x}^{(0)} = \mathbf{b}$
• While $\|\mathbf{r}^{(i)}\| \ge \epsilon$ Do
  (a) If $i = 0$ Then $\mathbf{d}^{(0)} = \mathbf{r}^{(0)}$
      Else
        $\beta^{(i)} = \dfrac{[\mathbf{r}^{(i)}]^T\mathbf{r}^{(i)}}{[\mathbf{r}^{(i-1)}]^T\mathbf{r}^{(i-1)}}$
        $\mathbf{d}^{(i)} = \mathbf{r}^{(i)} + \beta^{(i)}\mathbf{d}^{(i-1)}$
      End

Conjugate Gradient Algorithm
  (b) Update the stepsize $\alpha^{(i)} = \dfrac{[\mathbf{d}^{(i)}]^T\mathbf{r}^{(i)}}{[\mathbf{d}^{(i)}]^T\mathbf{A}\mathbf{d}^{(i)}}$
  (c) $\mathbf{x}^{(i+1)} = \mathbf{x}^{(i)} + \alpha^{(i)}\mathbf{d}^{(i)}$
  (d) $\mathbf{r}^{(i+1)} = \mathbf{r}^{(i)} - \alpha^{(i)}\mathbf{A}\mathbf{d}^{(i)}$
  (e) $i := i + 1$
  End While
• Note that there is only one matrix-vector multiply per iteration! (See the code sketch below.)
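Steps (a)–(e) above combine into the short routine below: a minimal NumPy sketch under the same assumptions as before (symmetric positive definite A), with an illustrative function name and tolerance. In exact arithmetic the loop terminates in at most n iterations, which the worked example that follows illustrates for n = 3.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8):
    """Solve A x = b for symmetric positive definite A by conjugate gradient."""
    n = len(b)
    x = np.zeros_like(b, dtype=float)    # x(0) = 0, so r(0) = b
    r = b - A @ x
    d = r.copy()                         # d(0) = r(0)
    rs_old = r @ r
    for _ in range(n):                   # at most n steps in exact arithmetic
        if np.sqrt(rs_old) < tol:
            break
        Ad = A @ d                       # the single matrix-vector product per iteration
        alpha = (d @ r) / (d @ Ad)       # step (b): exact line-search stepsize
        x = x + alpha * d                # step (c)
        r = r - alpha * Ad               # step (d)
        rs_new = r @ r
        d = r + (rs_new / rs_old) * d    # step (a) for the next pass: beta = rs_new / rs_old
        rs_old = rs_new
    return x
```

Only the residual norms and one matrix-vector product are carried between iterations, which is what makes the method attractive for large sparse systems.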
Conjugate Gradient Example
• Using the same system as before, let
  $\mathbf{A} = \begin{bmatrix} 10 & -5 & -4 \\ -5 & 12 & -6 \\ -4 & -6 & 10 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} 10 \\ -20 \\ 15 \end{bmatrix}$
  We are solving for $\mathbf{x} \approx [3.354,\ 1.646,\ 3.829]^T$
• Select $i = 0$, $\mathbf{x}^{(0)} = \mathbf{0}$, $\epsilon = 0.1$; then $\mathbf{r}^{(0)} = \mathbf{b}$
• With $i = 0$, $\mathbf{d}^{(0)} = \mathbf{r}^{(0)} = \mathbf{b}$

Conjugate Gradient Example
• $\alpha^{(0)} = \dfrac{[\mathbf{d}^{(0)}]^T\mathbf{r}^{(0)}}{[\mathbf{d}^{(0)}]^T\mathbf{A}\mathbf{d}^{(0)}} = \dfrac{725}{12450} = 0.0582$
• $\mathbf{x}^{(1)} = \mathbf{x}^{(0)} + \alpha^{(0)}\mathbf{d}^{(0)} = \mathbf{0} + 0.0582\,[10,\ -20,\ 15]^T = [0.582,\ -1.165,\ 0.873]^T$
• $\mathbf{r}^{(1)} = \mathbf{r}^{(0)} - \alpha^{(0)}\mathbf{A}\mathbf{d}^{(0)} = [1.847,\ 2.129,\ 1.606]^T$
• $i := i + 1 = 1$
• This first step exactly matches steepest descent

Conjugate Gradient Example
• With $i = 1$, solve for $\beta^{(1)}$:
  $\beta^{(1)} = \dfrac{[\mathbf{r}^{(1)}]^T\mathbf{r}^{(1)}}{[\mathbf{r}^{(0)}]^T\mathbf{r}^{(0)}} = \dfrac{10.524}{725} = 0.01452$
• $\mathbf{d}^{(1)} = \mathbf{r}^{(1)} + \beta^{(1)}\mathbf{d}^{(0)} = [1.847,\ 2.129,\ 1.606]^T + 0.01452\,[10,\ -20,\ 15]^T = [1.992,\ 1.838,\ 1.824]^T$
• Then $\alpha^{(1)} = \dfrac{[\mathbf{d}^{(1)}]^T\mathbf{r}^{(1)}}{[\mathbf{d}^{(1)}]^T\mathbf{A}\mathbf{d}^{(1)}} = 1.388$

Conjugate Gradient Example
• $\mathbf{x}^{(2)} = \mathbf{x}^{(1)} + \alpha^{(1)}\mathbf{d}^{(1)} = [0.582,\ -1.165,\ 0.873]^T + 1.388\,[1.992,\ 1.838,\ 1.824]^T = [3.348,\ 1.386,\ 3.405]^T$
• $\mathbf{r}^{(2)} = \mathbf{r}^{(1)} - \alpha^{(1)}\mathbf{A}\mathbf{d}^{(1)} = [-2.923,\ 0.532,\ 2.658]^T$
• $i := i + 1 = 2$

Conjugate Gradient Example
• With $i = 2$, solve for $\beta^{(2)}$:
  $\beta^{(2)} = \dfrac{[\mathbf{r}^{(2)}]^T\mathbf{r}^{(2)}}{[\mathbf{r}^{(1)}]^T\mathbf{r}^{(1)}} = \dfrac{15.897}{10.524} = 1.511$
• $\mathbf{d}^{(2)} = \mathbf{r}^{(2)} + \beta^{(2)}\mathbf{d}^{(1)} = [-2.923,\ 0.532,\ 2.658]^T + 1.511\,[1.992,\ 1.838,\ 1.824]^T = [0.086,\ 3.308,\ 5.413]^T$
• Then $\alpha^{(2)} = \dfrac{[\mathbf{d}^{(2)}]^T\mathbf{r}^{(2)}}{[\mathbf{d}^{(2)}]^T\mathbf{A}\mathbf{d}^{(2)}} = 0.078$

Conjugate Gradient Example
• $\mathbf{x}^{(3)} = \mathbf{x}^{(2)} + \alpha^{(2)}\mathbf{d}^{(2)} = [3.348,\ 1.386,\ 3.405]^T + 0.078\,[0.086,\ 3.308,\ 5.413]^T = [3.354,\ 1.646,\ 3.829]^T$
• $\mathbf{r}^{(3)} = \mathbf{r}^{(2)} - \alpha^{(2)}\mathbf{A}\mathbf{d}^{(2)} = [0,\ 0,\ 0]^T$
• $i := i + 1 = 3$
• Done in $3 = n$ iterations!

Krylov Subspace Method
• Recall that in the $i$th iteration of the generic Krylov solver, we want to find $\mathbf{x}^{(i)}$ in $\{\mathbf{x}^{(0)} + \mathbf{K}_i(\mathbf{r}^{(0)},\mathbf{A})\}$ that minimizes $\|\mathbf{r}^{(i)}\| = \|\mathbf{b} - \mathbf{A}\mathbf{x}^{(i)}\|$
• In conjugate gradient, the iterate $\mathbf{x}^{(i)}$ actually minimizes
  $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x}$
  over the linear manifold $\{\mathbf{x}^{(0)} + \mathbf{K}_i(\mathbf{r}^{(0)},\mathbf{A})\}$
• With positive definite $\mathbf{A}$, both methods attain $\mathbf{x}^{(n)} = \mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$
• For a general invertible $\mathbf{A}$, we have to use the Generalized Minimum Residual method (GMRES)

References
• D. P. Bertsekas, Nonlinear Programming, 2nd Edition, Chapter 1, Athena Scientific, 1999
• Y. Saad, Iterative Methods for Sparse Linear Systems, 2002, free online at www.users.cs.umn.edu/~saad/IterMethBook_2ndEd.pdf
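As a quick numerical check of the worked example above, the short script below (illustrative only) confirms that the example matrix is positive definite and that the directly computed solution matches the conjugate gradient result.

```python
import numpy as np

# Example system from the conjugate gradient slides above.
A = np.array([[10., -5., -4.],
              [-5., 12., -6.],
              [-4., -6., 10.]])
b = np.array([10., -20., 15.])

print(np.all(np.linalg.eigvalsh(A) > 0))  # True: A is symmetric positive definite
print(np.linalg.solve(A, b))              # approx. [3.354, 1.646, 3.829]
```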