Numerical Methods I: Linear solvers and least squares
Georg Stadler, Courant Institute, NYU, stadler@cims.nyu.edu
Sep. 24, 2015

Sources/reading
Main reference: Sections 8 and 3 in Deuflhard/Hohmann.

Solving linear systems
We study the solution of linear systems of the form Ax = b with A ∈ R^(n×n), x, b ∈ R^n. We assume that this system has a unique solution, i.e., that A is invertible. Solving linear systems is needed in many applications. Often, we have to solve
- large systems (up to millions of unknowns, and more),
- as fast as possible, and
- accurately and reliably.

Kinds of linear systems
Solvers such as MATLAB's \ take advantage of matrix properties.
Matrix storage:
- Dense matrix storage: all entries are stored as a 1D array (column- or row-wise).
- Sparse matrix storage: most a_ij = 0; only the nonzero entries are stored (indices and values). Sparse matrices occur in many applications.
Matrix application/system solution:
- Dense: the best way to compute Ax is direct summation.
- Fast algorithms for computing Ax for special matrices: FFT, FMM, ...
- Sparse: most a_ij = 0; avoid fill-in in factorizations.
- Structured/unstructured: is the sparsity pattern easy to describe without storing it explicitly?

Kinds of linear systems and solvers
Symmetry, positivity, ...
- Special factorizations for (skew-)symmetric matrices.
- Special factorizations for positive definite matrices (Cholesky).
- Diagonally dominant matrices don't need pivoting.
MATLAB's \ (i.e., UMFPACK) chooses an appropriate algorithm after studying properties of the matrix (details in the "backslash" book: Tim Davis, Direct Methods for Sparse Linear Systems, SIAM, 2006).

[Figure: UMFPACK's decision tree for dense matrices]
[Figure: UMFPACK's decision tree for sparse matrices]

Factorization-based/direct solvers (dense/sparse LU, Cholesky)
- require the matrix to fit into memory,
- require the matrix to be explicitly available (sometimes only a function that applies the matrix to a vector is available),
+ but compute the exact solution (up to rounding error).
Iterative solvers
- find an ε-approximation of the solution,
+ are able to solve very large problems,
+ often only require a function that computes Ax for a given x,
± might be faster or slower than a factorization-based method.

MATLAB demo
- What are the different storage formats (sparse/dense)? Is it always better to use one of them?
- How long does it take to solve sparse/dense systems?
- What is fill-in and how can it be avoided?

Sparse/dense storage:
A=rand(2,2); B=sparse(A); whos
Fill-in:
A=bucky + 4*speye(60); r = symrcm(A);
spy(A); spy(A(r,r)); spy(chol(A)); spy(chol(A(r,r)));
Which sparse solver?
spparms('spumoni',1); A=gallery('poisson',8); b=randn(64,1); A\b;
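A minimal sketch (plain MATLAB, no extra toolboxes assumed) expanding on the demo commands above: it times a dense versus a sparse solve of the same spd system and counts the nonzeros of the Cholesky factor with and without the symrcm reordering, which makes the fill-in and its reduction visible. The grid size is an arbitrary choice.

% Dense vs. sparse solve of the same spd system
n  = 60;                          % arbitrary grid size; A is n^2-by-n^2
A  = gallery('poisson', n);       % sparse spd matrix (2D Laplacian)
b  = randn(size(A,1), 1);
Ad = full(A);                     % same matrix in dense storage
tic; xd = Ad \ b; td = toc;       % dense solve
tic; xs = A  \ b; ts = toc;       % sparse solve
fprintf('dense: %.3f s   sparse: %.3f s\n', td, ts);

% Fill-in: nonzeros of the Cholesky factor, with and without reordering
r = symrcm(A);                    % reverse Cuthill-McKee reordering
fprintf('nnz(chol(A))      = %d\n', nnz(chol(A)));
fprintf('nnz(chol(A(r,r))) = %d\n', nnz(chol(A(r,r))));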
Iterative solution of (symmetric) linear systems
Target problems: very large systems (n = 10^5, 10^6, ...), where A is usually sparse and has specific properties. To solve Ax = b we construct a sequence x_1, x_2, ... of iterates that converges as fast as possible to the solution x, where x_{k+1} can be computed from {x_1, ..., x_k} with as little cost as possible (e.g., one matrix-vector multiplication).

Fixed-point methods
Let Q be invertible. Then
Ax = b  ⇔  Q^(-1)(b - Ax) = 0  ⇔  (I - Q^(-1)A)x + Q^(-1)b = x  ⇔  Gx + c = x.
Theorem: The fixed-point method x_{k+1} = Gx_k + c (with G = I - Q^(-1)A and c = Q^(-1)b) converges for every starting point x_0 if and only if ρ(G) < 1, where ρ(G) is the spectral radius of G, i.e., the largest absolute value of an eigenvalue of G.

Choices for Q:
- Q = I ... Richardson method.
Consider A = L + D + U, where D is diagonal and L and U are lower and upper triangular with zero diagonal. Then:
- Q = D ... Jacobi method
- Q = D + L ... Gauss-Seidel method
Convergence depends on properties of A: Richardson converges if all eigenvalues of A are in (0, 2), Jacobi converges for diagonally dominant matrices, and Gauss-Seidel for spd matrices.

Relaxation methods
Use a linear combination of the new and the previous iterate:
x_{k+1} = ω(Gx_k + c) + (1 - ω)x_k = G_ω x_k + ωc,
where G_ω = ωG + (1 - ω)I and ω ∈ [0, 1] is a damping parameter. The goal is to choose the damping parameter such that ρ(G_ω) is as small as possible. Optimal damping parameters can be computed for Richardson and Jacobi using the eigenvalues of G (see Deuflhard/Hohmann).

Chebyshev acceleration
So far, the new iterate x_{k+1} only depended on x_k. This can be improved by using all previous iterates when computing x_{k+1}. The resulting schemes are called Chebyshev-accelerated methods, and they usually converge faster than the original iterative schemes. Chebyshev methods are based on computing linear combinations
y_k := Σ_{j=0}^{k} v_{kj} x_j
with suitable coefficients v_{kj} such that y_0, y_1, ... converges faster than x_0, x_1, ... Computing the coefficients requires knowledge of the eigenvalues of G.

Krylov methods
Idea: Build a basis for the Krylov subspace spanned by {r_0, Ar_0, A^2 r_0, ...} and reduce the residual optimally in that space.
- spd matrices: conjugate gradient (CG) method
- symmetric matrices: minimal residual method (MINRES)
- general matrices: generalized minimal residual method (GMRES), BiCG, BiCGSTAB
Properties: They do not require eigenvalue estimates; they usually require one matrix-vector multiplication per iteration; convergence depends on the eigenvalue structure of the matrix (clustering of eigenvalues aids convergence). Availability of a good preconditioner is often important. Some methods require storage of the iteration vectors.
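As a concrete illustration of the methods above, here is a minimal MATLAB sketch, assuming a sparse spd test matrix: a plain Jacobi loop written in the fixed-point form with Q = D, and MATLAB's pcg with an incomplete-Cholesky preconditioner as an example of a preconditioned Krylov method. The test matrix, tolerance, and iteration limit are arbitrary choices, not from the slides.

A = gallery('poisson', 32);              % sparse spd test matrix (assumption)
b = randn(size(A,1), 1);
tol = 1e-8;  maxit = 500;                % arbitrary tolerance and iteration limit

% Jacobi iteration (Q = D): x_{k+1} = x_k + D^{-1}(b - A*x_k); converges slowly here
d = full(diag(A));  x = zeros(size(b));
for k = 1:maxit
    r = b - A*x;                         % residual
    if norm(r) <= tol*norm(b), break, end
    x = x + r ./ d;
end

% Preconditioned CG: incomplete Cholesky factor L (L*L' ~ A) as preconditioner
L = ichol(A);
[xcg, flag, relres, iters] = pcg(A, b, tol, maxit, L, L');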
Least-squares problems
Given data points/measurements (t_i, b_i), i = 1, ..., m, and a model function φ that relates t and b:
b = φ(t; x_1, ..., x_n),
where x_1, ..., x_n are model parameters. If the model is supposed to describe the data, the deviations/errors
Δ_i = b_i - φ(t_i; x_1, ..., x_n)
should be small. Thus, to fit the model to the measurements, one must choose x_1, ..., x_n appropriately.

Measuring deviations
Least squares: Find x_1, ..., x_n such that
(1/2) Σ_{i=1}^{m} Δ_i^2 → min.
From a probabilistic perspective, this corresponds to an underlying Gaussian noise model.
Weighted least squares: Find x_1, ..., x_n such that
(1/2) Σ_{i=1}^{m} (Δ_i / δb_i)^2 → min,
where the weights δb_i > 0 contain information about how much we trust the i-th data point.

Alternatives to using squares:
L1 error: Find x_1, ..., x_n such that
Σ_{i=1}^{m} |Δ_i| → min.
The result can be very different, it has a different statistical interpretation, and it is more stable with respect to outliers.
L∞ error: Find x_1, ..., x_n such that
max_{1≤i≤m} |Δ_i| → min.

Linear least squares
We assume (for now) that the model depends linearly on x_1, ..., x_n, e.g.:
φ(t; x_1, ..., x_n) = a_1(t)x_1 + ... + a_n(t)x_n.
Choosing the least-squares error, this results in
min_x ||Ax - b||,
where x = (x_1, ..., x_n)^T, b = (b_1, ..., b_m)^T, and a_ij = a_j(t_i). In the following, we study the overdetermined case, i.e., m ≥ n.

Linear least-squares problems: QR factorization
Consider non-square matrices A ∈ R^(m×n) with m ≥ n and rank(A) = n. Then the system Ax = b does, in general, not have a solution (more equations than unknowns). We thus instead solve the minimization problem
min_x ||Ax - b||_2.
The minimizer x̄ of this optimization problem is characterized by the normal equations:
A^T A x̄ = A^T b.
Solving the normal equations A^T A x̄ = A^T b has two drawbacks:
- computing A^T A costs O(mn^2) operations, and
- the condition number of A^T A is the square of the condition number of A (problematic for the Cholesky factorization).

Conditioning
Solving the normal equations is equivalent to computing Pb, the orthogonal projection of b onto the subspace V spanned by the columns of A. Let x be the solution of the least-squares problem, denote the residual by r = b - Ax, and set
sin(θ) = ||r||_2 / ||b||_2.
The relative condition number κ of x in the Euclidean norm is bounded as follows:
- With respect to perturbations in b: κ ≤ κ_2(A) / cos(θ).
- With respect to perturbations in A: κ ≤ κ_2(A) + κ_2(A)^2 tan(θ).
Small-residual problems (cos(θ) ≈ 1, tan(θ) ≈ 0): behavior similar to a linear system. Large-residual problems (cos(θ) ≪ 1, tan(θ) > 1): behavior essentially different from a linear system.

QR factorization
One would like to avoid forming A^T A and instead use a suitable factorization of A that aids in solving the normal equations, the QR factorization:
A = QR = (Q_1, Q_2) [R_1; 0] = Q_1 R_1,
where Q ∈ R^(m×m) is an orthogonal matrix (QQ^T = I) and R ∈ R^(m×n) consists of an upper triangular matrix R_1 and a block of zeros.

How can the QR factorization be used to solve the normal equations?
min_x ||Ax - b||_2 = min_x ||Q^T(Ax - b)||_2 = min_x || [b_1 - R_1 x; b_2] ||_2,
where Q^T b = [b_1; b_2]. Thus, the least-squares solution is x = R_1^(-1) b_1 and the residual is ||b_2||.

How can we compute the QR factorization?
Givens rotations: use a sequence of rotations in 2D subspaces.
- For m ≈ n: ~n^2/2 square roots and ~(4/3)n^3 multiplications.
- For m ≫ n: ~nm square roots and ~2mn^2 multiplications.
Householder reflections: use a sequence of reflections across hyperplanes.
- For m ≈ n: ~(2/3)n^3 multiplications.
- For m ≫ n: ~2mn^2 multiplications.
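To tie the pieces together, here is a minimal MATLAB sketch of solving an overdetermined linear least-squares problem via the economy-size QR factorization, compared with the normal equations and with backslash. The cubic-polynomial model and the synthetic data are illustrative assumptions, not from the slides.

m = 100;
t = linspace(0, 1, m)';
A = [ones(m,1), t, t.^2, t.^3];     % model matrix with a_j(t) = t^(j-1) (cubic fit, assumption)
b = cos(4*t) + 1e-3*randn(m,1);     % synthetic data

[Q1, R1] = qr(A, 0);                % economy-size QR: A = Q1*R1, Q1 is m-by-n
x_qr = R1 \ (Q1' * b);              % x = R1^{-1} b_1 with b_1 = Q1'*b
res  = norm(b - A*x_qr);            % residual norm, equals ||b_2||

x_ne = (A'*A) \ (A'*b);             % normal equations: squares the condition number
x_bs = A \ b;                       % backslash solves the least-squares problem directly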