Electronic Transactions on Numerical Analysis. Volume 8, 1999, pp. 26-35. Copyright 1997, Kent State University. ISSN 1068-9613.

PRECONDITIONERS FOR LEAST SQUARES PROBLEMS BY LU FACTORIZATION

A. BJÖRCK† AND J. Y. YUAN‡

Abstract. Iterative methods are often suitable for solving least-squares problems $\min \|Ax - b\|_2$, where $A \in \mathbb{R}^{m\times n}$ is large and sparse. The use of the conjugate gradient method with a nonsingular square submatrix $A_1 \in \mathbb{R}^{n\times n}$ of $A$ as preconditioner was first suggested by Läuchli in 1961. This conjugate gradient method has recently been extended by Yuan to generalized least-squares problems. In this paper we consider the problem of finding a suitable submatrix $A_1$ and its LU factorization for a sparse rectangular matrix $A$. We give three algorithms based on the sparse LU factorization algorithm by Gilbert and Peierls. Numerical results are given, which indicate that our preconditioners can be effective.

Key words. Linear least squares, preconditioner, conjugate gradient method, LU factorization.

AMS subject classifications. 65F10, 65F20.

Received November 30, 1997. Accepted for publication December 11, 1998. Recommended by D. Calvetti. The work of the second author was partially supported by CNPq, Brazil.
† Department of Mathematics, Linköping University, S-581 83, Linköping, Sweden (akbjo@math.liu.se)
‡ Department of Mathematics, Universidade Federal do Paraná, Curitiba, Paraná, Brazil (jin@gauss.mat.ufpr.br)

1. Introduction. Consider the linear least-squares problem

(1.1)    $\min_{x\in\mathbb{R}^n} \|Ax - b\|_2, \qquad A\in\mathbb{R}^{m\times n},\; b\in\mathbb{R}^m,\; m\ge n,$

where $\|\cdot\|_2$ denotes the Euclidean vector norm. For large scale sparse problems iterative solution methods are often preferable to direct methods. To improve the rate of convergence, an iterative method can be applied to the preconditioned problem

(1.2)    $\min_{y\in\mathbb{R}^n} \|AS^{-1}y - b\|_2, \qquad y = Sx,$

where the nonsingular matrix $S \in \mathbb{R}^{n\times n}$ is a preconditioner. Here we will consider a class of preconditioners based on a partitioning of $A$,

(1.3)    $PA = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix},$

where $P$ is a permutation matrix and $A_1 \in \mathbb{R}^{n\times n}$ a nonsingular submatrix. Such preconditioners seem first to have been suggested by Läuchli [11], and were later investigated by Chen [4]. For applications see also [6, 7, 8, 12, 13, 15], and the survey in [2]. The detailed construction of preconditioners of this form for sparse problems seems to have been little studied in the literature.

A powerful class of iterative methods is obtained by applying the conjugate gradient method to the normal equations of the preconditioned problem (1.2),

(1.4)    $S^{-T}A^T A S^{-1} y = S^{-T}A^T b.$

Note that for numerical stability care is needed in the implementation of this method. The stability in finite precision of the conjugate gradient and Lanczos methods for least squares problems is discussed in [3].
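To make the construction concrete, the following minimal NumPy sketch (ours, not from the paper) applies the submatrix preconditioner $S = A_1$ of (1.2)-(1.3) to a small dense example; taking the first $n$ rows of $A$ as $A_1$ is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Partition PA = [A1; A2] with P = I: the first n rows form A1
# (assumed nonsingular here), the remaining m - n rows form A2.
A1, A2 = A[:n, :], A[n:, :]

# Preconditioned problem (1.2) with S = A1:  min || A A1^{-1} y - b ||_2,
# where A A1^{-1} = [I; A2 A1^{-1}].
y = np.linalg.lstsq(A @ np.linalg.inv(A1), b, rcond=None)[0]
x_pre = np.linalg.solve(A1, y)               # recover x from A1 x = y

x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_pre, x_ref))             # True: same least-squares solution
```

Of course, forming $A_1^{-1}$ explicitly is done here only for clarity; the point of the paper is to apply $A_1^{-1}$ through a sparse LU factorization of $A_1$.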
Yuan [18] recently presented block iterative methods which use $A_1$ as preconditioner for the generalized least-squares problem

(1.5)    $\min_{x\in\mathbb{R}^n} (Ax-b)^T W^{-1}(Ax-b),$

where $W$ is symmetric and positive definite. The solution to this problem satisfies the normal equations $A^T W^{-1}Ax = A^T W^{-1}b$. Introducing the scaled residual vector $r = W^{-1}(b - Ax)$, these can be written in augmented form as

(1.6)    $Wr + Ax = b, \qquad A^T r = 0.$

An outline of the paper is as follows. In Section 2 we give two basic conjugate gradient methods using $A_1$ in (1.3) as preconditioner, and give bounds for the rate of convergence. We also show how one of these can be adapted to solve the generalized least squares problem (1.5). An algorithm for selecting $n$ linearly independent rows from $A$ to form $A_1$, and performing a sparse LU factorization of $A_1$, is outlined in Section 3. In order to save multiplications and storage, we use a modification of the algorithm of Gilbert and Peierls [9]. In order to ensure the numerical stability of this algorithm, we consider using partial pivoting. In Section 4 the LU factorization algorithms are tested on some sparse rectangular matrices from the Harwell-Boeing sparse matrix collection [5]. Numerical comparisons between the conjugate gradient method and the preconditioned conjugate gradient method, with the preconditioner $A_1$ given by our algorithms, are also made for the generalized least squares problem (1.5) where $A$ and $W$ are Hilbert matrices. These comparisons show that the preconditioned conjugate gradient method is much better than the unpreconditioned one.

2. CG methods preconditioned by LU factorizations. Iterative methods using $A_1$ in (1.3) as preconditioner often have very good convergence properties. However, the row permutation matrix $P$ must be chosen so that the $n$ rows of $A_1$ are linearly independent and $AA_1^{-1}$ is well conditioned. In this section we assume for simplicity that the permutation $P$ in (1.3) has been carried out in advance, and that we have computed the matrix $C = A_2A_1^{-1} \in \mathbb{R}^{(m-n)\times n}$, where

(2.1)    $AA_1^{-1} = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} A_1^{-1} = \begin{pmatrix} I_n \\ C \end{pmatrix}.$

Then the preconditioned problem can be written in the form

(2.2)    $\min_{y} \left\| \begin{pmatrix} I_n \\ C \end{pmatrix} y - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \right\|_2, \qquad y = A_1x, \quad C = A_2A_1^{-1},$

where $b$ has been partitioned conformingly with $A$. The normal equations for this problem are

(2.3)    $(I_n + C^T C)\,y = b_1 + C^T b_2,$

and $x$ can then be retrieved from $A_1x = y$. Läuchli [11] proposed to apply the conjugate gradient algorithm to this system of equations. This leads to the following algorithm:

LU PRECONDITIONED CGLSI.
Set $r_1^{(0)} = b_1$, $r_2^{(0)} = b_2$, $p^{(0)} = s^{(0)} = b_1 + C^T b_2$, $\gamma_0 = \|s^{(0)}\|_2^2$, and for $k = 0, 1, 2, \dots$ while $\gamma_k > tol$ compute

    $q^{(k)} = Cp^{(k)},$
    $\alpha_k = \gamma_k/(\|p^{(k)}\|_2^2 + \|q^{(k)}\|_2^2),$
    $r_1^{(k+1)} = r_1^{(k)} - \alpha_k p^{(k)},$
    $r_2^{(k+1)} = r_2^{(k)} - \alpha_k q^{(k)},$
    $s^{(k+1)} = r_1^{(k+1)} + C^T r_2^{(k+1)},$
    $\gamma_{k+1} = \|s^{(k+1)}\|_2^2,$
    $\beta_k = \gamma_{k+1}/\gamma_k,$
    $p^{(k+1)} = s^{(k+1)} + \beta_k p^{(k)}.$

Solve $x$ from $A_1x = b_1 - r_1$.
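The following NumPy sketch of CGLSI follows the steps above; it is our own transcription (dense $C$, simple stopping rule), so the function name and defaults are not from the paper.

```python
import numpy as np

def cglsi(C, b1, b2, A1, tol=1e-12, maxit=500):
    """CG on (I + C^T C) y = b1 + C^T b2; returns x with A1 x = y."""
    r1, r2 = b1.astype(float), b2.astype(float)
    s = r1 + C.T @ r2
    p = s.copy()
    gamma = s @ s
    for _ in range(maxit):
        if gamma <= tol:
            break
        q = C @ p
        alpha = gamma / (p @ p + q @ q)
        r1 = r1 - alpha * p
        r2 = r2 - alpha * q
        s = r1 + C.T @ r2
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return np.linalg.solve(A1, b1 - r1)   # A1 x = y = b1 - r1

# Example, reusing A, A1, A2, b, n from the sketch above:
# C = A2 @ np.linalg.inv(A1)
# x = cglsi(C, b[:n], b[n:], A1)
```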
It is well known (see Björck [2, Chap. 7.4]) that for the conjugate gradient method applied to the preconditioned normal equations (1.4) the error is reduced according to

$\|A(x - x_k)\|_2 \le 2\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k \|A(x - x_0)\|_2,$

where $x_0$ is the initial approximation and $\kappa = \kappa(S^{-T}A^TAS^{-1})$ is the condition number of the preconditioned normal matrix. The convergence of the conjugate gradient method applied to (2.3) has been studied by Freund [8]. The eigenvalues of $(I_n + C^TC)$ are

(2.4)    $\lambda_i = 1 + \sigma_i^2(C), \qquad i = 1,\dots,n,$

where $\sigma_i(C)$ are the singular values of $C$. It follows that

$\kappa(AA_1^{-1}) = \frac{\sqrt{1+\sigma_1^2(C)}}{\sqrt{1+\sigma_n^2(C)}} \le \sqrt{1+\tau^2},$

where

(2.5)    $\tau = \sigma_1(C) = \|C\|_2 = \|A_2A_1^{-1}\|_2 \le \frac{\sigma_{\max}(A_2)}{\sigma_{\min}(A_1)}.$

Hence, the standard error bound based on Chebyshev polynomials for this CG method is (see Freund [8])

(2.6)    $\frac{\|A(x - x_k)\|_2}{\|A(x - x_0)\|_2} < \frac{2\theta^{2k}}{1+\theta^{4k}}, \qquad \theta = \frac{\tau}{1+\sqrt{1+\tau^2}}.$

To get fast convergence we want $\tau$ to be small. According to (2.5) this is achieved if the blocking (1.3) can be chosen so that $\|A_2\|_2$ is small and $A_1$ is well-conditioned. Since $C$ has at most $p = \min\{m-n, n\}$ distinct singular values, the matrix $(I + C^TC)$ will have at most $\min\{p+1, n\}$ distinct eigenvalues. Hence, in exact arithmetic, CGLSI will converge in at most $\min\{p+1, n\}$ steps. Therefore, in particular, we can expect rapid convergence if $p \ll n$.

An alternative is obtained by eliminating $r_1$ from the relations

$r_1 = b_1 - y = -C^Tr_2, \qquad r_2 = b_2 - Cy.$

Then we get a system of dimension $(m-n)\times(m-n)$ for $r_2$,

(2.7)    $(CC^T + I_{m-n})\,r_2 = b_2 - Cb_1,$

and $x$ can be retrieved from

(2.8)    $A_1x = b_1 + C^Tr_2.$

The system (2.7) can be interpreted as the normal equations for

(2.9)    $\min_{r_2} \left\| \begin{pmatrix} -C^T \\ I_{m-n} \end{pmatrix} r_2 - \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \right\|_2.$

If $(m-n)$ is small, the system (2.7) can be cheaply solved by Cholesky factorization. A QR decomposition of the matrix $(-C \;\; I_{m-n})^T$ of size $m\times(m-n)$ can also be used to solve for $r_2$. Note that the special form of this matrix allows for further economizing in the QR decomposition; see [2, Sec. 2.7.2].

The conjugate gradient method can be adapted to solve the system (2.7). The resulting algorithm is as follows:

LU PRECONDITIONED CGLSII.
Set $v_1^{(0)} = b_1$, $v_2^{(0)} = b_2$, $p^{(0)} = s^{(0)} = b_2 - Cb_1$, $\gamma_0 = \|s^{(0)}\|_2^2$, and for $k = 0, 1, 2, \dots$ while $\gamma_k > tol$ compute

    $q^{(k)} = -C^Tp^{(k)},$
    $\alpha_k = \gamma_k/(\|p^{(k)}\|_2^2 + \|q^{(k)}\|_2^2),$
    $v_1^{(k+1)} = v_1^{(k)} - \alpha_k q^{(k)},$
    $v_2^{(k+1)} = v_2^{(k)} - \alpha_k p^{(k)},$
    $s^{(k+1)} = v_2^{(k+1)} - Cv_1^{(k+1)},$
    $\gamma_{k+1} = \|s^{(k+1)}\|_2^2,$
    $\beta_k = \gamma_{k+1}/\gamma_k,$
    $p^{(k+1)} = s^{(k+1)} + \beta_k p^{(k)}.$

Solve $A_1x = b_1 + C^T(b_2 - v_2)$ for $x$.

Algorithm CGLSII requires about the same storage and work as CGLSI. Since the eigenvalues of $(I + CC^T)$ are the same as those of $(I + C^TC)$, the convergence rate will also be the same. Yuan [18] has shown that an advantage of CGLSII is that it can more easily be adapted to solve generalized least squares problems. For the generalized problem (1.5) the normal equations (2.3) become

(2.10)    $(I_n \;\; C^T)\,W^{-1}\begin{pmatrix} I_n \\ C \end{pmatrix} y = (I_n \;\; C^T)\,W^{-1}b, \qquad A_1x = y.$

On the other hand, the equations (2.7) and (2.8) generalize to

(2.11)    $(-C \;\; I_{m-n})\,W\,(-C \;\; I_{m-n})^T r_2 = b_2 - Cb_1,$

(2.12)    $A_1x = b_1 - (I_n \;\; 0)\,W\,(-C \;\; I_{m-n})^T r_2.$

Here the symmetric system (2.11) can be solved by the conjugate gradient method without the need of inverting or factorizing the full covariance matrix $W$. Yuan shows in [18] that the standard error bound based on Chebyshev polynomials for this CG method is

(2.13)    $\frac{\|A(x - x_k)\|_{W^{-1}}}{\|A(x - x_0)\|_{W^{-1}}} < \frac{2\theta^{2k}}{1+\theta^{4k}}, \qquad \theta^2 = \frac{\kappa(1+\tau^2)-1}{\bigl(1+\sqrt{\kappa(1+\tau^2)}\bigr)^2},$

where $\kappa = \kappa(W)$ is the spectral condition number of $W$, and $\|x\|_{W^{-1}} = (x^TW^{-1}x)^{1/2}$. (For $\kappa = 1$ this estimate reduces to (2.6).)

THEOREM 2.1. Let $\lambda_n$ and $\mu_n$ be the smallest eigenvalues of $A^TA$ and $A_1^TA_1$, respectively. Then

$\kappa(I + C^TC) \le \kappa(A^TA)\,\frac{\lambda_n}{\mu_n}.$

Proof. Let $\lambda_1 = \max_{\|x\|_2=1} x^TA^TAx$ be the largest eigenvalue of $A^TA$, and let $\nu_1 = \max_{\|x\|_2=1} x^T(I + C^TC)x$ be the largest eigenvalue of $I + C^TC$, with associated eigenvector $A_1x_1$, normalized so that $\|x_1\|_2 = 1$. Since $A^TA = A_1^T(I + C^TC)A_1$, we have

$\lambda_1 \ge x_1^TA_1^T(I + C^TC)A_1x_1 = \nu_1\,x_1^TA_1^TA_1x_1 \ge \nu_1\,\mu_n.$

The smallest eigenvalue $\nu_n = \min_{\|x\|_2=1} x^T(I + C^TC)x$ satisfies $\nu_n \ge 1$. Hence,

$\kappa(I + C^TC) = \frac{\nu_1}{\nu_n} \le \nu_1 \le \frac{\lambda_1}{\mu_n} = \kappa(A^TA)\,\frac{\lambda_n}{\mu_n}.$

REMARK 1. If $x_1$ is not an eigenvector of $A^TA$ associated with $\lambda_1$, then

$\kappa(I + C^TC) < \kappa(A^TA)\,\frac{\lambda_n}{\mu_n}.$

Thus we should choose the preconditioner $A_1$ as well as possible, so that $\mu_n \ge \lambda_n$, in order to guarantee $\kappa(I + C^TC) \le \kappa(A^TA)$.
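As a quick numerical sanity check of (2.4) and Theorem 2.1, the following script (ours, not part of the paper) can be used:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 4
A = rng.standard_normal((m, n))
A1, A2 = A[:n, :], A[n:, :]
C = A2 @ np.linalg.inv(A1)

# (2.4): the eigenvalues of I + C^T C are 1 + sigma_i(C)^2.
sigma = np.linalg.svd(C, compute_uv=False)
sig2 = np.zeros(n)
sig2[:len(sigma)] = sigma[:n] ** 2
eigs = np.linalg.eigvalsh(np.eye(n) + C.T @ C)
print(np.allclose(np.sort(eigs), np.sort(1.0 + sig2)))     # True

# Theorem 2.1: cond(I + C^T C) <= cond(A^T A) * lambda_n / mu_n.
lam = np.linalg.eigvalsh(A.T @ A)      # ascending: lam[0] = lambda_n
mu = np.linalg.eigvalsh(A1.T @ A1)     # ascending: mu[0]  = mu_n
kappa_pre = np.linalg.cond(np.eye(n) + C.T @ C)
bound = (lam[-1] / lam[0]) * (lam[0] / mu[0])
print(kappa_pre <= bound * (1 + 1e-10))                    # True
```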
3. LU-decomposition algorithms. Läuchli [11] used Gauss-Jordan elimination to explicitly compute the matrix $C = A_2A_1^{-1}$ in (2.1). Then at each iteration step of the preconditioned conjugate gradient methods, matrix-vector products with $C$ and $C^T$ are computed. This approach is suitable only when $(m-n)$ is not too large compared to $n$.

When $A$ has many more rows than columns it is preferable to factorize just $A_1$ rather than $A$, and not to form the matrix $C = A_2A_1^{-1}$ explicitly. If $C$ is not formed, we need to perform in each iterative step two matrix-vector multiplications, with $A_2$ and $A_2^T$, and to solve two linear systems of the form

(3.1)    $A_1y = c \quad\text{and}\quad A_1^Tz = d.$

To do this efficiently we need to compute the LU factorization of $A_1$. We could compute the LU factorization of the complete matrix $A$ and then extract the LU factorization of $A_1$. However, this would require extra computation and storage. Further, we would need an auxiliary array to save a copy of $A_2$.

We will use the sparse LU-decomposition algorithm of Gilbert and Peierls [9], modified so that we get an LU decomposition of $A_1$ only. This algorithm performs Gaussian elimination in column-wise order, and breaks the computation of each column into a symbolic and a numerical stage. The $j$th column of $L$ and $U$ is computed by solving a sparse triangular system. This approach allows partial pivoting by rows to be performed in time proportional to the arithmetic operations.

The algorithm of Gilbert and Peierls can be modified to work for an $n\times m$ matrix with $m \ge n$. For our least squares problems the matrix $A$ is $m\times n$, with $m \ge n$. Therefore we apply Gilbert and Peierls' algorithm to the transposed matrix $A^T$ to obtain a factorization $A_1 = U^TL^T$ of an $n\times n$ nonsingular submatrix. To reduce fill-in we first sort the rows of $A$ by increasing nonzero count.

We use notation similar to that of Gilbert and Peierls [9] to describe the basic algorithm. Thus $j$ is the index of the current step; at step $j$ the $j$th columns of $L$ and $U$ are determined from a row of $A$. We split that row as

$a_j = (a_{j1},\dots,a_{j,j-1})^T, \qquad a'_j = (a_{jj},\dots,a_{jn})^T,$

and define

$L_j = \begin{pmatrix} l_{11} & & 0 \\ \vdots & \ddots & \\ l_{j-1,1} & \cdots & l_{j-1,j-1} \end{pmatrix}, \qquad L'_j = \begin{pmatrix} l_{j,1} & \cdots & l_{j,j-1} \\ \vdots & & \vdots \\ l_{n,1} & \cdots & l_{n,j-1} \end{pmatrix},$

together with $l'_j = (l_{jj},\dots,l_{nj})^T$ and $u_j = (u_{1j},\dots,u_{j-1,j})^T$, where $l_{jj} = 1$. We also use $c_j = (c_{jj},\dots,c_{nj})^T$ for an intermediate result.

In the basic algorithm SP1, if $|u_{jj}| \le \delta$, where $\delta$ is a small tolerance, we consider the $j$th column of $A^T$ as linearly dependent on the first $j-1$ columns. Such a column is taken out of all further computations in the factorization and interchanged with a later column. This is repeated until all $n$ columns have been processed or more than $m-n$ linearly dependent columns have been found. The output of Algorithm SP1 is $A_1 = U^TL^T$, where $L$ is unit lower triangular.

ALGORITHM SP1.
1. Set initial values $l = m$, $index(i) = i$, $i = 1,\dots,m$;
2. For $j = 1$ to $n$ do (compute column $j$ of $U$ and $L$):
   2.1 solve $L_ju_j = a_j$ for $u_j$ ($a_j$ and $a'_j$ together contain the $j$th row of $A$);
   2.2 $c_j = a'_j - L'_ju_j$;
   2.3 if $|c_{jj}| > \delta$ then go to 2.5, else interchange the $j$th column and the $l$th column of $A^T$, and update $index$ and $l$;
   2.4 if $l \le j$ then $rank(A) < n$ and stop, else go to 2.1;
   2.5 $u_{jj} = c_{jj}$; $l'_j = c_j/u_{jj}$.

The sparse triangular solve in step 2.1 proceeds as follows (see [9]): set $u_j = a_j$; then, for each $u_{kj} \ne 0$ (in the topological order determined in the symbolic step), update

$u_{ij} := u_{ij} - u_{kj}\,l_{ik}, \qquad i = k+1,\dots,j-1.$

Algorithm SP1 computes the elements of the LU decomposition of $A_1$ in a column-wise fashion.
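The following dense Python sketch (ours) illustrates only the row-selection idea behind Algorithm SP1: candidate rows of $A$ are accepted one at a time, and a row whose pivot after elimination falls below the tolerance $\delta$ is rejected and a later row is tried instead. It ignores everything that makes SP1 practical (sparse data structures, the symbolic stage and its topological ordering), and the rejection strategy is simplified.

```python
import numpy as np

def select_A1(A, delta=0.0):
    """Pick n rows of A forming a nonsingular A1 and factor A1^T = L U
    (L unit lower triangular), in the spirit of Algorithm SP1 (dense)."""
    m, n = A.shape
    selected = []
    L = np.zeros((n, n))
    U = np.zeros((n, n))
    j, k = 0, 0                        # j: current step, k: next candidate row
    while j < n:
        if k >= m:
            raise ValueError("rank(A) < n: no nonsingular A1 found")
        a = A[k, :]                    # candidate row of A = column j of A1^T
        u = np.zeros(j)
        for i in range(j):             # solve L[:j, :j] u = a[:j] (unit diagonal)
            u[i] = a[i] - L[i, :i] @ u[:i]
        c = a[j:] - L[j:, :j] @ u      # c_j = a'_j - L'_j u_j
        k += 1
        if abs(c[0]) <= delta:         # pivot too small: reject row, try the next
            continue
        U[:j, j], U[j, j] = u, c[0]
        L[j:, j] = c / c[0]            # l_jj = 1
        selected.append(k - 1)
        j += 1
    return selected, L, U              # A[selected, :].T equals L @ U (up to roundoff)

# A1 = A[selected, :]; the remaining rows of A form A2.
```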
We now consider the LU decomposition of $A_1$ as $U^TL^T$ where instead $U$ is unit upper triangular, so that the diagonal elements are carried by $L$. With $U_j$ and $U'_j$ defined analogously to $L_j$ and $L'_j$, and $l_j$, $u'_j$, $c_j$ partitioned accordingly, the steps become:

ALGORITHM SP2.
1. Set initial values $l = m$, $index(i) = i$, $i = 1,\dots,m$;
2. For $j = 1,\dots,n$ do
   2.1 solve $U_jl_j = a_j$ for $l_j$;
   2.2 $c_j = a'_j - U'_jl_j$;
   2.3 if $|c_{jj}| > \delta$ then go to 2.5, else interchange the $j$th row and the $l$th row of $A$, and update $index$ and $l$;
   2.4 if $l \le j$ then $rank(A) < n$ and stop, else go to 2.1;
   2.5 $l_{jj} = c_{jj}$; $u'_j = c_j/l_{jj}$;
   2.6 $l = m$.

Algorithm SP3 performs partial pivoting, and is obtained by making the following changes in Algorithm SP2:

ALGORITHM SP3.
Before step 2.3 insert
   2.3a $k = \arg\max_{j\le i\le n} |c_{ji}|$ and $s_1 = c_{jk}$;
Change step 2.6 to
   2.6 if $k = j$ go to 2.7, else exchange $c_{jj}$, $index1(j)$ and $u_{ij}$ ($i = 1,\dots,j-1$) with $c_{jk}$, $index1(k)$ and $u_{ik}$ ($i = 1,\dots,j-1$), respectively;
   2.7 $l_{jj} = c_{jj}$; $u'_j = c_j/l_{jj}$.
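Once one of SP1-SP3 has produced the factors of $A_1$ (say $A_1^T = LU$ with $L$ unit lower triangular, as delivered by SP1), the products with $C = A_2A_1^{-1}$ and $C^T$ needed in each CG step can be applied through the two triangular solves in (3.1), without ever forming $C$. A dense SciPy sketch under our own naming (`select_A1` refers to the hypothetical sketch given after Algorithm SP1 above):

```python
import numpy as np
from scipy.linalg import solve_triangular

def make_C_ops(A, selected, L, U):
    """Return functions applying C = A2 A1^{-1} and C^T via A1^T = L U."""
    mask = np.ones(A.shape[0], dtype=bool)
    mask[selected] = False
    A2 = A[mask, :]

    def C_mul(p):                      # C p = A2 x, where A1 x = p
        # A1 = U^T L^T, so A1 x = p  <=>  U^T w = p, then L^T x = w
        w = solve_triangular(U, p, trans='T', lower=False)
        x = solve_triangular(L, w, trans='T', lower=True, unit_diagonal=True)
        return A2 @ x

    def CT_mul(q):                     # C^T q = z, where A1^T z = A2^T q
        d = A2.T @ q
        # A1^T = L U, so A1^T z = d  <=>  L w = d, then U z = w
        w = solve_triangular(L, d, lower=True, unit_diagonal=True)
        z = solve_triangular(U, w, lower=False)
        return z

    return C_mul, CT_mul

# With b1 = b[selected] and b2 the remaining components of b, these operators
# can replace the explicit C in the CGLSI/CGLSII sketches given earlier.
```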
Finally we mention an alternative approach, suggested by Saunders [16]: use Gaussian elimination with row interchanges to compute a stable, sparse factorization $PA = LU$, where $L \in \mathbb{R}^{m\times n}$ is unit lower trapezoidal and $U$ is upper triangular, and use $U$ as preconditioner. The rationale for this choice is that any ill-conditioning in $A$ is usually reflected in $U$, whereas $L$ tends to be well conditioned. A preliminary pass through the rows of $A$ can be made to select a triangular subset with maximal diagonal elements. The matrix $L$ is not saved, and subsequent use of the operator $AU^{-1}$ involves back-substitution with $U$ and multiplication with $A$. This approach has the advantage that often there is very little fill-in in $U$, and hence $U$ is likely to contain fewer nonzeros than $A$.

4. Numerical experiments. All programs were written in FORTRAN 77 and executed in double precision. The computations were performed on a SUN workstation running UNIX at the Information Laboratory of the Federal University of Paraná, Curitiba, Brazil.

The three LU factorization algorithms of Section 3 were tested on some sparse rectangular matrices from the Harwell-Boeing sparse matrix collection of Duff, Grimes and Lewis [5]. The numerical results are given in Tables 4.1-4.3 for the case $\delta = 0$. The CPU time is the time in seconds for the factorization, and $m$ and $n$ are the numbers of rows and columns of the sparse matrix $A$. NZ is the number of nonzero elements in the matrix $A$ and NZ2 the number of nonzero elements in the submatrix $A_2$. NZL and NZU are the numbers of nonzero elements of the triangular factors $L$ and $U$ in the LU decomposition of $A_1$.

We also made numerical comparisons between the conjugate gradient method and the preconditioned conjugate gradient method, with the preconditioner $A_1$ selected by the algorithms of the previous section, for generalized least squares problems (GLSP) of the form (1.5) in which $A$ and $W$ are Hilbert matrices and $b$ is a random vector, with convergence tolerance $tol = 10^{-7}$. The comparison results are given in Table 4.4.

Algorithm SP1 gives almost the same number of nonzero elements in the LU decomposition as SP2, but needs much more CPU time. Algorithms SP2 and SP3 are faster than SP1 by a factor of 5-10, but SP3 has more fill-in. We have also considered $\delta \ne 0$, e.g., $\delta = 10^{-10}$ and $10^{-8}$. All three algorithms worked for the matrices tested, and gave an LU decomposition of a nonsingular submatrix $A_1$. All results have shown that, for all three algorithms, the larger the tolerance $\delta$, the smaller the CPU time and the larger the number of nonzero elements in the LU decomposition of $A_1$. It follows from Table 4.4 that the preconditioned conjugate gradient method with the selected preconditioner $A_1$ is much better than the conjugate gradient method in the number of iterations, in CPU time, and also in the accuracy of the solutions.

REFERENCES

[1] J. L. Barlow, N. K. Nichols and R. J. Plemmons, Iterative methods for equality-constrained least squares problems, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 892-906.
[2] A. Björck, Numerical Methods for Least Squares Problems, Frontiers in Applied Mathematics, SIAM, Philadelphia, 1996.
[3] A. Björck, T. Elfving and Z. Strakoš, Stability of conjugate gradient and Lanczos methods for linear least squares problems, SIAM J. Matrix Anal. Appl., 19 (1998), pp. 720-736.
[4] Y. T. Chen, Iterative methods for linear least squares problems, Technical Report CS-75-04, University of Waterloo, Canada, 1975.
[5] I. S. Duff, R. G. Grimes and J. G. Lewis, User's guide for the Harwell-Boeing sparse matrix test problems collection, Technical Report RAL-92-086, Rutherford Appleton Laboratory, 1992.
[6] D. J. Evans and C. Li, Numerical aspects of the generalized cg-method applied to least squares problems, Computing, 41 (1989), pp. 171-178.
[7] D. J. Evans and C. Li, The theoretical aspects of the gcg-method applied to least-squares problems, Inter. J. Comput. Math., 35 (1990), pp. 207-229.
[8] R. Freund, A note on two block SOR methods for sparse least squares problems, Linear Algebra Appl., 88/89 (1987), pp. 211-221.
[9] J. R. Gilbert and T. Peierls, Sparse partial pivoting in time proportional to arithmetic operations, SIAM J. Sci. Statist. Comput., 9 (1988), pp. 862-874.
[10] A. Jennings and M. A. Ajiz, Incomplete methods for solving A^T Ax = b, SIAM J. Sci. Statist. Comput., 5 (1984), pp. 978-987.
[11] P. Läuchli, Jordan-Elimination und Ausgleichung nach kleinsten Quadraten, Numer. Math., 3 (1961), pp. 226-240.
[12] T. L. Markham, M. Neumann and R. J. Plemmons, Convergence of a direct-iterative method for large-scale least-squares problems, Linear Algebra Appl., 69 (1985), pp. 155-167.
[13] W. Niethammer, J. de Pillis and R. S. Varga, Convergence of block iterative methods applied to sparse least-squares problems, Linear Algebra Appl., 58 (1984), pp. 327-341.
[14] C. C. Paige, Fast numerically stable computations for generalized least squares problems, SIAM J. Numer. Anal., 16 (1979), pp. 165-171.
[15] E. P. Papadopoulou, Y. G. Saridakis and T. S. Papatheodorou, Block AOR iterative schemes for large-scale least-squares problems, SIAM J. Numer. Anal., 26 (1989), pp. 637-660.
[16] M. A. Saunders, Sparse least squares by conjugate gradients: a comparison of preconditioning methods, in Proceedings of Computer Science and Statistics: Twelfth Annual Conference on the Interface, Waterloo, Canada, 1979.
[17] J.-Y. Yuan, The convergence of the 2-block SAOR method for the least-squares problem, Appl. Numer. Math., 11 (1993), pp. 429-441.
[18] J.-Y. Yuan, Iterative Methods for the Generalized Least-Squares Problem, Ph.D. thesis, Instituto de Matemática Pura e Aplicada, Rio de Janeiro, Brazil, 1993.
[19] J.-Y. Yuan, Numerical methods for generalized least-squares problems, J. Comp. Appl. Math., 66 (1996), pp. 571-584.
[20] J.-Y. Yuan and A.-N. Iusem, Preconditioned conjugate gradient method for generalized least squares problems, J. Comp. Appl. Math., 71 (1996), pp. 287-297.
TABLE 4.1
Algorithm SP1 with delta = 0.

Matrix       m      n     NZ     NZ2    NZL     NZU     CPU
ILLC1033    1033    320   4732   3254   1100    1774    8.96875
WELL1033    1033    320   4732   3257   1243    1752    8.17578
ILLC1850    1850    712   8758   5378   3603    4927    144.891
WELL1850    1850    712   8758   5388   4152    5301    153.930

TABLE 4.2
Algorithm SP2 with delta = 0.

Matrix       m      n     NZ     NZ2    NZL     NZU     CPU
ILLC1033    1033    320   4732   3214   1623    654     1.82031
WELL1033    1033    320   4732   3214   1802    844     1.02344
ILLC1850    1850    712   8758   5377   8652    2834    25.9297
WELL1850    1850    712   8758   5377   9073    2627    11.6280

TABLE 4.3
Algorithm SP3 with delta = 0.

Matrix       m      n     NZ     NZ2    NZL     NZU     CPU
ILLC1033    1033    320   4732   3253   2865    2450    1.92188
WELL1033    1033    320   4732   3256   3100    2976    1.80078
ILLC1850    1850    712   8758   5379   13054   12579   11.4609
WELL1850    1850    712   8758   5388   13058   12581   12.7108

TABLE 4.4
Comparison between CG and PCG for the GLSP.

m   n   CG IT   CG CPU    CG e          PCG IT   PCG CPU   PCG e
9   8     21    0.18662   0.2794D-08       1     0.0055    0.1434D-07
9   7     14    0.19877   0.6519D-08       4     0.0055    0.5560D-07
9   6     10    0.22156   0.6519D-08       7     0.0396    0.3817D-07
9   5      7    0.15960   0.3725D-08       1     0.0055    0.1368D-06
8   8     20    0.21428   0.1397D-07       0     0.0055    0.1760D-08
8   7     14    0.22700   0.2352D-07       1     0.0055    0.1499D-07
8   6     11    0.14108   0.6985D-09       4     0.0391    0.7903D-08
8   5      7    0.21004   0.9313D-08       6     0.0395    0.7047D-08