LAPACK-Style Codes for the QR Factorization of Banded Matrices Alfredo Remón-Gómez, Enrique S. Quintana-Ortı́, Gregorio Quintana-Ortı́ {remon,quintana,gquintan}@icc.uji.es Universidad Jaume I de Castellón (Spain) 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing QR Factorization of Banded Matrices 1 PARA’08 Outline 1 The QR factorization Householder transformations for the QR factorization 2 Routines for the QR factorization of general band matrices Routine gbbqrf 3 Experimental results and conclusions QR Factorization of Banded Matrices 2 PARA’08 The QR factorization QR factorization QR factorization for general matrices Given A ∈ Rm×n , with m ≥ n, its QR factorization is given by A = Q · R, Where Q ∈ Rm×m is orthogonal and R ∈ Rm×n is singular QR Factorization of Banded Matrices 3 PARA’08 The QR factorization QR factorization for general band matrices QR factorization for general band matrices Given A ∈ Rm×n , with upper and lower bandwidth ku and kl respectively, then if QR factorization with Householder transformations is computed carefully, Q is lower triangular with lower bandwidth kl R is upper triangular with upper bandwidth ku + kl QR Factorization of Banded Matrices 4 PARA’08 Householder transformations for the QR factorization The QR factorization Householder transformations The Householder transformations can be employed to annihilate elements from a vector or matrix Householder transformation Consider vector x = χx10 Then, I − βuuT x = χ¯0 0 .. . = 0 With β = 2 , uT u u= 1 x1 /v0 −sign(χ0 )kxk2 0 .. . 0 and v0 = χ0 + sign(x0 )kxk2 Where I − βuuT x is the Householder transformation QR Factorization of Banded Matrices 5 PARA’08 Householder transformations for the QR factorization The QR factorization Householder transformations for the QR factorization Objective Transform the general band matrix A into an upper triangular band matrix ku ku kl kl 0 After j steps 0 I − βu1 uT1 kl j 0 QR Factorization of Banded Matrices I − βu2 uT2 ... I − βuj uTj · A 6 PARA’08 Householder transformations for the QR factorization The QR factorization Householder transformations for the QR factorization Objective Transform the general band matrix A into an upper triangular band matrix ku kl ku kl kl 0 After k steps 0 0 k I − βu1 uT1 · · · I − βuj uTj · · · I − βuk ukT · A QR Factorization of Banded Matrices 7 PARA’08 Householder transformations for the QR factorization The QR factorization Householder transformations for the QR factorization Objective Transform the general band matrix A into an upper triangular band matrix ku ku kl 0 kl kl 0 After n steps 0 0 I − βu1 uT1 QR Factorization of Banded Matrices I − βu2 uT2 ... I − βun uTn · A 8 PARA’08 Routines for the QR factorization of general band matrices New routines for the QR factorization of general band matrices New routines implemented gbbqr2 Unblocked routine Based on BLAS-2 kernels gbbqrf Blocked routine Based on BLAS-3 kernels We will focus on the blocked routine gbbqrf QR Factorization of Banded Matrices 9 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices Routine gbbqrf Functionality Computes the QR factorization of a general band matrix Main characteristics Exploits the matrix structure so the memory requirements and the computational cost are reduced Use of Householder transformations Based on BLAS-3 operations QR Factorization of Banded Matrices 10 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Householder transformations) Householder transformations make the resultant matrix R becomes an upper triangular band with bandwidth ≤ ku + kl ku ku ku kl kl kl m = m n QR Factorization of Banded Matrices 11 n PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Storage scheme) Matrix A is stored following a variant of the compact storage scheme employed by BLAS and LAPACK routines In this variant, kl extra upper diagonals are stored, initially padded with zeros α 00 α 01 α 10 α 11 α 12 α 20 α 21 α 22 α 23 α 31 α 32 α 33 α 34 α 42 α 43 α 44 α 53 α 54 * * * 0 0 0 * * 0 0 0 0 * α 01 α 12 α 23 α 34 α 45 α 00 α 11 α 22 α 33 α 44 α 55 α 45 α 10 α 21 α 32 α 43 α 54 * α 55 α 20 α 31 α 42 α 53 * * m = n = 6, kl = 2 and ku = 1 This scheme is analogous to the one employed in LAPACK routines for the LU factorization QR Factorization of Banded Matrices 12 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Storage scheme) Following the LAPACK-Style, matrices the reflectors and R replace matrix A The extra kl upper-diagonals padded with zeros are necessary to store all the elements of R * * * 0 0 0 * * * * * 0 0 0 0 * * ρ * α 01 α 12 α 23 α 34 α 45 * ρ 01 α 00 ρ 03 ρ 14 ρ 25 02 ρ 13 ρ 24 ρ 35 ρ 12 ρ 23 ρ 34 ρ 45 55 α 11 α 22 α 33 α 44 α 55 ρ 00 ρ 11 ρ 22 ρ 33 ρ 44 ρ α 10 α 21 α 32 α 43 α 54 * µ 10 µ 21 µ 32 µ 43 µ 54 * α 20 α 31 α 42 α 53 * * µ 20 µ 31 µ 42 µ 53 * QR Factorization of Banded Matrices 13 * PARA’08 R Q Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) gbbqrf implements an iterative algorithm with a 5 × 5 partitioning n A 00 nb m A 22 kl nb A 44 nb ku+kl nb The operations performed in each iteration are The QR factorization of nb columns of A is computed The next ku + kl columns of A are updated QR Factorization of Banded Matrices 14 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) gbbqrf implements an iterative algorithm with a 5 × 5 partitioning n A 00 nb m A 22 kl nb A 44 nb ku+kl nb Blocks in gray have been computed Blocks in blue form the active region (updated during the iteration) Blocks in green have not been accessed QR Factorization of Banded Matrices 15 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) During the current iteration, only blocks in the active region will be accessed n nb A 11 m ku kl A 12 A 21 nb kl A 22 nb ku+kl Note that the lower triangular part of the (nb × nb) block positioned at the bottom of A21 is out of the band QR Factorization of Banded Matrices 16 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) Step 1: QR factorization of blocks A11 and A21 Using the new non-blocked BLAS-2 routine gbbqr2, the QR factorization of blocks A11 and A21 is obtained A 11 A 12 A 21 A 22 (gbbqr2) QR Factorization of Banded Matrices A11 A21 = qr 17 A11 A21 , r , PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) Step 2.1: Apply the Householder transformations to blocks A12 and A22 Copy A11 and A21 into a rectangular workspace where the whole block A21 is stored A 11 U A 21 0 U = A11 A21 , stril(U ) = 0, QR Factorization of Banded Matrices 18 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) Step 2.2: Apply the Householder transformations to blocks A12 and A22 Compute of the Householder transformation and update of A12 and A22 A 11 A 12 A 21 A 22 (larft) (larfb) T = larft(U, r), A12 A12 T = (I − U T U ) · , A22 A22 QR Factorization of Banded Matrices 19 PARA’08 Routine gbbqrf Routines for the QR factorization of general band matrices gbbqrf (Algorithm) The lower triangular part of the nb × nb block at the bottom of A21 contains only null elements No attemption to exploit this structure is done because: As b kl no huge savings are expected Dealing with this structure would increment the number of invocations to BLAS routines Step 3: Move boundaries The active region moves along the diagonal nb rows and columns QR Factorization of Banded Matrices 20 PARA’08 Experimental results and conclusions Experimental Results Platform Architecture ra caton Intel Xeon Intel Itanium2 Platform Compiler ra gcc 3.3.5 caton icc 9.0 Frequency (GHz) 2.4 1.5 Optimization flags -O3 -O3 L2 cache (KBytes) 512 256 L3 cache (MBytes) – 4 BLAS Goto BLAS 1.15 MKL 8.1 Goto BLAS 1.15 MKL 8.0 RAM (GBytes) 1 4 Operating System Linux 2.4.27 Linux 2.4.21 Results correspond to matrices of order n=5,000 with different bandwidths (from 10 to 1250) and block sizes (from 1 to 100) QR Factorization of Banded Matrices 21 PARA’08 Experimental results and conclusions Experimental results ra (1 proc.) QR factorization of Band Matrices on RA (1 processor) DGBBQR2 (GotoBLAS) DGBBQRF (GotoBLAS) DGBBQR2 (MKL) DGBBQRF (MKL) 4.5 4 GFLOPs 3.5 3 2.5 2 1.5 1 0.5 0 0 200 QR Factorization of Banded Matrices 400 600 800 Matrix band 22 1000 PARA’08 1200 Experimental results and conclusions Experimental results ra (2 proc.) QR factorization of Band Matrices on RA (2 processors) DGBBQR2 (GotoBLAS) DGBBQRF (GotoBLAS) DGBBQR2 (MKL) DGBBQRF (MKL) 9 8 GFLOPs 7 6 5 4 3 2 1 0 0 200 QR Factorization of Banded Matrices 400 600 800 Matrix band 23 1000 PARA’08 1200 Experimental results and conclusions Experimental results caton (1 proc.) QR factorization of Band Matrices on CATON (1 processor) 6 DGBBQR2 (GotoBLAS) DGBBQRF (GotoBLAS) DGBBQR2 (MKL) DGBBQRF (MKL) 5 GFLOPs 4 3 2 1 0 0 200 QR Factorization of Banded Matrices 400 600 800 Matrix band 24 1000 PARA’08 1200 Experimental results and conclusions Experimental results caton (4 proc.) QR factorization of Band Matrices on CATON (4 processors) DGBBQR2 (GotoBLAS) DGBBQRF (GotoBLAS) DGBBQR2 (MKL) DGBBQRF (MKL) GFLOPs 20 15 10 5 0 0 200 QR Factorization of Banded Matrices 400 600 800 Matrix band 25 1000 PARA’08 1200 Experimental results and conclusions Conclusions Two new LAPACK-Style routines for the computation of the QR factorization of band matrices have been presented Both routines exploit the matrix structure to reduce the memory requirements and the number of computations The new codes have been evaluated in two current platforms, showing the higher performance of the blocked routine Results show that the operation presents a limited degree of parallelism QR Factorization of Banded Matrices 26 PARA’08 Experimental results and conclusions Thanks. QR Factorization of Banded Matrices 27 PARA’08