LAPACK-Style Codes for the QR Factorization of Banded

advertisement
LAPACK-Style Codes for the QR Factorization of
Banded Matrices
Alfredo Remón-Gómez, Enrique S. Quintana-Ortı́,
Gregorio Quintana-Ortı́
{remon,quintana,gquintan}@icc.uji.es
Universidad Jaume I de Castellón (Spain)
9th International Workshop on State-of-the-Art in Scientific
and Parallel Computing
QR Factorization of Banded Matrices
1
PARA’08
Outline
1
The QR factorization
Householder transformations for the QR factorization
2
Routines for the QR factorization of general band matrices
Routine gbbqrf
3
Experimental results and conclusions
QR Factorization of Banded Matrices
2
PARA’08
The QR factorization
QR factorization
QR factorization for general matrices
Given A ∈ Rm×n , with m ≥ n, its QR factorization is given by
A = Q · R,
Where Q ∈ Rm×m is orthogonal and R ∈ Rm×n is singular
QR Factorization of Banded Matrices
3
PARA’08
The QR factorization
QR factorization for general band matrices
QR factorization for general band matrices
Given A ∈ Rm×n , with upper and lower bandwidth ku and kl
respectively, then if QR factorization with Householder
transformations is computed carefully,
Q is lower triangular with lower bandwidth kl
R is upper triangular with upper bandwidth ku + kl
QR Factorization of Banded Matrices
4
PARA’08
Householder transformations for the QR factorization
The QR factorization
Householder transformations
The Householder transformations can be employed to annihilate
elements from a vector or matrix
Householder transformation
Consider vector x = χx10



Then, I − βuuT x = 

χ¯0
0
..
.

 
 
=
 
0
With β =
2
,
uT u
u=
1
x1 /v0

−sign(χ0 )kxk2
0
..
.





0
and v0 = χ0 + sign(x0 )kxk2
Where I − βuuT x is the Householder transformation
QR Factorization of Banded Matrices
5
PARA’08
Householder transformations for the QR factorization
The QR factorization
Householder transformations for the QR factorization
Objective
Transform the general band matrix A into an upper triangular
band matrix
ku
ku
kl
kl
0
After j steps
0
I − βu1 uT1
kl
j
0
QR Factorization of Banded Matrices
I − βu2 uT2 ... I − βuj uTj · A
6
PARA’08
Householder transformations for the QR factorization
The QR factorization
Householder transformations for the QR factorization
Objective
Transform the general band matrix A into an upper triangular
band matrix
ku
kl
ku
kl
kl
0
After k steps
0
0
k
I − βu1 uT1 · · · I − βuj uTj · · · I − βuk ukT · A
QR Factorization of Banded Matrices
7
PARA’08
Householder transformations for the QR factorization
The QR factorization
Householder transformations for the QR factorization
Objective
Transform the general band matrix A into an upper triangular
band matrix
ku
ku
kl
0
kl
kl
0
After n steps
0
0
I − βu1 uT1
QR Factorization of Banded Matrices
I − βu2 uT2 ... I − βun uTn · A
8
PARA’08
Routines for the QR factorization of general band matrices
New routines for the QR factorization of general band
matrices
New routines implemented
gbbqr2
Unblocked routine
Based on BLAS-2 kernels
gbbqrf
Blocked routine
Based on BLAS-3 kernels
We will focus on the blocked routine gbbqrf
QR Factorization of Banded Matrices
9
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
Routine gbbqrf
Functionality
Computes the QR factorization of a general band matrix
Main characteristics
Exploits the matrix structure so the memory requirements and
the computational cost are reduced
Use of Householder transformations
Based on BLAS-3 operations
QR Factorization of Banded Matrices
10
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Householder transformations)
Householder transformations make the resultant matrix R becomes
an upper triangular band with bandwidth ≤ ku + kl
ku
ku
ku
kl
kl
kl
m
=
m
n
QR Factorization of Banded Matrices
11
n
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Storage scheme)
Matrix A is stored following a variant of the compact storage
scheme employed by BLAS and LAPACK routines
In this variant, kl extra upper diagonals are stored, initially
padded with zeros
α 00
α 01
α 10
α 11
α 12
α 20
α 21
α 22
α 23
α 31
α 32
α 33
α 34
α 42
α 43
α 44
α 53
α 54
*
*
*
0
0
0
*
*
0
0
0
0
*
α 01
α 12
α 23
α 34
α 45
α 00
α 11
α 22
α 33
α 44
α 55
α 45
α 10 α 21
α 32
α 43
α 54
*
α 55
α 20 α 31
α 42
α 53
*
*
m = n = 6, kl = 2 and ku = 1
This scheme is analogous to the one employed in LAPACK
routines for the LU factorization
QR Factorization of Banded Matrices
12
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Storage scheme)
Following the LAPACK-Style, matrices the reflectors and R
replace matrix A
The extra kl upper-diagonals padded with zeros are necessary
to store all the elements of R
*
*
*
0
0
0
*
*
*
*
*
0
0
0
0
*
*
ρ
*
α 01
α 12
α 23
α 34
α 45
*
ρ
01
α 00
ρ
03
ρ
14
ρ
25
02
ρ
13
ρ
24
ρ
35
ρ
12
ρ
23
ρ
34
ρ
45
55
α 11
α 22
α 33
α 44
α 55
ρ
00
ρ
11
ρ
22
ρ
33
ρ
44
ρ
α 10 α 21
α 32
α 43
α 54
*
µ
10
µ
21
µ
32
µ
43
µ
54
*
α 20 α 31
α 42
α 53
*
*
µ
20
µ
31
µ
42
µ
53
*
QR Factorization of Banded Matrices
13
*
PARA’08
R
Q
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
gbbqrf implements an iterative algorithm with a 5 × 5
partitioning
n
A 00
nb
m
A 22
kl
nb
A 44
nb
ku+kl
nb
The operations performed in each iteration are
The QR factorization of nb columns of A is computed
The next ku + kl columns of A are updated
QR Factorization of Banded Matrices
14
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
gbbqrf implements an iterative algorithm with a 5 × 5
partitioning
n
A 00
nb
m
A 22
kl
nb
A 44
nb
ku+kl
nb
Blocks in gray have been computed
Blocks in blue form the active region (updated during the
iteration)
Blocks in green have not been accessed
QR Factorization of Banded Matrices
15
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
During the current iteration, only blocks in the active region will
be accessed
n
nb
A 11
m
ku
kl
A 12
A 21
nb
kl
A 22
nb
ku+kl
Note that the lower triangular part of the (nb × nb) block
positioned at the bottom of A21 is out of the band
QR Factorization of Banded Matrices
16
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
Step 1: QR factorization of blocks A11 and A21
Using the new non-blocked BLAS-2 routine gbbqr2, the QR
factorization of blocks A11 and A21 is obtained
A 11
A 12
A 21
A 22
(gbbqr2)
QR Factorization of Banded Matrices
A11
A21
= qr
17
A11
A21
, r ,
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
Step 2.1: Apply the Householder transformations to blocks A12
and A22
Copy A11 and A21 into a rectangular workspace where the
whole block A21 is stored
A 11
U
A 21
0
U
=
A11
A21
,
stril(U ) = 0,
QR Factorization of Banded Matrices
18
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
Step 2.2: Apply the Householder transformations to blocks A12
and A22
Compute of the Householder transformation and update of
A12 and A22
A 11
A 12
A 21
A 22
(larft)
(larfb)
T
= larft(U, r),
A12
A12
T
= (I − U T U ) ·
,
A22
A22
QR Factorization of Banded Matrices
19
PARA’08
Routine gbbqrf
Routines for the QR factorization of general band matrices
gbbqrf (Algorithm)
The lower triangular part of the nb × nb block at the bottom
of A21 contains only null elements
No attemption to exploit this structure is done because:
As b kl no huge savings are expected Dealing with this
structure would increment the number of invocations to BLAS
routines
Step 3: Move boundaries
The active region moves along the diagonal nb rows and
columns
QR Factorization of Banded Matrices
20
PARA’08
Experimental results and conclusions
Experimental Results
Platform
Architecture
ra
caton
Intel Xeon
Intel Itanium2
Platform
Compiler
ra
gcc 3.3.5
caton
icc 9.0
Frequency
(GHz)
2.4
1.5
Optimization
flags
-O3
-O3
L2 cache
(KBytes)
512
256
L3 cache
(MBytes)
–
4
BLAS
Goto BLAS 1.15
MKL 8.1
Goto BLAS 1.15
MKL 8.0
RAM
(GBytes)
1
4
Operating
System
Linux 2.4.27
Linux 2.4.21
Results correspond to matrices of order n=5,000 with different
bandwidths (from 10 to 1250) and block sizes (from 1 to 100)
QR Factorization of Banded Matrices
21
PARA’08
Experimental results and conclusions
Experimental results ra (1 proc.)
QR factorization of Band Matrices on RA (1 processor)
DGBBQR2 (GotoBLAS)
DGBBQRF (GotoBLAS)
DGBBQR2 (MKL)
DGBBQRF (MKL)
4.5
4
GFLOPs
3.5
3
2.5
2
1.5
1
0.5
0
0
200
QR Factorization of Banded Matrices
400
600
800
Matrix band
22
1000
PARA’08
1200
Experimental results and conclusions
Experimental results ra (2 proc.)
QR factorization of Band Matrices on RA (2 processors)
DGBBQR2 (GotoBLAS)
DGBBQRF (GotoBLAS)
DGBBQR2 (MKL)
DGBBQRF (MKL)
9
8
GFLOPs
7
6
5
4
3
2
1
0
0
200
QR Factorization of Banded Matrices
400
600
800
Matrix band
23
1000
PARA’08
1200
Experimental results and conclusions
Experimental results caton (1 proc.)
QR factorization of Band Matrices on CATON (1 processor)
6
DGBBQR2 (GotoBLAS)
DGBBQRF (GotoBLAS)
DGBBQR2 (MKL)
DGBBQRF (MKL)
5
GFLOPs
4
3
2
1
0
0
200
QR Factorization of Banded Matrices
400
600
800
Matrix band
24
1000
PARA’08
1200
Experimental results and conclusions
Experimental results caton (4 proc.)
QR factorization of Band Matrices on CATON (4 processors)
DGBBQR2 (GotoBLAS)
DGBBQRF (GotoBLAS)
DGBBQR2 (MKL)
DGBBQRF (MKL)
GFLOPs
20
15
10
5
0
0
200
QR Factorization of Banded Matrices
400
600
800
Matrix band
25
1000
PARA’08
1200
Experimental results and conclusions
Conclusions
Two new LAPACK-Style routines for the computation of the
QR factorization of band matrices have been presented
Both routines exploit the matrix structure to reduce the
memory requirements and the number of computations
The new codes have been evaluated in two current platforms,
showing the higher performance of the blocked routine
Results show that the operation presents a limited degree of
parallelism
QR Factorization of Banded Matrices
26
PARA’08
Experimental results and conclusions
Thanks.
QR Factorization of Banded Matrices
27
PARA’08
Download