BLAS Specification Revisited - William Paterson University

advertisement
BLAS Specification Revisited
Linda Kaufman
William Paterson University
Wish list
• Simultaneous Orthogonal Transformations
– Generation
– Application
•
•
•
•
Simultaneous elementary transformations
Simultaneous gemv with different matrices
Simultaneous Householders
Symmetric rank k update that manufacturers
might want not balk at implementing
Simultaneous Orthogonal transformations
A. QZhes- reduction of A to Hessenberg form and B to triangular form
B. QR iteration for symmetric tridiagonal eigenvalues
C. Reduction of narrow banded matrix to tridiagonal form in order to solve an
eigenvalue problem Ax=λx
A. Approximation of 1 dimensional pde with a Rayleigh Ritz Galerkin
approach using cubic or quintic B splines
B. Periodic boundary conditions of 1 dimensional pde with finite element
C. Coupling several problems with 1 dimensional pde as in designing
optical fibers
D. Ax = λB x, A and B symmetric, B positive definite, A and B banded
A. B tridiagonal mass matrix in finite element approximation of 1 D
problem,
E. Banded singular value decomposition-S. Rajamanickam thesis under Tim
Davis
F. To prevent fill in when pivoting in Symmetric indefinite banded
factorization
A. Optimization problem with negative curvature
QZ phase 1 reduction of A to upper Hessenberg B to
triangular for solving nonsymmetric Ax = λB x
Assume we have used orthogonal transformations to reduce B to triangular
and have applied them to A. We now have
A
B
A
→
In LAPACK get rid of
elements of A in the order of
B
A
B
→
But there are independent operations that can
be done simultaneously using the ordering
In general 2n simultaneous
operations. B. Kagstrom &
Dackland, 1999
One can look at these as
blocks or individual elements
IMTQL1- finding the eigenvalues of Ax=λx
for A symmetric tridiagonal
1 Compute shift μ, form B =A- μI
2. Find Q1 that annihilates B21 and form Q1B Q1T
→
→
3. Chase unwanted element down matrix
Parallel QR- keep on determining shift and do simultaneous chases.
Van-de Gijn(1993)- 3 times as many chases but Kaufman showed can
get factor of 2 reduction (1994)
→
Diagram of annihilation using Given’s rotations for
banded eigenvalue
Eventually every kth row have element that could be annihilated for r
2k-1 diagonals (Kaufman-1984) implemented in Lapack(Christian Bischof, Bruno Lang, and XiaobaiSun -SBR toolbox 2000)
Saw reduction by factor of 5 for narrow bands on Cray
Would like to have been able to generate Givens rotations
simultaneously- killed by manufacturers
Diagrams of parallel Crawford for reducing symmetric
tridiagonal Ax = λB x, to standard eigenvalue problem
A
A
B
→
→
→
B
Simultaneous Stabilized elementary transformations
• Simultaneously factoring banded linear systems
• Tinvit- Given several eigenvalues (λ1, λ2, … λk) of a
tridiagonal system determine their eigenvectors. Solve A- λ1
I, A- λ2 I, … A- λk I simultaneously
• Shot gun bisection
• Two dimensional separable elliptic PDE’s solved using
marching(Bank), Rayleigh Ritz Galerkin(Kaufman and
Warner), or Collocation(Fairweather).
• Matrix has form A =
(Tensor Product where
S’s and M’s are banded. )
• Queueing Problems leading to separable matrices(Kaufman,
1983)
• Symmetric Indefinite using stabilized elementary to prevent fill-in
Separable matrix in solving 2 dimensional
Matrix hasthe form A =
(Tensor Product )
Where Sx and Mx are m x m banded and Sy and My are n xn
banded, Mx , My, and Sy are symmetric and My is positive
definite
Need to Solve Av = f where A has mn rows and columns
Algorithm:
(1) Find D and Z such that ZT My Z = I and ZT Sy Z = D
Generalize eigenvalue problem
(2) Compute g =
f
(3) Solve the n banded systems given by
(4)Compute v =
h=g
h
Steps 2 and 4 just use Matrix-Matrix multiply and are fast.
Sometimes Z and D are known apriori like for Poisson’s
equation with uniform grid.
One can reduce Step(3) by factor of 4 by using simultaneous
axpy’s.
Queueing Problems
Often working with singular matrices A with the form
Where the B’s are not symmetric but might have symmetric zero
structure, there exists matrices Q and Z such QBjZ is diagonal.
Usually all the B’s are the identity matrix except for one. The
variable q denotes the number of queues. The problems get large
quickly. If one has 10 waiting spaces in q queues the number of
variables is 0(10q)
As in the pde case, one can reduce the problem using generalized
eigendecompositions to diagonal blocks containing tridiagonal
matrices, which here could be unsymmetric-
Symmetric rank k updates
Originally updates could have the form A= A + α XXT
Could not accommodate the quasi- Newton BFGS update of the
approximate Hessian in optimization
LAPACK did not use it and instead treated the lower
triangular part as shown where the rectangles used GEMM
and small lines that used DGEMV
Could use
simultaneous
DGEMV here
Workarounds for symmetric updates
For symmetric indefinite linear systems, the
reduction uses either A=A+ YDYT where D is either 1
x 1 or 2 x 2.
For block D would be a sequence of 1 x 1 or 2 x 2.
2002 Blas suggest A=A + YJYT where J was
tridiagonal- I don’t know of any implementations.
Perhaps better to have A=A+XYT but only update
triangular part of A
At 2011 Householder conference Jennifer Scott suggested
adding extra space so that one can use GEMM’s throughout
Symmetric banded factorization
For symmetric banded matrices, Kaufman’s retraction
algorithm requires (2m+1)n space even though the original
matrix can be specified using (m+1)n.
The extra space is to store complications for 2 x 2.
Thus one can imagine the image below and the stuff above
diagonal is just scratch space with 1 x 1’s.
Bunch Kaufman for symmetric indefinite non banded
Partition A as
Where D is either 1 x 1 or 2 x 2
Reset B to B’ = B – Y D-1 YT
when deleting Y’s
Choice of dimension of D
depends on magnitudes of a11
versus other elements
Continue with B’ and partition
it as above
Bandwidth spread with Bunch-Kaufman on
banded matrix because of pivoting for stability
Banded algorithm based on B-K
1) Let c = |ar 1 | = max in abs. in col. 1
2) If |a11 | >= w c, use a 1 x 1 pivot. Here w is a scalar to balance
element growth, like 1/3
Else
3)Let f= max element in abs. in column r
4) If w c*c <= |a11 | f, use a 1 x 1 pivot
Else
5)interchange the rth and second rows and
columns of A
6) Do a sequence of orthogonal or elementary
transformation to prevent fill-in while performing a 2 by 2 pivot
7) Perform a 2 x 2 pivot
Never pivot with 1 x 1
Pivoting for stability can ruin bandwidth
Worst case r =m, what happens in pivoting
x x x x x x
x x x x x x x a b c d
x x x x x x x x
x x x x x x x x x
x x x x x x x x x x
x x x x x x x
x x x x x x x x x x x
a x x x x x x x x x
b
x x x x x x x x
c
x x x x x x x
d
x x x x x x
x x x x x x
Partition A as
Reset B’ = B – Y D-1 YT= YZ
Let Z = D-1 YT
x x x x x x x x x
p q r s x x x x x .
Then B’ looks like
x x x x
x x x x
x x x x
x x x x
x x x x
x x x as
bp x x bs
cp cq x cs
dp dq dr ds
x
x
x
x
x
x
x
x
x
x
x bp cp dp
x x cq dq
x x x dr
as bs cs ds
x x x x x
x x x x x
x x x x x
x x x x x
x x x x x
x x x x x
which by the
x x x x x x 0 0 0
x x x x x x x cq dq
elimination of 1 x x x x x x x x dr
x x x x x x bt ct dt
element
x x x x x x x x x x
x x x x x x x x x x
becomes
0 x x bt x x x x x x
0 cq x ct x x x x x x
0 dq dr dt x x x x x x
x x x x x x
Continuing in this way gets us back to band form
In practice- Pretreat Z to make zeroes so that rank 2
change does not produce zeroes outside the band
Partition A as
where D is 2 X 2
Let Z = D-1 YT
Reset B’ = QT(B – Q Z)Q= QTB Q-HG, Q from fixup
Where H=QTY and G =Z Q
Construct Q so that G= Z Q looks like
x x x

0 0 0
x
x
x
x
x

x
Use a sequence of Givens transformations or stabilized
planar elementary transformations to form Q so banded
structure of QTBQ is not upset.
Because H= QY has form
HG will not extend
beyond band
 x x


 x x
 x x
 0 x


 0 x


 0 x
Comparison with Lapack on positive definite
n=2000
posdef
mine-no
block
dgbtrf
dgbtf2
dpbtrf
dpbtf2
mine,nb=16
100
0.223
0.266
0.389
0.17
0.218
0.13
200
0.782
0.873
1.46
0.67
0.773
0.382
300
1.64
1.65
3.12
1.49
1.62
0.834
400
2.78
2.41
5.28
2.59
2.76
1.25
500
4.2
3.78
8.14
3.9
4.13
1.76
600
5.61
4.81
11.31
5.62
5.63
2.343
700
7.35
6.3
15.36
7.88
7.33
2.977
18
16
14
mine-dsyr
12
dgbtrf
10
dgbtf2
8
dpbtrf
6
dpbtf2
4
mine,nb=16
2
0
0
200
400
600
800
Block version on random matrices-n=2000
nonblock- block
retraction retraction dgbtf2
dgbtrf
2x2
100
0.327
0.315
0.682
0.451
200
0.986
0.81
2.58
1.3
300
2.08
1.79
5.22
2.58
400
3.37
2.6
8.92
3.93
500
5.45
4.38
13.97
5.85
600
7.19
5.61
24.84
7.64
700
10.23
8.46
37.03
9.85
m
maxr ave
444
49
315
98
365
141
327
201
370
231
344
299
421
293
Only blocking for 1 x 1, stop accumulating when a 2 x 2 is reached.
Elementary transformations for “pretreating” Z. with 2 x 2
6
Time as a function
of number of
planar
transformations
m=400, n=2000.
5
4
retraction
3
dgbtrf
2
1
0
0
50000
100000
150000
200000
250000
Possible ways to speedup retraction for consecutive 2 x 2s:
Each column involves 2 full daxpys plus orthogonal transformations or
cut up daxpys to the same column
(1) marching(1) work on column i+j when elimination starts at i
(2) Work on column i+j-1 with elimination starting at i+2
(3) Work on column i+j-2 with elimination starting at i+4
Requires simultaneous dgemvs or daxpys
(2) 2 sets of transformations
(1) Cut up daxpy or orthogonal transformation applied to i+j
stemming from i
(2) Dgemv involving 4 columns (i,i+1,i+2,i+3) applied to i+j
(3) Cut up daxpy applied to i+j stemming from i+2
Back to requesting simultaneous orthogonal or
elementary transformations
Van de Geign to the rescue
Download