The Future of LAPACK and ScaLAPACK
www.netlib.org/lapack-dev
Jim Demmel
UC Berkeley
28 Sept 2005
Outline
• Motivation
• Participants
• Goals
  1. Better numerics (faster and more accurate algorithms)
  2. Expand contents (more functions, more parallel implementations)
  3. Automate performance tuning
  4. Improve ease of use
  5. Better maintenance and support
  6. Increase community involvement (this means you!)
• Questions for the audience
• Selected highlights
• Concluding poem
Motivation
• LAPACK and ScaLAPACK are widely used
– Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks,
NAG, NEC, SGI, …
– >52M web hits @ Netlib (incl. CLAPACK, LAPACK95)
• Many ways to improve them, based on
  – Our own algorithmic research
  – Enthusiastic participation of the research community
  – An ongoing user/vendor survey (URL below)
  – Opportunities and demands of new architectures and programming languages
• New releases planned (NSF support)
• Your feedback desired
– www.netlib.org/lapack-dev
Participants
• UC Berkeley:
– Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li,
Osni Marques, Christof Voemel, David Bindel, Yozo Hida, Jason
Riedy, Jianlin Xia, Jiang Zhu, undergrads…
• U Tennessee, Knoxville
– Jack Dongarra, Victor Eijkhout, Julien Langou, Julie Langou, Piotr
Luszczek, Stan Tomov
• Other Academic Institutions
– UT Austin, UC Davis, Florida IT, U Kansas, U Maryland,
North Carolina SU, San Jose SU, UC Santa Barbara
– TU Berlin, FU Hagen, U Madrid, U Manchester, U Umeå,
U Wuppertal, U Zagreb
• Research Institutions
– CERFACS, LBL
• Industrial Partners
– Cray, HP, Intel, MathWorks, NAG, SGI
Goal 1 – Better Numerics
• Fastest algorithm providing “standard” backward stability
  – MRRR algorithm for symmetric eigenproblem / SVD: Parlett / Dhillon / Voemel / Marques / Willems
  – Up to 10x faster HQR: Byers / Mathias / Braman
  – Extensions to QZ: Kågström / Kressner
  – Faster Hessenberg, tridiagonal, bidiagonal reductions: van de Geijn, Bischof / Lang, Howell / Fulton
  – Recursive blocked layouts for packed formats: Gustavson / Kågström / Elmroth / Jonsson
• New: Most accurate algorithm providing “standard” speed
  – Iterative refinement for Ax=b and least squares
    • Assumes availability of the Extra Precise BLAS (Li / Hida / …): www.netlib.org/blas/blast-forum/
  – Jacobi SVD: Drmač / Veselić
• Condition estimates for (almost) everything (ongoing)
• What is not fast or accurate enough?
What goes into Sca/LAPACK?
For all linear algebra problems,
for all matrix structures,
for all data types,
for all architectures and networks,
for all programming interfaces:
produce the best algorithm(s) w.r.t. performance and accuracy
(including condition estimates, etc.)
Need to automate and prioritize!
Goal 2 – Expanded Content
• Make the content of ScaLAPACK mirror LAPACK as much as possible
  – Full automation would be nice, but is not yet robust or general enough
    • Telescoping languages, Bernoulli, Rose, FLAME, …
• New functions (examples)
  – Updating / downdating of factorizations: Stewart, Langou
  – More generalized SVDs: Bai / Wang
  – More generalized Sylvester / Lyapunov equations: Kågström, Jonsson, Granat
  – Structured eigenproblems
    • O(n²) version of roots(p): Gu, Chandrasekaran, Zhu et al.
    • Selected matrix polynomials: Mehrmann
• How should we prioritize missing functions?
Goal 3 – Automate Performance Tuning
• Not just the BLAS
• 1300 calls to ILAENV() to get block sizes, etc. (see the sketch after this list)
  – Never been systematically tuned
• Extend the automatic tuning techniques of ATLAS, etc. to these other parameters
  – Automation is important as architectures evolve
• Convert ScaLAPACK data layouts on the fly
• How important is peak performance?
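A minimal sketch (not from the slides) of how a tuning parameter is obtained today: each LAPACK routine asks ILAENV for its block size, so automatic tuning largely amounts to generating better answers for calls like this one.

! Illustrative sketch: how a LAPACK routine picks its blocking parameter.
! DGETRF, for example, asks ILAENV for a block size NB; tuning today means
! editing ILAENV by hand, which is what automated tuning would replace.
program query_block_size
  implicit none
  integer :: m, n, nb
  integer, external :: ilaenv
  m = 1000
  n = 1000
  ! ISPEC = 1 requests the optimal block size for the named routine
  nb = ilaenv(1, 'DGETRF', ' ', m, n, -1, -1)
  print *, 'Block size chosen for DGETRF:', nb
end program query_block_size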
Goal 4: Improved Ease of Use
• Which do you prefer?

  A\B

  CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

  CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF,
                IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX, RCOND,
                FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )
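For a concrete picture of the middle ground, here is a hedged sketch of the kind of call a Fortran 95 wrapper can offer; it is modeled on the existing serial LAPACK95 interface (module F95_LAPACK, generic LA_GESV), and the suggestion that a ScaLAPACK analogue could hide IA, JA and the descriptors the same way is an assumption, not an existing routine.

! Hedged illustration in the style of the serial LAPACK95 wrappers.
program easy_solve
  use f95_lapack, only: la_gesv     ! module provided by LAPACK95
  implicit none
  double precision :: a(100,100), b(100)
  integer :: info
  call random_number(a)
  call random_number(b)
  ! Solves A*x = b; the solution overwrites b, pivoting and INFO handling are hidden
  call la_gesv(a, b, info=info)
end program easy_solve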
Goal 4: Improved Ease of Use, Software Engineering (1)
• Development versus research
  – Development: practical approach to produce useful code
  – Research: if we could start over and do it right, how would we?
• Life after F77?
  – Fortran 95, C, C++, Java, Matlab, Python, …
• Easy interfaces vs. access to details
  – No universal agreement across systems on the “easiest interface”
  – Leave that decision to higher-level packages
  – Keep expert driver / simple driver / computational routines
• Conclusion: a subset of F95 for the core + wrappers for the drivers
  – What subset?
    • Recursion, for new data structures
    • Modules, to produce multiple-precision versions
    • Environmental enquiries, to replace xLAMCH
  – Wrappers for Fortran 95, Java, Matlab, Python, … even for CLAPACK
    • Automatic memory allocation of workspace (sketch after this list)
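A hedged sketch of the “automatic memory allocation of workspace” idea, using the standard LWORK = -1 workspace query of DSYEV; this illustrates the wrapper style, not the project's actual code.

! Hedged sketch: an F95-style wrapper that hides DSYEV's WORK/LWORK
! bookkeeping behind the standard workspace query.
subroutine syev_simple(a, w, info)
  implicit none
  double precision, intent(inout) :: a(:,:)   ! on exit: eigenvectors
  double precision, intent(out)   :: w(:)     ! eigenvalues
  integer, intent(out)            :: info
  double precision, allocatable   :: work(:)
  double precision                :: wkopt(1)
  integer                         :: n, lwork
  external dsyev
  n = size(a, 1)
  ! Workspace query: with LWORK = -1 the optimal size comes back in WKOPT(1)
  call dsyev('V', 'U', n, a, n, w, wkopt, -1, info)
  lwork = int(wkopt(1))
  allocate(work(lwork))
  call dsyev('V', 'U', n, a, n, w, work, lwork, info)
  deallocate(work)
end subroutine syev_simple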
Goal 4: Improved Ease of Use, Software Engineering (2)
• Why not full F95 for the core?
  – Would make interfacing to other languages and packages harder
  – Some users want control over memory allocation
• Why not C for the core?
  – High cost/benefit ratio for a full rewrite
  – Performance
• Automation would be nice
  – Use Babel/SIDL to produce native-looking interfaces when possible
Precisions beyond double (1)
• Range of designs possible
  – Just run in quad (or some other fixed precision)
  – Support codes like:

      #bits = 32
      Repeat
        #bits = 2 * #bits
        Solve(A, b, x, error_bound, #bits)
      Until error_bound < tol
Precisions beyond double (2)
• Easiest approach – fixed precision
  – Use F95 modules to produce any precision on request (see the sketch after this list)
  – Could use QD, ARPREC, GMP, …
  – Keep the current memory allocation (twiddle GMP…)
• Next easiest approach – maximum precision
  – Build with a maximum allowable precision on request
  – Pass in a precision parameter, up to this amount
• More flexible (and difficult) approach
  – Choose any precision at run time
  – Dynamically allocate all variables
• Most aggressive approach
  – New algorithms that minimize the work needed to get the desired precision
• What do users want?
  – Compatibility with symbolic manipulation systems?
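A minimal sketch of the “fixed precision via F95 modules” approach above: one module defines the working-precision kind and every routine is written against it, so a new precision means editing one module and recompiling. The module and routine names below are illustrative only; QD, ARPREC or GMP would substitute their own derived type for the native kind.

! Illustrative sketch of precision selection through an F95 module.
module precision_mod
  implicit none
  ! Ask for ~30 decimal digits; yields a quad kind where one exists
  integer, parameter :: wp = selected_real_kind(30)
end module precision_mod

module axpy_mod
  use precision_mod, only: wp
  implicit none
contains
  subroutine axpy(alpha, x, y)    ! y := alpha*x + y, carried out in precision WP
    real(wp), intent(in)    :: alpha, x(:)
    real(wp), intent(inout) :: y(:)
    y = alpha*x + y
  end subroutine axpy
end module axpy_mod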
Goal 4: Improved Ease of Use, Software Engineering (3)
• Research issues
  – May or may not impact development
• How do we map multiple software layers to emerging architectural layers?
• Are emerging HPCS languages better?
• How much can we automate? Do we keep having to write Gaussian elimination over and over again?
• Statistical modeling to limit performance-tuning costs and improve use of shared clusters
Goal 5: Better Maintenance and Support
• Website for user feedback and requests
• New developer and discussion forums
• URL: www.netlib.org/lapack-dev
  – Includes the NSF proposal
• Version control and bug tracking system
• Automatic build and test environment
• Wide variety of supported platforms
• Cooperation with vendors
• Anything else desired?
Goal 6: Involve the Community
• To help identify priorities
  – More interesting tasks than we are funded to do
  – See www.netlib.org/lapack-dev for the list
• To help identify promising algorithms
  – What have we missed?
• To help do the work
  – Bug reports, provide fixes
  – Again, more tasks than we are funded to do
  – Already happening: thank you!
  – We retain final decisions on content
• Anything else?
Some Highlights
• Putting more of LAPACK into ScaLAPACK
• ScaLAPACK performance on 1D vs. 2D grids
• MRRR “Holy Grail” algorithm for symmetric EVD
• Iterative refinement for Ax=b
• O(n²) polynomial root finder
• Generalized SVD
Missing Drivers in Sca/LAPACK
                                      LAPACK        ScaLAPACK
Linear Equations
  LU                                  xGESV         PxGESV
  Cholesky                            xPOSV         PxPOSV
  LDL^T                               xSYSV         missing
Least Squares (LS)
  QR                                  xGELS         PxGELS
  QR + pivot                          xGELSY        missing driver
  SVD/QR                              xGELSS        missing driver
  SVD/D&C                             xGELSD        missing (intent)
  SVD/MRRR                            missing       missing
  QR + iterative refine.              missing       missing
Generalized LS
  LS + equality constr.               xGGLSE        missing
  Generalized LM                      xGGGLM        missing
  Above + iterative ref.              missing       missing
More missing drivers
                                      LAPACK        ScaLAPACK
Symmetric EVD
  QR / Bisection+Invit                xSYEV / X     PxSYEV / X
  D&C                                 xSYEVD        missing (intent)
  MRRR                                xSYEVR        missing
Nonsymmetric EVD
  Schur form                          xGEES / X     missing driver
  Vectors too                         xGEEV / X     missing driver
SVD
  QR                                  xGESVD        PxGESVD
  D&C                                 xGESDD        missing (intent)
  MRRR                                missing       missing
  Jacobi                              missing       missing
Generalized Symmetric EVD
  QR / Bisection+Invit                xSYGV / X     PxSYGV / X
  D&C                                 xSYGVD        missing (intent)
  MRRR                                missing       missing
Generalized Nonsymmetric EVD
  Schur form                          xGGES / X     missing
  Vectors too                         xGGEV / X     missing
Generalized SVD
  Kogbetliantz                        xGGSVD        missing (intent)
  MRRR                                missing       missing
Missing matrix types in ScaLAPACK
• Symmetric, Hermitian, triangular
– Band, Packed
• Positive Definite
– Packed
• Orthogonal, Unitary
– Packed
Execution time of PDGESV for various grid shapes
[Plot: time in seconds vs. problem size (1000–10000) for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, 6x10]
Speedups for using a 2D processor grid range from 2x to 8x
Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
Optimal grid (6x10) for PDGESV
Comparison between Computation and Redistribution of Data from Linear Grid
[Plot: calculation time vs. redistribution time (seconds, log scale) for problem sizes 1000–10000]
Cost of redistributing the matrix to the optimal layout is small
Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
MRRR Algorithm for eig(tridiagonal) and svd(bidiagonal)
• “Multiple Relatively Robust Representations”
• 1999 Householder Award honorable mention for Dhillon
• O(nk) flops to find k eigenvalues/vectors of an n×n tridiagonal matrix (similar for SVD)
  – Minimum possible!
• Naturally parallelizable
• Accurate
  – Small residuals: ||T x_i – λ_i x_i|| = O(nε)
  – Orthogonal eigenvectors: |x_i^T x_j| = O(nε) for i ≠ j
• Hence the nickname: “Holy Grail”
• Two versions (see the calling sketch below)
  – LAPACK 3.0: large error on “hard” cases
  – Next release: fixed!
• How should we trade off speed and accuracy?
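For concreteness, a hedged sketch of calling the LAPACK 3.0 MRRR routine DSTEGR mentioned above to compute all eigenpairs of a symmetric tridiagonal matrix; the workspace sizes follow the documented minima (18n real, 10n integer), and the wrapper itself is illustrative, not part of the library.

! Hedged sketch: all eigenpairs of a symmetric tridiagonal matrix via
! the LAPACK 3.0 MRRR code DSTEGR ("old Grail").
subroutine all_eigs_mrrr(n, d, e, w, z, info)
  implicit none
  integer, intent(in)             :: n
  double precision, intent(inout) :: d(n), e(n)    ! diagonal, off-diagonal (overwritten)
  double precision, intent(out)   :: w(n), z(n,n)  ! eigenvalues, eigenvectors
  integer, intent(out)            :: info
  integer          :: m, isuppz(2*n), iwork(10*n)
  double precision :: work(18*n)
  external dstegr
  ! JOBZ = 'V': eigenvectors too; RANGE = 'A': all eigenvalues
  call dstegr('V', 'A', n, d, e, 0.0d0, 0.0d0, 1, n, 0.0d0, &
              m, w, z, n, isuppz, work, 18*n, iwork, 10*n, info)
end subroutine all_eigs_mrrr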
Timing of Eigensolvers
[Plots: run times on a 1.2 GHz Athlon, only matrices where time > 0.1 sec]
Accuracy Results (old vs. new Grail)
[Plots: max_i ||T q_i – λ_i q_i|| / (nε) and ||Q Q^T – I|| / (nε)]
Accuracy Results (Grail vs. QR vs. DC)
[Plots: max_i ||T q_i – λ_i q_i|| / (nε) and ||Q Q^T – I|| / (nε)]
More Accurate: Solve Ax=b
[Plot: error of conventional Gaussian elimination vs. extra precise iterative refinement, with ε = n^(1/2) · 2^(-24)]
What’s new?
• Need extra precision (beyond double)
  – Part of the new BLAS standard
  – Cost = O(n²) extra per right-hand side, vs. O(n³) to factor
• Get tiny componentwise bounds too
  – Error in x_i small compared to |x_i|, not just max_j |x_j|
• “Guarantees” based on condition number estimates
  – No bad bounds in 6.2M tests
  – Different condition number for componentwise bounds
  – Traditional iterative refinement can “fail”
  – Only get a “matrix close to singular” message when the answer is wrong?
• Extends to least squares
• Demmel, Kahan, Hida, Riedy, X. Li, Sarkisyan, …
• LAPACK Working Note #165
Can condition estimators lie?
• Yes, but rarely, unless they cost as much as matrix multiply = cost of LU factorization
  – Demmel / Diament / Malajovich (FCM 2001)
• But what if matrix multiply costs O(n²)?
  – Cohn / Umans / Kleinberg (FOCS 2003/5)
New algorithm for roots(p)
• To find the roots of a polynomial p, roots(p) computes eig(C(p)), where C(p) is the companion matrix

              [ -p1  -p2   …  -pd ]
              [  1    0    …   0  ]
     C(p)  =  [  0    1    …   0  ]
              [  …    …    …   …  ]
              [  0    …    1   0  ]

  – Costs O(n³); stable, reliable (baseline sketch below)
• O(n²) alternatives
  – Newton, Jenkins-Traub, Laguerre, …
  – Stable? Reliable?
• New: Exploit the “semiseparable” structure of C(p)
  – Low rank of any submatrix of the upper triangle of C(p) is preserved under QR iteration
  – Complexity drops from O(n³) to O(n²)
• Related work: Gemignani, Bini, Pan, et al.
• Ming Gu, Shiv Chandrasekaran, Jiang Zhu, Jianlin Xia, David Bindel, David Garmire, Jim Demmel
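A hedged sketch of the O(n³) baseline described above (not the new O(n²) algorithm): build the companion matrix C(p) and hand it to LAPACK's DGEEV, which is essentially what roots(p) does today. The routine name and coefficient convention here are illustrative.

! Illustrative baseline: roots of the monic polynomial
!     x^d + p(1)*x^(d-1) + ... + p(d)
! as eigenvalues of its companion matrix.
subroutine roots_via_companion(d, p, wr, wi, info)
  implicit none
  integer, intent(in)           :: d
  double precision, intent(in)  :: p(d)           ! coefficients p1 ... pd
  double precision, intent(out) :: wr(d), wi(d)   ! real/imaginary parts of the roots
  integer, intent(out)          :: info
  double precision :: c(d,d), vdum(1,1), work(4*d)
  integer :: j
  external dgeev
  c = 0.0d0
  c(1,:) = -p                  ! first row holds -p1 ... -pd
  do j = 1, d-1
     c(j+1,j) = 1.0d0          ! subdiagonal of ones
  end do
  ! No eigenvectors requested; eigenvalues of C(p) are the roots of p
  call dgeev('N', 'N', d, c, d, wr, wi, vdum, 1, vdum, 1, work, 4*d, info)
end subroutine roots_via_companion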
Properties of new roots(p)
• First (?) algorithm that
  – Is O(n²)
  – Is backward stable in the matrix sense
  – Is backward stable in the sense that the computed roots are the exact roots of a slightly perturbed input polynomial
• Depends on balancing = scaling the roots by a constant
• Still need to automate the choice of that constant
• Byers, Mathias, Braman
• Tisseur, Higham, Mackey, …
New GSVD Algorithm: Timing Comparisons
[Plots: 1.0 GHz Itanium-2 (2 GB RAM), Intel Math Kernel Library 7.2.1 (incl. BLAS, LAPACK)]
Bai et al., UC Davis
PSVD, CSD on the way
Conclusions
• Lots to do in Dense Linear Algebra
– New numerical algorithms
– Continuing architectural challenges
• Parallelism, performance tuning
• Grant support, but success depends on
contributions from community
Extra Slides
Downa Dating
With apologies to Robert Burns & Nick Higham
Should some equations be forgot
when overdetermīned?
Should some equations be forgot
using hyperbolic sines?
With hyperbolic sines, my dear
with hyperbolic sines,
We’ll hope to get some boundedness
with hyperbolic sines.
New GSVD Algorithm (XGGQSV)
• U^T A Z = Σa ( 0 R ),  V^T B Z = Σb ( 0 R );  A: M×N, B: P×N
• Modified Van Loan’s method
  (1) Pre-processing: reveal the rank of ( A^T ; B^T )^T
  (2) Split QR: reduce two upper triangular matrices into one
  (3) CSD: Cosine-Sine Decomposition
  (4) Post-processing: assemble the resulting matrices
• Workspace = O(max(M, N, P)) + 5L², where L = rank(B)
• Outperforms the current XGGSVD in LAPACK
Profile of SGGQSV
[Plot: 1.0 GHz Itanium-2 (2 GB RAM), Intel Math Kernel Library 7.2.1 (incl. BLAS, LAPACK)]
Related New Routines
• CSD (XORCSD)
  – U^T Q1 Z = Σa,  V^T Q2 Z = Σb
  – Based on Van Loan’s method (1985)
  – Dominated by the cost of the SVD
  – Workspace = O(max(P+M, N))
• PSVD (XGGPSV)
  – U^T A B V = Σ
  – Based on work by Golub, Solna, Van Dooren (2000)
  – Workspace = O(max(M, P, N))
Benchmark Details
• AMD 1.2 GHz Athlon, 2 GB memory, RedHat Linux + Intel compiler
• Compute all eigenvalues and eigenvectors of a symmetric tridiagonal matrix T
• Codes compared:
  – qr: QR iteration from LAPACK: dsteqr
  – dc: Cuppen’s divide & conquer from LAPACK: dstedc
  – gr: new implementation of the MRRR algorithm (“Grail”)
  – ogr: MRRR from LAPACK 3.0: dstegr (“old Grail”)
Timing of Eigensolvers
[Plots: 1.2 GHz Athlon, only matrices where time > 0.1 sec]
More Accurate: Solve Ax=b
• Old idea: use Newton’s method on f(x) = Ax − b
  – On a linear system?
  – Roundoff in Ax − b makes it interesting (“nonlinear”)
  – Iterative refinement: Snyder, Wilkinson, Moler, Skeel, …

  Repeat
    r = A*x − b        … computed with extra precision
    Solve A*d = r      … using the existing LU factorization of A
    x = x − d
  Until “accurate enough” or no progress
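A hedged sketch of the loop above in Fortran 95, assuming A has already been factored by DGETRF (factors in LU, pivots in IPIV); a quad-precision kind stands in for the extra precise BLAS when forming the residual, and a fixed iteration count replaces the "accurate enough" test.

! Illustrative refinement loop; not the library's actual code.
subroutine refine(n, a, lu, ipiv, b, x, nsteps)
  implicit none
  integer, parameter :: qp = selected_real_kind(30)   ! extended-precision kind
  integer, intent(in)             :: n, ipiv(n), nsteps
  double precision, intent(in)    :: a(n,n), lu(n,n), b(n)
  double precision, intent(inout) :: x(n)             ! in: first solve, out: refined
  double precision :: d(n)
  real(qp)         :: r(n)
  integer :: it, info
  external dgetrs
  do it = 1, nsteps
     ! r = A*x - b, accumulated in extra precision, then rounded back
     r = matmul(real(a, qp), real(x, qp)) - real(b, qp)
     d = real(r, kind(1.0d0))
     ! Solve A*d = r, reusing the existing LU factorization
     call dgetrs('N', n, 1, lu, n, ipiv, d, n, info)
     x = x - d
  end do
end subroutine refine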
Execution time of PDPOSV for various grid shapes
[Plot: time in seconds vs. problem size (1000–10000) for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, 6x10]
Speedups from 1.33x to 11.5x
Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory
Optimal grid (6x10) for PDPOSV
Comparison between Computation and Redistribution of Data from Linear Grid
[Plot: calculation time vs. redistribution time (seconds, log scale) for problem sizes 1000–10000]
Times obtained on: 60 processors, dual AMD Opteron 1.4 GHz (64-bit) with Myrinet interconnect, 2 GB memory
Execution time of PDGESV for various grid shapes
[Plot: time in seconds vs. problem size (1000–10000) for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, 6x10]
Speedups from 1.8x to 7.5x
Times obtained on: 60 processors, dual Intel Xeon 2.4 GHz (IA-32) with Gigabit Ethernet interconnect, 2 GB memory
Optimal grid (4x15) for PDGESV
Comparison between Computation and Redistribution of Data from Linear Grid
[Plot: calculation time vs. redistribution time (seconds, log scale) for problem sizes 1000–10000]
Times obtained on: 60 processors, dual Intel Xeon 2.4 GHz (IA-32) with Gigabit Ethernet interconnect, 2 GB memory
Execution time of PDPOSV for various grid shapes
[Plot: time in seconds vs. problem size (1000–10000) for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, 6x10]
Speedups from 2.2x to 6x
Times obtained on: 60 processors, dual Intel Xeon 2.4 GHz (IA-32) with Gigabit Ethernet interconnect, 2 GB memory
Optimal grid (6x10) for PDPOSV
Comparison between Computation and Redistribution of Data from Linear Grid
[Plot: calculation time vs. redistribution time (seconds, log scale) for problem sizes 1000–10000]
Times obtained on: 60 processors, dual Intel Xeon 2.4 GHz (IA-32) with Gigabit Ethernet interconnect, 2 GB memory
Execution time of PDGEHRD for various grid shapes
[Plot: time in seconds vs. problem size (1000–10000) for grid shapes 1x60, 2x30, 3x20, 4x15, 5x12, 6x10]
Speedups from 2.2x to 2.8x
Times obtained on: 60 processors, IBM SP RS/6000, 16 shared-memory (16 GB) processors per node, 4 nodes connected by a high-bandwidth, low-latency switching network; block size 64
Optimal grid (6x10) for PDGEHRD
Comparison between Computation and Redistribution of Data from Linear Grid
[Plot: calculation time vs. redistribution time (seconds, log scale) for problem sizes 1000–10000]
Times obtained on: 60 processors, IBM SP RS/6000, 16 shared-memory (16 GB) processors per node, 4 nodes connected by a high-bandwidth, low-latency switching network; block size 64
The Future of LAPACK and ScaLAPACK
www.cs.berkeley.edu/~demmel/Sca-LAPACK-Proposal.pdf
Jim Demmel
UC Berkeley
Householder XVI
OO-like interfaces (VE)
• Aim: native looking interfaces in modern
languages (C++, F90, Java, Python…)
• Overloading to have identical interfaces for all
data types (precision/storage)
• Overloading to omit certain parameters (stride,
transpose,…)
• Expose data only when necessary (automatic
allocation of temporaries, pivoting
permutation,….)
• Mechanism: Use Babel/SIDL to interface to
existing F77 code base
Example of HLL interface (VE)
// C++ example
Lapack::PDTridiagonalMatrix A = Lapack::PDTridiagonalMatrix::_create();
double *x = (double*) malloc(SIZE*sizeof(double));
double *y = (double*) malloc(SIZE*sizeof(double));
A.MatVec(x, y);   // or A.MatVec(MatTranspose, x, y);
A.Factor();
A.Solve(y);
! F90 example
! Create matrix from user array
allocate(data_array(20,25)) ! Fill in the array….
call CreateWithArray(A,data_array)
! Retrieve the array of a (factored) matrix
call GetArray(A,mat)
mat_array => mat%d_data
print *,'pivots',mat_array(0,0),mat_array(1,1)
Lots of details omitted here!