Resources for parallel computing

BLAS
Basic Linear Algebra Subprograms. Originally published in ACM TOMS (1979)
(Linpack → BLAS + LAPACK). They implement matrix operations up to matrix-matrix multiplication and triangular solves, but not matrix factorizations or eigenvalue calculations. A reference implementation is on netlib.org.
Web page
www.netlib.org/blas
"Frequently asked questions"
BLAS
1) What and where are the BLAS?
2) Are there legal restrictions on the use of BLAS reference implementation software?
3) Publications/references for the BLAS?
4) Is there a Quick Reference Guide to the BLAS available?
5) Are optimized BLAS libraries available? Where can I find vendor supplied BLAS?
6) Where can I find Java BLAS?
7) Is there a C interface to the BLAS?
8) Are prebuilt reference implementations of the Fortran77 BLAS available?
9) What about shared memory machines? Are there multithreaded versions of the BLAS available?
1) What and where are the BLAS?
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-….
On Wikipedia
Example of a reference subroutine:
      SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY )
*     .. Scalar Arguments ..
      DOUBLE PRECISION   ALPHA, BETA
      INTEGER            INCX, INCY, LDA, M, N
      CHARACTER*1        TRANS
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
*
*  Purpose
*  =======
*
*  DGEMV performs one of the matrix-vector operations
*
*     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
*
*  where alpha and beta are scalars, x and y are vectors and A is an
*  m by n matrix.
*
*  Parameters
*  ==========
*  ...
*  M      - INTEGER.
*           On entry, M specifies the number of rows of the matrix A.
*           M must be at least zero.
*  ...
              IF( INCY.EQ.1 )THEN
                  DO 60, J = 1, N
                      IF( X( JX ).NE.ZERO )THEN
                          TEMP = ALPHA*X( JX )
                          DO 50, I = 1, M
                              Y( I ) = Y( I ) + TEMP*A( I, J )
   50                     CONTINUE
                      END IF
                      JX = JX + INCX
   60             CONTINUE
      ...
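For readers more at home in C, the no-transpose branch of the operation above can be sketched as follows. This is only an illustration, not the Netlib code: the name gemv_notrans is made up for this example, and the sketch assumes column-major ("Fortran fashion") storage and unit strides (INCX = INCY = 1).

```c
#include <stddef.h>

/* Illustrative sketch (hypothetical name, not part of any BLAS library):
   y := alpha*A*x + beta*y for a column-major m-by-n matrix A with leading
   dimension lda. Assumes unit strides and no transpose. */
void gemv_notrans(int m, int n, double alpha, const double *a, int lda,
                  const double *x, double beta, double *y)
{
    /* First form y := beta*y (beta == 0 clears y without reading it). */
    for (int i = 0; i < m; i++)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* Then add alpha*A*x one column at a time, so that A is traversed
       contiguously in memory, as in the DO 60 / DO 50 loops above. */
    for (int j = 0; j < n; j++) {
        if (x[j] != 0.0) {
            double temp = alpha * x[j];
            const double *col = a + (size_t)lda * j;
            for (int i = 0; i < m; i++)
                y[i] += temp * col[i];
        }
    }
}
```

The column-by-column loop order is the point of the example: with column-major storage the inner loop walks through memory with stride one.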
BLAS quick reference
(see www.netlib.org/blas/blasqr.pdf)
LAPACK
Fortran subroutines for linear equations (dense, banded), linear least squares
problems, eigenvalue problems and singular values. There is a printed user guide
(LAPACK Users' Guide, 11 authors, SIAM, 1999); part of this guide is available as
HTML documents on www.netlib.org/lapack/lug, and there are man pages in
www.netlib.org/lapack/manpages.tgz, which are worth installing. The routines
were written with parallel computation in mind.
Web page
www.netlib.org/lapack
Atlas
www.math-atlas.sourceforge.net. Open source implementation of BLAS and a
few LAPACK routines. These must be (or should be) built (with make) on the
machine where they are to be used (may take a few hours?).
GotoBLAS (pronounced: gótóblas)
http://www.tacc.utexas.edu/resources/software/. The web page boasts: "currently
the fastest implementation of the Basic Linear Algebra Subroutines".
Intel MKL (Math kernel library)
Academic license $160. Includes:
BLAS
Selected LAPACK routines
Fortran 95 interface
CBLAS (interface to call from C)
Sparse BLAS
Sparse linear equation solvers
ScaLAPACK
Some statistical functions (incl. random number generation)
Some MPI support
Fast Fourier transforms
PDE solution support (I think incl. Poisson solver)
Some numerical optimization
User manual and reference manual
These come with the program, and are also both available on
http://www.intel.com/cd/software/products/asmo-na/eng/345631.htm
(107 pages and 3250 pages).
AMD Core Math Library (ACML)
web page
www.amd.com/acml.
DGEMM benchmarks
From Intel Web site:
From GotoBLAS web site:
From a neutral (?) web site:
Example of BLAS use
The following examples are from programs for evaluation of VARMA time-series
likelihood that I wrote first in Matlab (~3500 lines) and am close to finishing translating into C (~7000 lines). Eventually I hope to call some parallel BLAS routines
and report on timing comparison between the Matlab and C versions. The Matlab
programs are on www.hi.is/~jonasson, and the C-programs are on the way there.
omega_factor.m
%OMEGA_FACTOR Cholesky factorization of Omega
%
% [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko) calculates the Cholesky
% factorization Omega = L· L' of Omega which is stored in two parts, a full
% upper left partition, Su, and a block-band lower partition, Olow, as returned
% by omega_build. Omega is symmetric, only the lower triangle of Su is
% populated, and Olow only stores diagonal and subdiagonal blocks. On exit, L =
% [L1; L2] with L1 = [Lu 0] and L2 is stored in block-band-storage in Ll. Info
% is 0 on success, otherwise the loop index resulting in a negative number
% square root. P and q are the dimensions of the problem and ko is a vector
% with ko(t) = number of observed values before time t.
%
% In the complete data case ko should be 0:r:n*r. For missing values, Su and
% Olow are the upper left and lower partitions of Omega_o = Omega with missing
% rows and columns removed. In this case Lu and Ll return L_o, the Cholesky
% factor of Omega_o.
function [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko)
n = length(ko)-1;
h = max(p,q);
ro = diff(ko);
Ll = zeros(size(Olow));
[Lu,info] = chol(Su'); % upper left partition
if info>0, return; end
Lu = Lu';
e = ko(h+1); % order of Su
for t = h+1 : n % loop over block-lines in Olow
  K = ko(t)+1 : ko(t+1);
  KL = K - ko(t-q);
  JL = 1 : ko(t)-ko(t-q);
  tmin = t-q; tmax = t-1;
  Ll(K-e, JL) = omega_forward(Lu, Ll, Olow(K-e,JL)', p, q, ko, tmin, tmax)';
  [Ltt, info] = chol(Olow(K-e,KL) - Ll(K-e,JL)*Ll(K-e,JL)');
  if info>0, info = info + ko(t); return; end
  Ll(K-e, KL) = Ltt';
end
end
OmegaFactor.c
#include "xAssert.h"
#include "BlasGateway.h"
#include "VarmaUtilities.h"
#include "Omega.h"
void OmegaFactor ( // Cholesky-factorize Omega, or, for missing values, Omega_o
    double Su[],   // in/out  upper left part of Omega, dimension mSu × mSu
    double Olow[], // in/out  mOlow×nOlow, block-diagonals of lower part of Omega
    int p,         // in      number of autoregressive terms
    int q,         // in      number of moving average terms
    int n,         // in      length of time series
    int ko[],      // in      ko[t] = N of observed values before time t+1, t<=n
    int *info)     // out     0 if ok, otherwise k for first nonpositive Ltt
{
    // Finds Cholesky factorization of covariance matrix Omega_o for missing value
    // VARMA log-likelihood. Also handles complete data with ko[i]=r*i for all i.
    // The Cholesky factors overwrite Omega in memory. All matrices are stored in
    // Fortran fashion.
    //
    double *U, *Ltt;
    int t, j, ro;
    int h = max(p,q);
    int mSu = ko[h];       // order of Su
    int mOlow = ko[n]-mSu; // no. of rows in Olow
    //
    // CHOLESKY-FACTORIZE Su INTO Lu· Lu' (Lu OVERWRITES Su):
    if (mSu>0)
        potrf("Low", mSu, Su, mSu, info);
    else
        *info = 0;
    xAssert(*info >= 0);
    if (*info>0) return;
    //
    // TURN ATTENTION TO Olow:
    U = Olow;
    for (t=h; t<n; t++) {
        ro = ko[t+1] - ko[t];
        if (ro>0) {
            j = ko[t] - ko[t-q];
            Ltt = U + mOlow*j;
            // Solve L(0:t-1,0:t-1)· U' = s' (U contains s on call):
            OmegaForward("T", Su, Olow, p, q, ko, n, U, mOlow, ro, t-q, t-1);
            // Ltt-U· U' and then Cholesky of that
            syrk("Low", "N", ro, j, -1.0, U, mOlow, 1.0, Ltt, mOlow);
            potrf("Low", ro, Ltt, mOlow, info);
            if (*info>0) { *info += ko[t]; return; }
        }
        U += ro;
    }
}
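The syrk call above forms Ltt - U· U' in the lower triangle of Ltt. As a rough guide to what such a symmetric rank-k update does, here is a plain-C sketch. The function name syrk_lower_notrans is hypothetical, the sketch covers only the "Low", "N" case with column-major storage, and it makes no attempt at the blocking an optimized BLAS would use.

```c
#include <stddef.h>

/* Illustrative sketch (hypothetical name): C := alpha*A*A' + beta*C, where
   A is n-by-k and only the lower triangle of the n-by-n matrix C is
   referenced and updated. Column-major storage with leading dimensions
   lda and ldc. */
void syrk_lower_notrans(int n, int k, double alpha, const double *a, int lda,
                        double beta, double *c, int ldc)
{
    for (int j = 0; j < n; j++) {
        for (int i = j; i < n; i++) {      /* rows on or below the diagonal */
            double s = 0.0;
            for (int l = 0; l < k; l++)    /* (A*A')(i,j) = sum_l A(i,l)*A(j,l) */
                s += a[i + (size_t)lda * l] * a[j + (size_t)lda * l];
            c[i + (size_t)ldc * j] = alpha * s + beta * c[i + (size_t)ldc * j];
        }
    }
}
```

With alpha = -1.0 and beta = 1.0, as in the call above, this subtracts A*A' from the lower triangle of C and leaves the strictly upper triangle untouched.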
From omega_forward.m
LT.LT = true;
if m>0, X(1:m,:) = linsolve(Lu(j:end,j:end), Y(1:m,:), LT); end
for t=max(h+1,tmin):tmax
  t1 = max(t-q,tmin);
  J = ko(t1)+1 : ko(t);
  K = ko(t)+1 : ko(t+1);
  JX = J - j + 1;
  KX = K - j + 1;
  JL = J - ko(t-q);
  KL = K - ko(t-q);
  X(KX,:) = X(KX,:) - Ll(K-e,JL)*X(JX,:);
  X(KX,:) = linsolve(Ll(K-e,KL), X(KX,:), LT);
end
From OmegaForward.c
if (transp && e>0)
    trsm("Right", "Low", "T", "NotUdia", nY, m, 1.0, Luii, e, Y, iY);
else if (e>0)
    trsm("Left", "Low", "NoT", "NotUdia", m, nY, 1.0, Luii, e, Y, iY);
//
// FIND SECOND PARTITION OF X:
incY = transp ? iY : 1;
tbeg = max(h,tmin);
for (t=tbeg; t<=tmax; t++) {
    k1 = max(t-q,tmin);
    k = ko[t] - ko[k1];
    Yt = Y + (ko[t] - j)*incY;
    Yk = Yt - k*incY;
    Lt = Ll + ko[t] - e;
    Ltt = Lt + mLl*(ko[t] - ko[t-q]);
    Ltk = Ltt - mLl*k;
    ro = ko[t+1] - ko[t];
    if (transp && iY>0) {
        gemm("NT", "T", nY, ro, k, -1.0, Yk, iY, Ltk, mLl, 1.0, Yt, iY);
        trsm("Right", "Low", "T", "NotUdia", nY, ro, 1.0, Ltt, mLl, Yt, iY);
    }
    else if (!transp && mLl>0) {
        gemm("NT", "NT", ro, nY, k, -1.0, Ltk, mLl, Yk, iY, 1.0, Yt, iY);
        trsm("Left", "Low", "NT", "NotUdia", ro, nY, 1.0, Ltt, mLl, Yt, iY);
    }
}
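The trsm calls with "Left", "Low", no transpose and a non-unit diagonal solve L· X = B by forward substitution, overwriting B with X. A minimal plain-C sketch of that one case follows. The name trsm_left_lower is made up for this illustration, alpha is taken to be 1.0 as in the calls above, and storage is assumed to be column-major.

```c
#include <stddef.h>

/* Illustrative sketch (hypothetical name): solve L*X = B for X, where L is
   an m-by-m lower triangular matrix with non-unit diagonal and B is m-by-n.
   B is overwritten by X. Column-major storage, leading dimensions ldl, ldb. */
void trsm_left_lower(int m, int n, const double *l, int ldl,
                     double *b, int ldb)
{
    for (int j = 0; j < n; j++) {          /* each right-hand side in turn */
        double *col = b + (size_t)ldb * j;
        for (int k = 0; k < m; k++) {
            col[k] /= l[k + (size_t)ldl * k];   /* x_k is now final */
            for (int i = k + 1; i < m; i++)     /* eliminate x_k below */
                col[i] -= col[k] * l[i + (size_t)ldl * k];
        }
    }
}
```

The "Right", "Low", "T" variant used in the transposed branch solves X· L' = B instead; it is the same substitution applied to the rows of B.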
Gateway function to reference Blas/Lapack
// Gateway function to reference Blas/Lapack
#include "BlasGateway.h"
#include "blasF.h"
void gemm(char *transa, char *transb, int m, int n, int k, double alpha,
          double a[], int lda, double b[], int ldb, double beta, double c[], int ldc) {
    dgemm(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc, 1, 1);
}
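The gateway pattern can be tried out without a Fortran library by substituting a stand-in routine. In the sketch below, ref_daxpy_ is a hypothetical C function that only mimics how a Fortran routine looks from C (every argument is a pointer, and the trailing underscore follows a common compiler naming convention). It is not the Netlib DAXPY and, unlike the real routine, assumes positive increments. The wrapper axpy is likewise an assumed name in the style of the gateway above.

```c
/* Hypothetical stand-in for a Fortran routine: y := alpha*x + y.
   All arguments arrive by reference, as they would from Fortran.
   Assumes *incx > 0 and *incy > 0. */
static void ref_daxpy_(const int *n, const double *da, const double *dx,
                       const int *incx, double *dy, const int *incy)
{
    int ix = 0, iy = 0;
    for (int i = 0; i < *n; i++) {
        dy[iy] += *da * dx[ix];
        ix += *incx;
        iy += *incy;
    }
}

/* Gateway: the caller passes scalars by value; the addresses of the local
   copies are handed on to the Fortran-style routine. */
void axpy(int n, double alpha, const double x[], int incx,
          double y[], int incy)
{
    ref_daxpy_(&n, &alpha, x, &incx, y, &incy);
}
```

Keeping all such wrappers in one gateway file means that the rest of the C code never sees the pointer-passing convention, and that switching to a different BLAS only touches that file.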
BLAS/LAPACK from Fortran examples
Cholesky factorization using Netlib LAPACK95
From http://www.netlib.org/lapack95/html/DOC:
!  SUBROUTINE LA_POTRF( A, UPLO, RCOND, NORM, INFO )
!     <type>(<wp>), INTENT(INOUT) :: A(:,:)
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: UPLO
!     REAL(<wp>), INTENT(OUT), OPTIONAL :: RCOND
!     CHARACTER(LEN=1), INTENT(IN), OPTIONAL :: NORM
!     INTEGER, INTENT(OUT), OPTIONAL :: INFO
Cholesky factorization from MKL
Excerpt from MKL Reference Manual:
Matrix-matrix multiply
Using Intel MKL (help from the reference manual):
There is a description (incompatible with MKL) and a link to a reference implementation on Netlib: http://www.netlib.org/blas/blast-forum (difficult to download and use). Here is an example from the description:
For calling from Fortran 77 see the call to the reference version of DGEMM as
shown above in the quote from the MKL reference manual.
Sparse BLAS
A Fortran 95 reference implementation was published in ACM TOMS in 2002:
Included in MKL.
Originally defined by the BLAS Technical Forum, see Netlib
(http://www.netlib.org/blas/blast-forum):
CBLAS
Netlib has a description and an implementation (interface to the Fortran reference BLAS), see http://www.netlib.org/blas/blast-forum. Example:
Notice that the dimension arguments are passed by value, and not by reference as
is necessary when calling the Fortran routines directly from C.
Information on MKL web: www.intel.com/software/products/mkl/docs/mklqref/
Atlas (see above) has a free implementation and a quick reference card.
GNU Scientific Library (http://www.gnu.org/software/gsl) provides BLAS access
through a different interface (with its own vector / matrix data types).