Introduction to ScaLAPACK

ScaLAPACK
• ScaLAPACK is for solving dense linear systems and computing eigenvalues of dense matrices.

ScaLAPACK
• Scalable Linear Algebra PACKage
• Developing team from the University of Tennessee, University of California Berkeley, ORNL, Rice University, UCLA, UIUC, etc.
• Support in commercial packages:
  – NAG Parallel Library
  – IBM PESSL
  – CRAY Scientific Library
  – VNI IMSL
  – Fujitsu, HP/Convex, Hitachi, NEC

Description
• A Scalable Linear Algebra PACKage (ScaLAPACK)
• A library of linear algebra routines for MPP and NOW machines
• Language: Fortran
• Dense matrix problem solvers:
  – Linear equations
  – Least squares
  – Eigenvalues

Technical Information
• ScaLAPACK web page
  – http://www.netlib.org/scalapack
• ScaLAPACK User's Guide
• Technical Working Notes
  – http://www.netlib.org/lapack/lawn

Packages
• LAPACK
  – Single-PE linear algebra solvers, utilities, etc.
• BLAS
  – Single-PE workhorse routines (vector, matrix-vector, and matrix-matrix operations)
• PBLAS
  – Parallel counterparts of the BLAS; rely on the BLACS for blocking and moving data.

Packages (cont.)
• BLACS
  – Construct 1-D and 2-D processor grids for matrix decomposition and nearest-neighbor communication patterns. Use message-passing libraries (MPI, PVM, MPL, NX, etc.) for data movement.

Dependencies & Parallelism of Component Packages
• ScaLAPACK is built on LAPACK and the PBLAS; the PBLAS are built on the BLAS and the BLACS; the BLACS sit on a message-passing library (MPI, PVM, ...).
• LAPACK and the BLAS are serial; ScaLAPACK, the PBLAS, the BLACS, and the message-passing layer are parallel.

Data Partitioning
• Initialize the BLACS:
  – Set up the 2-D mesh (mp, np)
  – Set (CorR) column- or row-major ordering
  – The call returns a context handle (icon)
  – Get this PE's row and column through gridinfo (myrow, mycol)
• blacs_gridinit( icon, CorR, mp, np )
• blacs_gridinfo( icon, mp, np, myrow, mycol )

Data Partitioning (cont.)
• Block-cyclic array distribution:
  – Block: groups of consecutive elements
  – Cyclic: round-robin assignment of the blocks to PEs
• Example 1-D partitioning:
  – A global 9-element array distributed on a 3-PE grid with 2-element blocks: block(2)-cyclic(3). See the sketch after the figure below.

Block-Cyclic Partitioning
Block size = 2, cyclic on a 3-PE grid

  Global array:  11 12 13 14 15 16 17 18 19

  Local arrays:
    PE 0: 11 12 17 18    (blocks 0 and 3)
    PE 1: 13 14 19       (blocks 1 and 4)
    PE 2: 15 16          (block 2)

A partitioned array is often shown by this type of grid map; the block numbers 0-4 indicate the sequence of the cyclic distribution.
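To make the 1-D mapping concrete, here is a minimal self-contained sketch (an illustration added here, not code from the original slides) of the block-cyclic arithmetic. It reproduces the PE assignments and local indices of the figure above; the same formulas reappear in the SETARRAY routine of the example program later in these notes.

      program map1d
      implicit none
      ! block size, PE count, and source PE for the
      ! block(2)-cyclic(3) example above
      integer, parameter :: n = 9, nb = 2, npes = 3, isrc = 0
      integer :: g, pe, loc
      do g = 1, n
         ! owning PE: blocks are dealt out round-robin starting at isrc
         pe = mod( isrc + (g-1)/nb, npes )
         ! local index: offset within the block, plus nb for every
         ! earlier block already stored on the same PE
         loc = mod( g-1, nb ) + 1 + ( (g-1)/(npes*nb) )*nb
         print *, 'global element', g, ' -> PE', pe, ', local index', loc
      end do
      end program map1d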
2-D Partitioning
• Example 2-D partitioning:
  – A(9,9) mapped onto a PE(2,3) grid with block(2,2)
  – Use the 1-D example to distribute the column elements in each row: a block(2)-cyclic(3) distribution.
  – Assign rows to the first PE dimension by a block(2)-cyclic(2) distribution.

Block-Cyclic Partitioning (2-D)
Global matrix (9x9), block size = 2x2, cyclic on a 2x3 PE grid

  Global matrix (entry ij denotes row i, column j):

    11 12 13 14 15 16 17 18 19
    21 22 23 24 25 26 27 28 29
    31 32 33 34 35 36 37 38 39
    41 42 43 44 45 46 47 48 49
    51 52 53 54 55 56 57 58 59
    61 62 63 64 65 66 67 68 69
    71 72 73 74 75 76 77 78 79
    81 82 83 84 85 86 87 88 89
    91 92 93 94 95 96 97 98 99

  Partitioned array on the 2x3 processor grid:

    PE (0,0)        PE (0,1)     PE (0,2)
    11 12 17 18     13 14 19     15 16
    21 22 27 28     23 24 29     25 26
    51 52 57 58     53 54 59     55 56
    61 62 67 68     63 64 69     65 66
    91 92 97 98     93 94 99     95 96

    PE (1,0)        PE (1,1)     PE (1,2)
    31 32 37 38     33 34 39     35 36
    41 42 47 48     43 44 49     45 46
    71 72 77 78     73 74 79     75 76
    81 82 87 88     83 84 89     85 86

Syntax for descinit

  descinit( idesc, m, n, mb, nb, i, j, icon, mla, ier )

  In/Out  arg    Description
  OUT     idesc  Descriptor
  IN      m      Row size (global)
  IN      n      Column size (global)
  IN      mb     Row block size
  IN      nb     Column block size
  IN      i      Starting grid row
  IN      j      Starting grid column
  IN      icon   BLACS context
  IN      mla    Leading dimension of the local matrix
  OUT     ier    Error number

The starting grid location is usually (i=0, j=0). A usage sketch follows.
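Putting blacs_gridinit, blacs_gridinfo, numroc, and descinit together for the 9x9 example above, a descriptor could be built as sketched below. This is an added illustration, not code from the original slides; it assumes it is run on exactly 6 MPI processes (the 2x3 grid), and the max(1, ...) guards against a PE holding no rows.

      program descinit_sketch
      implicit none
      integer :: ictxt, nprow, npcol, myrow, mycol, lld, info
      integer :: desc_a(9)
      integer, external :: numroc
      ! build the 2x3 process grid of the figure above, row-major order
      call blacs_get( 0, 0, ictxt )
      call blacs_gridinit( ictxt, 'R', 2, 3 )
      call blacs_gridinfo( ictxt, nprow, npcol, myrow, mycol )
      ! number of rows of the 9x9 global matrix stored on this PE;
      ! this is the leading dimension of the local piece
      lld = max( 1, numroc( 9, 2, myrow, 0, nprow ) )
      ! 9x9 global matrix, 2x2 blocks, starting at grid position (0,0)
      call descinit( desc_a, 9, 9, 2, 2, 0, 0, ictxt, lld, info )
      if ( info /= 0 ) print *, 'descinit error, info = ', info
      call blacs_gridexit( ictxt )
      call blacs_exit( 0 )
      end program descinit_sketch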
Application Program Interfaces (API)
• Drivers
  – Solve a complete problem
• Computational components
  – Perform tasks: LU factorization, etc.
• Auxiliary routines
  – Scaling, matrix norms, etc.
• Matrix redistribution/copy routine
  – Matrix on PE grid 1 -> matrix on PE grid 2

API (cont.)
• Prepend the equivalent LAPACK name with P: PXYYZZZ, where X is the data type, YY the matrix type, and ZZZ the computation performed.
• Data type (X):
  – S  real
  – D  double precision
  – C  complex
  – Z  double complex

Matrix Types (YY) in PXYYZZZ
  DB  General Band (diagonally dominant-like)
  DT  General Tridiagonal (diagonally dominant-like)
  GB  General Band
  GE  GEneral matrices (e.g., unsymmetric, rectangular, etc.)
  GG  General matrices, Generalized problem
  HE  complex HErmitian
  OR  ORthogonal real
  PB  Positive definite Banded (symmetric or Hermitian)
  PO  POsitive definite (symmetric or Hermitian)
  PT  Positive definite Tridiagonal (symmetric or Hermitian)
  ST  Symmetric Tridiagonal real
  SY  SYmmetric
  TR  TRiangular (possibly quasi-triangular)
  TZ  TrapeZoidal
  UN  UNitary complex

Drivers (ZZZ) in PXYYZZZ — routine types
  SL   Linear Equations (SVX*)
  SV   LU Solver
  VD   Singular Value
  EV   Eigenvalue (EVX**)
  GVX  Generalized Eigenvalue

  *  The expert driver (SVX) estimates condition numbers, provides error bounds, and equilibrates the system.
  ** EVX computes selected eigenvalues.

For example, PSGESV is the parallel (P), single-precision real (S), general-matrix (GE) LU solver (SV); it is the routine used in the example program below.

Rules of Thumb
• Use a square processor grid, mp = np.
• Use block size 32 (Cray) or 64, for efficient matrix multiply.
• Number of PEs to use: M x N / 10**6.
• Bandwidth per node (MB/sec) > peak rate (MFLOPS) / 10.
• Latency < 500 microseconds.
• Use the vendor's BLAS.

Example program: Linear Equation Solution Ax=b (page 1)

      program Sample_ScaLAPACK_program
      implicit none
      include "mpif.h"
      REAL :: A(1000,1000)
      REAL, DIMENSION(:,:), ALLOCATABLE :: A_local
      INTEGER, PARAMETER :: M_A=1000, N_A=1000
      INTEGER :: DESC_A(1:9)
      REAL, DIMENSION(:,:), ALLOCATABLE :: b_local
      REAL :: b(1000,1)
      INTEGER, PARAMETER :: M_B=1000, N_B=1
      INTEGER :: DESC_b(1:9), i, j
      INTEGER :: ldb_y
      INTEGER :: ipiv(1:1000), info
      INTEGER :: myrow, mycol   ! row and column # of this PE on the PE grid
      INTEGER :: ictxt          ! context handle.....like an MPI communicator
      INTEGER :: lda_y, lda_x   ! leading dimensions of the local array A
      ! row and column # of the PE grid; ordinary variables rather than
      ! PARAMETERs, because blacs_gridinfo writes its outputs into them
      INTEGER :: nprow = 2, npcol = 2
      !
      ! RSRC...the PE row that owns the first row of A
      ! CSRC...the PE column that owns the first column of A

Example program (page 2)

      INTEGER, PARAMETER :: RSRC = 0, CSRC = 0
      INTEGER, PARAMETER :: MB = 32, NB = 32
      !
      ! iA...the global row index of A that points to the beginning
      ! of the submatrix that will be operated on (likewise jA, iB, jB)
      !
      INTEGER, PARAMETER :: iA = 1, jA = 1
      INTEGER, PARAMETER :: iB = 1, jB = 1
      DOUBLE PRECISION :: starttime, endtime  ! MPI_WTIME returns double precision
      INTEGER :: taskid, numtask, ierr
      !*************************************************************
      ! NOTE: ON THE NPACI MACHINES YOU DO NOT NEED TO CALL INITBUFF;
      ! ON OTHER MACHINES YOU MAY NEED TO CALL INITBUFF
      !*************************************************************
      !
      ! This is where you should define any other variables
      !
      INTEGER, EXTERNAL :: NUMROC
      !
      ! Initialization of the processor grid. This routine
      ! must be called.
      !
      ! start MPI initialization routines

Example program (page 3)

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, taskid, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numtask, ierr)
      ! start the timing on the (0,0) processor
      IF (taskid == 0) THEN
         starttime = MPI_WTIME()
      END IF
      CALL BLACS_GET(0, 0, ictxt)
      CALL BLACS_GRIDINIT(ictxt, 'r', nprow, npcol)
      !
      ! This call returns the processor grid information that will be
      ! used to distribute the global array
      !
      CALL BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol)
      !
      ! define lda_x and lda_y with numroc
      ! lda_x...the leading dimension of the local array storing the
      !         local blocks of the distributed matrix A
      !
      lda_x = NUMROC(M_A, MB, myrow, 0, nprow)
      lda_y = NUMROC(N_A, NB, mycol, 0, npcol)
      !
      ! size A_local exactly, so no space is wasted
      !

Example program (page 4)

      ALLOCATE( A_local(lda_x, lda_y) )
      if (N_B < NB) then
         ALLOCATE( b_local(lda_x, N_B) )
      else
         ldb_y = NUMROC(N_B, NB, mycol, 0, npcol)
         ALLOCATE( b_local(lda_x, ldb_y) )
      end if
      !
      ! defining the global matrix A
      !
      do i = 1, 1000
         do j = 1, 1000
            if (i == j) then
               A(i,j) = 1000.0*i*j
            else
               A(i,j) = 10.0*(i+j)
            end if
         end do
      end do
      ! defining the global array b
      do i = 1, 1000
         b(i,1) = 1.0
      end do

Example program (page 5)

      !
      ! subroutine setarray maps the global arrays to the local arrays;
      ! YOU MUST HAVE DEFINED THE GLOBAL MATRIX BEFORE THIS POINT
      !
      CALL SETARRAY(A, A_local, b, b_local, myrow, mycol, lda_x, lda_y)
      !
      ! initialize the descriptor vectors for ScaLAPACK
      !
      CALL DESCINIT(DESC_A, M_A, N_A, MB, NB, RSRC, CSRC, ictxt, lda_x, info)
      CALL DESCINIT(DESC_b, M_B, N_B, MB, N_B, RSRC, CSRC, ictxt, lda_x, info)
      !
      ! call the ScaLAPACK routine
      !
      CALL PSGESV(N_A, 1, A_local, iA, jA, DESC_A, ipiv, b_local, iB, jB, DESC_b, info)
      !
      ! check that ScaLAPACK returned OK
      !

Example program (page 6)

      if (info /= 0) then
         print *, 'unsuccessful return...something is wrong! info = ', info
      else
         print *, 'successful return'
      end if
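      !
      ! Added note (not on the original slides): by ScaLAPACK convention,
      ! info = 0 means success; info = -i means the i-th argument had an
      ! illegal value (for the j-th entry of an array argument,
      ! info = -(i*100+j)); and info = k > 0 means the computed factor U
      ! is exactly zero at U(iA+k-1, jA+k-1), so the matrix is singular
      ! and no solution was computed.
      !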
      CALL BLACS_GRIDEXIT( ictxt )
      ! a nonzero argument tells the BLACS that the message-passing layer
      ! will still be used after this call (for the timing and
      ! MPI_FINALIZE below)
      CALL BLACS_EXIT( 1 )
      ! call the timing routine on processor (0,0) to get the end time
      IF (taskid == 0) THEN
         endtime = MPI_WTIME()
         print *, 'wall clock time in seconds = ', endtime - starttime
      END IF
      ! end MPI
      CALL MPI_FINALIZE(ierr)
      !
      ! end of main
      !
      END

Example program (page 7)

      SUBROUTINE SETARRAY(AA, A, BB, B, myrow, mycol, lda_x, lda_y)
      !
      ! takes the global arrays AA and BB and distributes them
      ! to the local arrays A and B
      !
      implicit none
      REAL :: AA(1000,1000)
      REAL :: BB(1000,1)
      INTEGER :: lda_x, lda_y
      REAL :: A(lda_x, lda_y)
      REAL :: ll, mm, Cr, Cc
      INTEGER :: ii, jj, I, J, myrow, mycol, pr, pc, h, g
      INTEGER, PARAMETER :: nprow = 2, npcol = 2
      INTEGER, PARAMETER :: N=1000, M=1000, NB=32, MB=32, RSRC=0, CSRC=0
      REAL :: B(lda_x, 1)
      INTEGER, PARAMETER :: N_B = 1

Example program (page 8)

      do I = 1, M
         do J = 1, N
            ! find out which PE gets this (I,J) element
            Cr = real( (I-1)/MB )
            h  = RSRC + aint(Cr)
            pr = mod( h, nprow )
            Cc = real( (J-1)/NB )    ! column blocks use NB, not MB
            g  = CSRC + aint(Cc)
            pc = mod( g, npcol )     ! wrap over the npcol process columns
            !
            ! if the element lives on this PE, set A
            !
            if (myrow == pr .and. mycol == pc) then
               ! (ii,jj) are the coordinates of the local array element:
               ! the offset within the block, plus one full block for
               ! every earlier block owned by this PE
               ll = real( (I-1)/(nprow*MB) )
               mm = real( (J-1)/(npcol*NB) )
               ii = mod(I-1, MB) + 1 + aint(ll)*MB
               jj = mod(J-1, NB) + 1 + aint(mm)*NB
               A(ii,jj) = AA(I,J)
            end if
         end do
      end do

Example Program (page 9)

      !
      ! distributing the B vector
      !
      do I = 1, M
         J = 1
         ! find out which PE gets this (I,J) element
         Cr = real( (I-1)/MB )
         h  = RSRC + aint(Cr)
         pr = mod( h, nprow )
         Cc = real( (J-1)/NB )
         g  = CSRC + aint(Cc)
         pc = mod( g, npcol )
         ! if the element lives on this PE, set B
         if (myrow == pr .and. mycol == pc) then
            ll = real( (I-1)/(nprow*MB) )
            mm = real( (J-1)/(npcol*NB) )
            ii = mod(I-1, MB) + 1 + aint(ll)*MB
            jj = mod(J-1, NB) + 1 + aint(mm)*NB
            B(ii,jj) = BB(I,J)
         end if
      end do
      end subroutine

PxGEMM (MFLOPS)

                        N =  2000    4000    6000    8000   10000
  Cray T3E       2x2         1055    1070    1075
                 4x4         3630    4005    4258    4171    4292
                 8x8        13456   14287   15419   15858   16755
  IBM SP2        2x2          755             632
                 4x4         2514    2850    3040    6041
                 8x8         6205    8709    9861   10468   10774
  NOW Berkeley   2x2          463     470
                 4x4          926    1031    1754
                 8x8         4130    5457    6360    6647

PxGESV (MFLOPS)

                        N =  2000    5000    7500   10000   15000
  Cray T3E       1x4          702     884     932
                 2x8         1508    2680    3218    3356    3602
                 4x16        2419    6912    9028   10299   12547
  IBM SP2        1x4          421     603
                 2x8          722    1543    1903    2149
                 4x16         721    2084    2837    3344    3963
  NOW Berkeley   1x4          350
                 2x8          811            1310    1472    1547
                 4x16        1171    3091    3937    4263    4560
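For reference, MFLOPS figures such as those in the tables above are conventionally derived from the standard operation count of the algorithm; for an LU factor-and-solve that count is about 2/3 n**3 + 2 n**2 floating-point operations (this convention is an assumption here, not stated on the slides). A minimal sketch with a hypothetical timing:

      program lu_mflops
      implicit none
      double precision :: n, seconds, flops
      n = 10000.0d0
      seconds = 60.0d0   ! hypothetical measured wall-clock time
      ! LU factorization plus the two triangular solves:
      ! about 2/3*n**3 + 2*n**2 floating-point operations
      flops = (2.0d0/3.0d0)*n**3 + 2.0d0*n**2
      print *, 'MFLOPS = ', flops/seconds/1.0d6
      end program lu_mflops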