Introduction to ScaLAPACK
ScaLAPACK
• ScaLAPACK solves dense linear systems and
  computes eigenvalues of dense matrices
ScaLAPACK
• Scalable Linear Algebra PACKage
• Developed by a team from the University of
  Tennessee, University of California Berkeley,
  ORNL, Rice U., UCLA, UIUC, etc.
• Support in Commercial Packages
  – NAG Parallel Library
  – IBM PESSL
  – CRAY Scientific Library
  – VNI IMSL
  – Fujitsu, HP/Convex, Hitachi, NEC
Description
• A Scalable Linear Algebra Package
(ScaLAPACK)
• A library of linear algebra routines for MPP
and NOW machines
• Language : Fortran
• Dense Matrix Problem Solvers
  – Linear Equations
  – Least Squares
  – Eigenvalue
Technical Information
• ScaLAPACK web page
– http://www.netlib.org/scalapack
• ScaLAPACK User’s Guide
• Technical Working Notes
– http://www.netlib.org/lapack/lawn
Packages
• LAPACK
– Routines are single-PE Linear Algebra solvers,
utilities, etc.
• BLAS
– Single-PE work-horse routines
  (vector, matrix-vector & matrix-matrix)
• PBLAS
– Parallel counterparts of BLAS. Relies on
BLACS for blocking and moving data.
Packages (cont.)
• BLACS
– Constructs 1-D & 2-D processor grids for
matrix decomposition and nearest neighbor
communication patterns. Uses message passing
libraries (MPI, PVM, MPL, NX, etc) for data
movement.
Dependencies & Parallelism
of Component Packages
ScaLAPACK
 |-- LAPACK          (serial)
 |     `-- BLAS      (serial)
 `-- PBLAS           (parallel)
       |-- BLAS      (serial)
       `-- BLACS     (parallel)
             `-- MPI, PVM, ...
Data Partitioning
• Initialize BLACS routines
  – Set up 2-D mesh (mp, np)
  – Set (CorR) Column or Row Major Order
  – Call returns context handle (icon)
  – Get row & column through gridinfo (myrow, mycol)
• blacs_gridinit( icon, CorR, mp, np )
• blacs_gridinfo( icon, mp, np, myrow, mycol )
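For concreteness, here is a minimal, self-contained sketch of this grid set-up (assuming a 2 x 3 row-major grid, i.e. six MPI tasks, and a standard BLACS/ScaLAPACK link line; the program and variable names are illustrative, not part of the slides):

! Sketch: set up and query a 2 x 3 BLACS process grid (assumes 6 MPI tasks)
program grid_setup
  implicit none
  include "mpif.h"
  integer :: icon, mp, np, myrow, mycol, ierr
  call MPI_INIT(ierr)
  mp = 2                                      ! process-grid rows
  np = 3                                      ! process-grid columns
  call BLACS_GET( 0, 0, icon )                ! get the default system context
  call BLACS_GRIDINIT( icon, 'R', mp, np )    ! row-major 2-D grid
  call BLACS_GRIDINFO( icon, mp, np, myrow, mycol )
  print *, 'my grid coordinates are (', myrow, ',', mycol, ')'
  call BLACS_GRIDEXIT( icon )                 ! release the grid
  call BLACS_EXIT( 1 )                        ! 1 = leave MPI running
  call MPI_FINALIZE(ierr)
end program grid_setup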
Data Partitioning (cont.)
• Block-cyclic Array Distribution.
– Block: (groups of sequential elements)
– Cyclic: (round-robin assignment to PEs)
• Example 1-D Partitioning:
– Global 9-element array distributed on a 3-PE
grid with 2 element blocks: block(2)-cyclic(3)
Block-Cyclic Partitioning
Block size = 2, cyclic on a 3-PE grid

Global array:   11 12 13 14 15 16 17 18 19

Blocks are dealt out round-robin (the numbers below indicate the sequence of
the cyclic distribution):
  block 0: 11 12 -> PE 0      block 3: 17 18 -> PE 0
  block 1: 13 14 -> PE 1      block 4: 19    -> PE 1
  block 2: 15 16 -> PE 2

Local arrays:
  PE 0: 11 12 17 18
  PE 1: 13 14 19
  PE 2: 15 16

The partitioned array is often shown by this type of grid map (grid row 0):
  | 11 12 17 18 | 13 14 19 | 15 16 |
       PE 0        PE 1       PE 2
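The round-robin assignment shown above can also be computed directly: under a 1-D block(MB)-cyclic distribution over NPROW PEs, global element I (1-based) lives on PE mod((I-1)/MB, NPROW). A small sketch that reproduces the figure (the program name and element values 11-19 are just the example above):

! Sketch: print the owning PE of each element in the 1-D example above
program block_cyclic_owner
  implicit none
  integer, parameter :: n = 9, mb = 2, nprow = 3   ! assumes the slide's example
  integer :: i, pe
  do i = 1, n
     pe = mod( (i-1)/mb, nprow )      ! owning PE of global element i
     print *, 'element', 10 + i, 'is stored on PE', pe
  end do
end program block_cyclic_owner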
2-D Partitioning
• Example 2-D Partitioning
  – A(9,9) mapped onto a PE(2,3) grid with block(2,2)
  – Use the 1-D example to distribute the column elements
    in each row: a block(2)-cyclic(3) distribution.
  – Assign rows to the first PE dimension by a block(2)-cyclic(2) distribution.
Global Matrix (9x9), block size = 2x2, cyclic on a 2x3 PE grid:

  11 12 13 14 15 16 17 18 19
  21 22 23 24 25 26 27 28 29
  31 32 33 34 35 36 37 38 39
  41 42 43 44 45 46 47 48 49
  51 52 53 54 55 56 57 58 59
  61 62 63 64 65 66 67 68 69
  71 72 73 74 75 76 77 78 79
  81 82 83 84 85 86 87 88 89
  91 92 93 94 95 96 97 98 99

Partitioned array: local arrays on the 2 x 3 processor grid

  PE (0,0)         PE (0,1)      PE (0,2)
  11 12 17 18      13 14 19      15 16
  21 22 27 28      23 24 29      25 26
  51 52 57 58      53 54 59      55 56
  61 62 67 68      63 64 69      65 66
  91 92 97 98      93 94 99      95 96

  PE (1,0)         PE (1,1)      PE (1,2)
  31 32 37 38      33 34 39      35 36
  41 42 47 48      43 44 49      45 46
  71 72 77 78      73 74 79      75 76
  81 82 87 88      83 84 89      85 86
Syntax for descinit
descinit( idesc, m, n, mb, nb, i, j, icon, mla, ier )

  arg     IorO   Description
  idesc   OUT    Descriptor
  m       IN     Row size (global)
  n       IN     Column size (global)
  mb      IN     Row block size
  nb      IN     Column block size
  i       IN     Starting grid row
  j       IN     Starting grid column
  icon    IN     BLACS context
  mla     IN     Leading dimension of local matrix
  ier     OUT    Error number

The starting grid location is usually (i=0, j=0).
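A sketch of how descinit is typically used together with NUMROC (assuming the 1000 x 1000 matrix, 32 x 32 blocks and 2 x 2 grid of the example program later in these slides, so it expects four MPI tasks; the program name is illustrative). The local extents are obtained first and then passed as the leading dimension mla:

! Sketch: build an array descriptor for a block-cyclically distributed matrix
program build_descriptor
  implicit none
  include "mpif.h"
  integer, parameter :: m = 1000, n = 1000, mb = 32, nb = 32   ! assumed sizes
  integer, parameter :: nprow = 2, npcol = 2, rsrc = 0, csrc = 0
  integer :: icon, myrow, mycol, lda_x, lda_y, desc_a(9), info, ierr
  integer, external :: NUMROC
  call MPI_INIT(ierr)
  call BLACS_GET( 0, 0, icon )
  call BLACS_GRIDINIT( icon, 'R', nprow, npcol )
  call BLACS_GRIDINFO( icon, nprow, npcol, myrow, mycol )
  lda_x = NUMROC( m, mb, myrow, rsrc, nprow )   ! local rows on this PE
  lda_y = NUMROC( n, nb, mycol, csrc, npcol )   ! local columns on this PE
  call DESCINIT( desc_a, m, n, mb, nb, rsrc, csrc, icon, lda_x, info )
  print *, 'PE(', myrow, ',', mycol, ') holds a', lda_x, 'x', lda_y, 'local block'
  call BLACS_GRIDEXIT( icon )
  call BLACS_EXIT( 1 )
  call MPI_FINALIZE(ierr)
end program build_descriptor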
Application Program Interfaces (API)
• Drivers
– Solves a Complete Problem
• Computational Components
– Performs Tasks: LU factorization, etc.
• Auxiliary Routines
– Scaling, Matrix Norm, etc.
• Matrix Redistribution/Copy Routine
– Matrix on PE grid1 -> Matrix on PE grid2
API (cont.)
• Prepend the LAPACK equivalent name with P:   PXYYZZZ
    X   = data type
    YY  = matrix type
    ZZZ = computation performed

  Data type (X):   S = real       D = double
                   C = complex    Z = double complex

  Example: PSGESV = P + S (real single precision) + GE (general matrix)
  + SV (solve a linear system via LU factorization).
Matrix Types (YY)
PXYYZZZ
DB General Band (diagonally dominant-like)
DT general Tridiagonal (Diagonally dominant-like)
GB General Band
GE GEneral matrices (e.g., unsymmetric, rectangular, etc.)
GG General matrices, Generalized problem
HE Complex Hermitian
OR Orthogonal Real
PB Positive definite Banded (symmetric or Hermitian)
PO Positive definite (symmetric or Hermitian)
PT Positive definite Tridiagonal (symmetric or Hermitian)
ST Symmetric Tridiagonal Real
SY SYmmetric
TR TRiangular (possibly quasi-triangular)
TZ TrapeZoidal
UN UNitary complex
Drivers (ZZZ)
Routine types: PXYYZZZ

  SL    Linear Equations (SVX*)
  SV    LU Solver
  VD    Singular Value
  EV    Eigenvalue (EVX**)
  GVX   Generalized Eigenvalue

  *  Estimates condition numbers. Provides error bounds. Equilibrates the system.
  ** Selected eigenvalues.
Rules of Thumb
• Use a square processor grid, mp=np
• Use block size 32 (Cray) or 64
– for efficient matrix multiply
• Number of PEs to use = MxN/10**6 (see the worked example below)
• Bandwidth/node (MB/sec) > peak rate (MFLOPS)/10
• Latency < 500 microseconds
• Use the vendor's BLAS
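As a worked example of the PE-count rule: a 10000 x 10000 matrix gives MxN/10**6 = 10**8/10**6 = 100 PEs, so a square 10 x 10 processor grid with block size 32 or 64 would be a reasonable starting point.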
Example program: Linear Equation Solution Ax=b (page 1)
program Sample_ScaLAPACK_program
implicit none
include "mpif.h"
REAL :: A ( 1000 , 1000)
REAL, DIMENSION(:,:), ALLOCATABLE :: A_local
INTEGER, PARAMETER :: M_A=1000, N_A=1000
INTEGER :: DESC_A(1:9)
REAL, DIMENSION(:,:), ALLOCATABLE :: b_local
REAL :: b( 1000,1 )
INTEGER, PARAMETER :: M_B=1000, N_B=1
INTEGER :: DESC_b(1:9) ,i,j
INTEGER :: ldb_y
INTEGER :: ipiv(1:1000),info
INTEGER :: myrow,mycol
!row and column # of PE on PE grid
INTEGER :: ictxt       ! context handle ... like an MPI communicator
INTEGER :: lda_y,lda_x ! leading dimensions of the local array A_local
! row and column # of PE grid
INTEGER, PARAMETER :: nprow=2, npcol=2
!
! RSRC...the PE row that owns the first row of A
! CSRC...the PE column that owns the first column of A
Example program (page 2)
INTEGER, PARAMETER :: RSRC = 0, CSRC = 0
INTEGER, PARAMETER :: MB = 32, NB=32
!
! iA...the global row index of A, which points to the beginning
!of submatrix that will be operated on.
!
INTEGER, PARAMETER :: iA = 1, jA = 1
INTEGER, PARAMETER :: iB = 1, jB = 1
REAL :: starttime, endtime
INTEGER :: taskid,numtask,ierr
!**********************************
! NOTE: ON THE NPACI MACHINES YOU DO NOT NEED TO CALL INITBUFF;
! ON OTHER MACHINES YOU MAY NEED TO CALL INITBUFF
!*************************************
!
! This is where you should define any other variables
!
INTEGER, EXTERNAL :: NUMROC
!
! Initialization of the processor grid. This routine
! must be called
!
! start MPI initialization routines
Example program (page 3)
CALL MPI_INIT(ierr)
CALL MPI_COMM_RANK(MPI_COMM_WORLD,taskid,ierr)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD,numtask,ierr)
! start timing call on 0,0 processor
IF (taskid==0) THEN
starttime =MPI_WTIME()
END IF
CALL BLACS_GET(0,0,ictxt)
CALL BLACS_GRIDINIT(ictxt,'r',nprow,npcol)
!
! This call returns processor grid information that will be
! used to distribute the global array
!
CALL BLACS_GRIDINFO(ictxt, nprow, npcol, myrow, mycol )
!
! define lda_x and lda_y with numroc
! lda_x...the leading dimension of the local array storing the
! ........local blocks of the distributed matrix A
!
lda_x = NUMROC(M_A,MB,myrow,0,nprow)
lda_y = NUMROC(N_A,NB,mycol,0,npcol)
!
! allocate A_local with exactly the local size so no space is wasted
!
Example program (page 4)
ALLOCATE( A_local (lda_x,lda_y))
if (N_B < NB )then
ALLOCATE( b_local(lda_x,N_B) )
else
ldb_y = NUMROC(N_B,NB,mycol,0,npcol)
ALLOCATE( b_local(lda_x,ldb_y) )
end if
!
! defining global matrix A
!
do i=1,1000
do j = 1, 1000
if (i==j) then
A(i,j) = 1000.0*i*j
else
A(i,j) = 10.0*(i+j)
end if
end do
end do
! defining the global array b
do i = 1,1000
b(i,1) = 1.0
end do
!
Example program (page 5)
! subroutine setarray maps global array to local arrays
!
! YOU MUST HAVE DEFINED THE GLOBAL MATRIX BEFORE THIS POINT
!
CALL SETARRAY(A,A_local,b,b_local,myrow,mycol , lda_x,lda_y )
!
! initialize descriptor vector for scaLAPACK
!
CALL DESCINIT(DESC_A,M_A,N_A,MB,NB,RSRC,CSRC,ictxt,lda_x,info)
CALL DESCINIT(DESC_b,M_B,N_B,MB,N_B,RSRC,CSRC,ictxt,lda_x,info)
!
! call scaLAPACK routines
!
CALL PSGESV( N_A, 1, A_local, iA, jA, DESC_A, ipiv, &
             b_local, iB, jB, DESC_b, info )
!
! check ScaLAPACK returned ok
!
Example program (page 6)
if (info .ne. 0) then
   print *, 'unsuccessful return... something is wrong! info = ', info
else
   print *, 'successful return'
end if
CALL BLACS_GRIDEXIT( ictxt )
CALL BLACS_EXIT ( 0 )
! call timing routine on processor 0,0 to get endtime
IF (taskid==0) THEN
endtime = MPI_WTIME()
print*,'wall clock time in second = ',endtime - starttime
END IF
! end MPI
CALL MPI_FINALIZE(ierr)
!
! end of main
!
END
Example program (page 7)
SUBROUTINE SETARRAY(AA,A,BB,B,myrow,mycol,lda_x, lda_y )
! copies the global arrays AA and BB into the
! block-cyclically distributed local arrays A and B.
!
implicit none
REAL :: AA(1000,1000)
REAL :: BB(1000,1)
INTEGER :: lda_x, lda_y
REAL :: A(lda_x,lda_y)
REAL ::ll,mm,Cr,Cc
INTEGER :: ii,jj,I,J,myrow,mycol,pr,pc,h,g
INTEGER , PARAMETER :: nprow = 2, npcol =2
INTEGER, PARAMETER ::N=1000,M=1000,NB=32,MB=32,RSRC=0,CSRC=0
REAL :: B(lda_x,1)
INTEGER, PARAMETER :: N_B = 1
Example program (page 8)
do I=1,M
do J = 1,N
! finding out which PE gets this I,J element
Cr = real( (I-1)/MB )
h = RSRC+aint(Cr)
pr = mod( h,nprow)
Cc = real( (J-1)/NB )
g = CSRC+aint(Cc)
pc = mod(g,npcol)
!
! check if on this PE and then set A
!
if (myrow ==pr .and. mycol==pc)then
! ii,jj are the coordinates of the element in the local array:
! ii = mod(I-1,MB) + 1 + (local row-block index)*MB
! jj = mod(J-1,NB) + 1 + (local column-block index)*NB
!
ll = real( ( (I-1)/(nprow*MB) ) )
mm = real( ( (J-1)/(npcol*NB) ) )
ii = mod(I-1,MB) + 1 + aint(ll)*MB
jj = mod(J-1,NB) + 1 + aint(mm)*NB
A(ii,jj) = AA(I,J)
end if
end do
end do
Example Program (page 9)
! reading in and distributing B vector
!
do I = 1, M
J=1
! finding out which PE gets this I,J element
Cr = real( (I-1)/MB )
h = RSRC+aint(Cr)
pr = mod( h,nprow)
Cc = real( (J-1)/NB )
g = CSRC+aint(Cc)
pc = mod(g,npcol)
! check if on this PE and then set B
if (myrow ==pr .and. mycol==pc)then
ll = real( ( (I-1)/(nprow*MB) ) )
mm = real( ( (J-1)/(npcol*NB) ) )
jj = mod(J-1,NB) + 1 + aint(mm)*NB
ii = mod(I-1,MB) + 1 + aint(ll)*MB
B(ii,jj) = BB(I,J)
end if
end do
end subroutine
PxGEMM (Mflops)

  Machine         Grid    N=2000    4000    6000    8000   10000
  Cray T3E        2x2       1055    1070    1075
                  4x4       3630    4005    4258    4171    4292
                  8x8      13456   14287   15419   15858   16755
  IBM SP2         2x2        755
                  4x4       2514    2850    3040
                  8x8       6205    8709    9861   10468   10774
  NOW Berkeley    2x2        463     470     632
                  4x4        926    1031    1754
                  8x8       4130    5457    6041    6360    6647
PxGESV (Mflops)

  Machine         Grid    N=2000    5000    7500   10000   15000
  Cray T3E        1x4        702     884     932
                  2x8       1508    2680    3218    3356    3602
                  4x16      2419    6912    9028   10299   12547
  IBM SP2         1x4        421     603
                  2x8        722    1543    1903    2149
                  4x16       721    2084    2837    3344    3963
  NOW Berkeley    1x4        350
                  2x8        811    1310    1472    1547
                  4x16      1171    3091    3937    4263    4560