p3pack - PERMUTATION LIBRARY FOR MASPAR
---------------------------------------
Version 1.0, March 29-1993
--------------------------

Author:
------
Hans Munthe-Kaas, Department of computer science,
University of Bergen, N-5020 Bergen.
Email: hans@ii.uib.no

Conditions for use:
------------------
*************************************************************************
*                                                                       *
* For non-commercial use and research, the codes are available free of  *
* charge. The author would appreciate hearing from users of the code.   *
* It is not allowed to use the code in commercial software without      *
* written permission from the author.                                   *
* The author's name should never be removed from the source code.       *
*                                                                       *
*************************************************************************

Reference:
---------
Munthe-Kaas H.: "Practical Parallel Permutation Procedures", to appear.
Preprint available from the author.

Purpose:
-------
The library is a collection of useful permutation procedures for MasPar.
P3 is an abbreviation for: "Parallel Permutations Procedures".

The current version of the library is based on so-called
bit-linear-permutations. These can be used for such things as:
   - bit reversal
   - matrix transpositions
   - perfect shuffles
   - bit conjugations
   - matrix reshaping
   - flipping vectors and matrices
   - exchanging memory and processor bits

All permutations are in-place, i.e. no extra work space is needed.
The permutations are routed nearly optimally, i.e. after some time to
compute a routing, the permutation executes at (close to) the highest
possible speed. The computation of the routing is not very expensive for
the permutations in the present version of p3pack, but if the permutation
is to be performed fewer than about 3-5 times, the routing time will
dominate.

It is possible to compute symbolically the product and inverses of
permutations, i.e. if two permutations are to be executed consecutively,
they can be symbolically merged into one, and executed at the same speed
as one.
For some problems it is not necessary to perform the routings, but rather
to compute destination addresses for given permutations. Routines for
this are supplied.

It is the intention to extend the library with other types of
permutations in the future, such as general permutations, upper- and
lower-triangular admissible permutations and permutations connected with
Abelian groups and rings (i.e. grid-shifts and grid-dilations). The
library will perhaps be moved to other hardware platforms, and thus serve
as a tool for writing portable data-parallel code.

The current version of the library is callable from MasPar Fortran and
MPL.

Limitations:
-----------
The current version of the library is only suited for permutations and
reshapes of arrays where all dimensions are powers of 2.

Definitions:
-----------

MasPar memory space:
-------------------
In this library the memory occupied by a plural array is regarded as a
contiguous address space of k+lnproc bits, where 2^k is the number of
data items per processor, and 2^lnproc is the number of processors. The
bits are always ranked in the following order:

      (m_k-1, m_k-2, ..., m_0, p_lnproc-1, ..., p_0)
            memory bits          processor bits

where the Most Significant Bit is to the left. The address may also be
thought of as a binary column vector of dimension k+lnproc, with the most
significant bit on the bottom:

      | p_0        |
      |  .         |
      |  .         |
      | p_lnproc-1 |
      | m_0        |
      |  .         |
      |  .         |
      | m_k-1      |

When computing destinations, addresses are represented by unsigned
integers (i.e. 32 bits).

Bit--Permutation (btprm):
------------------------
A bit-permutation is a permutation moving data items from address
      (g_r-1, g_r-2, ..., g_0)
to an address
      (s(g_r-1), s(g_r-2), ..., s(g_0))
where s() is a permutation of the bit-order. A bit permutation is
specified by listing the destination of each bit, e.g.:
      (g_3, g_2, g_1, g_0) -> (g_2, g_1, g_0, g_3)
is specified by an integer vector:
      btp[0] = 1; btp[1] = 2; btp[2] = 3; btp[3] = 0;
(This is called a perfect shuffle).
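
The btp vector convention above (btp[i] is the destination position of
source bit i) can be pictured with a small sketch in plain C rather than
MPL, so that it runs on any host. The function name apply_btprm is
hypothetical and is not part of p3pack; it only illustrates the
definition.

```c
#include <assert.h>

/* Hypothetical helper, not part of p3pack: bit i of src is moved to
   bit position btp[i] of the result, for i = 0 .. nbits-1. */
unsigned apply_btprm(const int *btp, int nbits, unsigned src)
{
    unsigned dst = 0;
    int i;

    for (i = 0; i < nbits; i++) {
        assert(btp[i] >= 0 && btp[i] < nbits);  /* must be a valid bit index */
        dst |= ((src >> i) & 1u) << btp[i];
    }
    return dst;
}
```

With btp = {1, 2, 3, 0} (the perfect shuffle above), the source address
0110 (binary) maps to 1100 (binary), in agreement with
(g_3, g_2, g_1, g_0) -> (g_2, g_1, g_0, g_3).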
Many important permutations are bit-permutations, e.g. matrix
transposition:
      (g_3, g_2, g_1, g_0) -> (g_1, g_0, g_3, g_2)
and bit reversal:
      (g_3, g_2, g_1, g_0) -> (g_0, g_1, g_2, g_3).
Another example is permutations changing arrays from cut-and-stack to
hierarchical memory mappings.
Example: On a 1024 proc MasPar, declare an array as
      plural int arr[16];
In this case the permutation:
      (g_13,g_12,g_11,g_10,g_9,...,g_0) -> (g_3,g_2,g_1,g_0,g_13,...,g_4)
represents the permutation from cut-and-stack to hierarchical.

Bit conjugation:
---------------
A conjugation of a bit is a permutation changing the value of a bit,
i.e. cjug(0) = 1, cjug(1) = 0. A conjugation of all bits represents the
flipping:
      arr(i) <-> arr(2^k-1-i)

Bit--Linear--Permutation (blp):
------------------------------
A permutation is bit linear if it can be written as:
      Matrix x Addressvector + conjugvector
where matrix-vector products and vector additions are over GF(2), i.e.
additions and multiplications modulo 2. Note that both bit-permutations
and conjugations can be written as blp's, e.g. the perfect shuffle above
can be represented with the product:

      | 0 0 0 1 |   | g_0 |   | 0 |   | g_3 |
      | 1 0 0 0 | x | g_1 | + | 0 | = | g_0 |
      | 0 1 0 0 |   | g_2 |   | 0 |   | g_1 |
      | 0 0 1 0 |   | g_3 |   | 0 |   | g_2 |

And bit conjugations can be specified by the right vector.

An important property of the blp's is that they form a group, i.e. the
composition of two blp's is a new blp, and the inverse of a blp is also
a blp.

Another example of a blp is the so-called binary reflected Gray-code
mapping, which can be written as the product:

      | 1 1 0 0 |   | g_0 |   | 0 |
      | 0 1 1 0 | x | g_1 | + | 0 |
      | 0 0 1 1 |   | g_2 |   | 0 |
      | 0 0 0 1 |   | g_3 |   | 0 |

Datastructures:
--------------
The following datastructures are used in the library:
   BL_PERMUT   : Pointer to a structure containing a bit-linear
                 permutation.
   PERMUT_ROUT : Pointer to a structure containing information about
                 routing of the permutation.
NOTE: Users should access and modify these structs via the routines
provided in the library. The internal format of these structs may change
in the future, and direct access to these structs may cause problems
with future versions of the library. In the current version, these
structs are defined as:

   typedef struct {plural unsigned short *mfch,*psto,*eswp; int nmem;}
                  *PERMUT_ROUT, PERMUT_ROUT_STRUCT;
and
   typedef struct {int nbits; unsigned *pmat;}
                  *BL_PERMUT, BL_PERMUT_STRUCT;

Usage:
-----
The routines and the datastructures are declared in the include file
"p3pack.h" found in the 'include' directory. After running "make" the
compiled library is found in "libp3pack.a" in the 'lib' directory.

Synopsis:
   #include "p3pack.h"

   mpl sourceCode -lp3pack

Description of the routines:
---------------------------
Here follows a description of the routines as seen from MPL. The Fortran
callable routines are described later.

The routines are in 3 categories.
1) Routines working symbolically on bit-linear-permutations (blp).
   These are generally fast, and work on singular objects in the ACU.
2) Routines creating routing information (permRout) from blp.
   Routing information is stored in the DPU as plural data. These
   routines are the slowest routines, and should be used as seldom as
   possible, i.e. save and re-use routing information whenever possible.
3) Routines applying permutation information on a dataset to accomplish
   a permutation. These routines are as fast as the hardware allows.

Routines of category 1:
----------------------
BL_PERMUT blp_identNew(int nbits);
   Purpose: Create the identity btprm.
   Inputs:  nbits : Size of matrix.
   Output:  Creates a new identity blp and returns a pointer to it.

BL_PERMUT blp_btprmNew(int *btprm, int nbits);
   Purpose: Create a new blp from a btprm.
   Inputs:  btprm : integer vector defining the bit-permutation.
            nbits : number of bits in the vector.
   Output:  Creates a new blp and returns a pointer to it.
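
The relation between a btprm vector and the matrix form of a blp can be
pictured with the following plain C sketch. It is a host-side model under
assumed conventions, not the internal format used by p3pack: each matrix
row is stored as a bit mask, and the names btprm_to_rows and gf2_apply
are hypothetical.

```c
#include <assert.h>

/* Hypothetical host-side model, not the p3pack internals:
   row[i] holds row i of the GF(2) matrix as a bit mask, i.e. bit j of
   row[i] is the matrix entry in row i, column j. */

/* Build the matrix of a bit-permutation: source bit i goes to
   destination bit btp[i], so row btp[i] has a single 1 in column i. */
void btprm_to_rows(const int *btp, int nbits, unsigned *row)
{
    int i;

    for (i = 0; i < nbits; i++)
        row[i] = 0;
    for (i = 0; i < nbits; i++) {
        assert(btp[i] >= 0 && btp[i] < nbits);
        row[btp[i]] |= 1u << i;
    }
}

/* Parity (sum modulo 2) of the bits of x. */
static unsigned parity(unsigned x)
{
    unsigned p = 0;

    while (x) {
        p ^= x & 1u;
        x >>= 1;
    }
    return p;
}

/* Destination address = Matrix x src + conj, all over GF(2). */
unsigned gf2_apply(const unsigned *row, unsigned conj, int nbits,
                   unsigned src)
{
    unsigned dst = 0;
    int i;

    for (i = 0; i < nbits; i++)
        dst |= parity(row[i] & src) << i;
    return dst ^ conj;
}
```

In this model the perfect shuffle btprm {1, 2, 3, 0} gives the rows
{8, 1, 2, 4}, the identity matrix with an all-ones conjugation vector
flips the address range, and the rows {3, 6, 12, 8} reproduce the binary
reflected Gray-code mapping g XOR (g shifted right by one).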

BL_PERMUT blp_iMatNew(int *A, int *b, int nbits);
   Purpose: Create a new blp from an integer matrix and integer vector.
   Inputs:  A     : integer matrix of size nbits x nbits.
                    The matrix should only contain 0's and 1's.
            b     : right hand side vector of size nbits.
                    The vector should only contain 0's and 1's.
            nbits : Size of the matrix and the vector.
   Output:  Creates a new blp and returns a pointer to it.

BL_PERMUT blp_cMatNew(char *A, char *b, int nbits);
   Purpose: Same as blp_iMatNew(). The only difference is that the input
            matrix and vector are of type (char *).

BL_PERMUT blp_copyNew(BL_PERMUT P);
   Purpose: Create a new blp as an exact copy of another blp.
   Input:   blp P.
   Output:  Pointer to a copy of P.

void blp_free(BL_PERMUT P);
   Purpose: Release the memory associated with a blp.
   Input:   P pointer to the struct to be deleted.

void blp_copy(BL_PERMUT A, BL_PERMUT B);
   Purpose: Copy the contents of a blp to an existing blp of the same
            size.
   Input:   B blp to be copied.
   Update:  A is changed to a copy of B.
   Limitations: Error if A and B are of different size, or if A is not
            created before the call to the routine.

void blp_print(BL_PERMUT P);
   Purpose: Print the matrix and right vector in the blp in a nice
            format.
   Input:   P pointer to the blp to be printed.
   Output:  Prints the contents of P to stdout.

void blp_rMult(BL_PERMUT A, BL_PERMUT B);
   Purpose: Compute the product (i.e. the composition) of two blp's.
            If A and B are two blp's then AB is the blp obtained by
            FIRST executing B, and AFTERWARDS executing A.
   Input:   B blp.
   Update:  A blp. A is updated to the new value AB.

void blp_lMult(BL_PERMUT A, BL_PERMUT B);
   Purpose: The same as blp_rMult, but the multiplication is in the
            opposite order, i.e. B is multiplied from LEFT instead of
            right.
   Input:   B blp.
   Update:  A blp. A is updated to the new value BA.

void blp_inv(BL_PERMUT P);
   Purpose: Compute the inverse of a blp.
   Input:   P blp.
   Update:  P blp, P is updated to the inverse permutation.
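
The ordering convention of blp_rMult() (AB means first B, afterwards A)
and the effect of blp_inv() can be illustrated with a small host-side
model in plain C. This is only a sketch under assumed conventions: the
struct Blp, the row-mask storage and the names model_apply, model_rmult
and model_inv are hypothetical and do not describe how p3pack implements
these routines.

```c
#include <assert.h>

#define MAXBITS 32

/* Hypothetical model of a bit-linear permutation x -> M x + c over
   GF(2); not the p3pack BL_PERMUT layout. row[i] is row i of M as a
   bit mask, c is the conjugation vector. */
typedef struct {
    int nbits;
    unsigned row[MAXBITS];
    unsigned c;
} Blp;

static unsigned parity(unsigned x)
{
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Matrix-vector product M x over GF(2). */
static unsigned mat_vec(const Blp *p, unsigned x)
{
    unsigned y = 0;
    int i;
    for (i = 0; i < p->nbits; i++)
        y |= parity(p->row[i] & x) << i;
    return y;
}

unsigned model_apply(const Blp *p, unsigned x)
{
    return mat_vec(p, x) ^ p->c;
}

/* A := AB (first B, afterwards A): A(Bx + b) + a = (AB)x + (Ab + a). */
void model_rmult(Blp *a, const Blp *b)
{
    Blp r = {0};
    int i, k;
    assert(a->nbits == b->nbits && a->nbits <= MAXBITS);
    r.nbits = a->nbits;
    r.c = mat_vec(a, b->c) ^ a->c;
    for (i = 0; i < a->nbits; i++)
        for (k = 0; k < a->nbits; k++)
            if ((a->row[i] >> k) & 1u)
                r.row[i] ^= b->row[k];   /* add row k of B modulo 2 */
    *a = r;
}

/* P := inverse of P, by Gauss-Jordan elimination over GF(2).
   Returns 0 on success, 1 (P unchanged) if the matrix is singular. */
int model_inv(Blp *p)
{
    unsigned m[MAXBITS], inv[MAXBITS], t;
    int n = p->nbits, i, j, r;
    Blp q = {0};
    for (i = 0; i < n; i++) { m[i] = p->row[i]; inv[i] = 1u << i; }
    for (j = 0; j < n; j++) {
        for (r = j; r < n && !((m[r] >> j) & 1u); r++)
            ;                            /* search a pivot in column j */
        if (r == n)
            return 1;
        t = m[j]; m[j] = m[r]; m[r] = t;
        t = inv[j]; inv[j] = inv[r]; inv[r] = t;
        for (i = 0; i < n; i++)
            if (i != j && ((m[i] >> j) & 1u)) {
                m[i] ^= m[j];
                inv[i] ^= inv[j];
            }
    }
    q.nbits = n;
    for (i = 0; i < n; i++)
        q.row[i] = inv[i];
    q.c = mat_vec(&q, p->c);   /* inverse map: y -> M^-1 y + M^-1 c */
    *p = q;
    return 0;
}
```

In this model, applying the product AB to an address gives the same
result as applying B and then A, and applying a blp followed by its
inverse returns the original address. A singular matrix is reported by a
non-zero return value.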

plural unsigned* blp_DestAdMat(BL_PERMUT P);
   Purpose: Compute a matrix containing the destination address of each
            data object if it is permuted by blp P.
   Input:   P blp.
   Output:  The routine creates a plural unsigned matrix, fills it with
            destination addresses, and returns a pointer to it.

unsigned blp_sDestAd(BL_PERMUT P, unsigned srcadr);
   Purpose: Compute the destination address, given the source address.
   Input:   srcadr : source address.
            P      : blp.
   Output:  Returns destination address.

plural unsigned blp_pDestAd(BL_PERMUT P, plural unsigned srcadr);
   Purpose: Plural version of blp_sDestAd.
   Input:   srcadr : source address.
            P      : blp.
   Output:  Returns destination address.

/* inline */ blp_DestAd(dest,P,src)
   Purpose: This is an 'inline' version of the routines blp_sDestAd()
            and blp_pDestAd(). It accepts both singular and plural src
            and dest. It is defined as a macro in the file "p3pack.h".
   Input:   src  : source address of type 'unsigned' or
                   'plural unsigned'.
            P    : blp.
   Output:  dest : destination address of same type as src.
   Note:    This routine is about twice as fast as its cousins above.
            We recommend using this instead of blp_sDestAd() and
            blp_pDestAd().

Routines of category 2:
----------------------
PERMUT_ROUT blp_rout(BL_PERMUT P);
   Purpose: Create routing information from a blp.
   Input:   P blp.
   Output:  Creates a struct containing the routing information, and
            returns a pointer to the struct.
   Limitations: Error message when the input matrix is singular.

PERMUT_ROUT PxBlp_rout(BL_PERMUT P,
      plural unsigned prm(plural unsigned tadr, int mode, void *vp));
   Purpose: Compute the routing of the product of a blp and an
            "admissible lower triangular" permutation. This routine
            will be documented in a later release of the software.

void pmr_free(PERMUT_ROUT R);
   Purpose: Release the memory associated with a permRout.
   Input:   R pointer to the struct to be deleted.

int pmr_blpCheck(BL_PERMUT P, PERMUT_ROUT R);
   Purpose: Check a routing by comparing the result of a permutation
            with computed destination addresses.
   Input:   P : blp, R : permRout.
   Output:  0 if success, 1 if failure.
   Note:    This routine is intended for debugging. It is also useful
            for detecting hardware errors; make a very large
            permutation, and check if it works correctly. Memory or
            router errors should show up.

int pmr_PxBlpCheck(BL_PERMUT P,
      plural unsigned prm(plural unsigned tadr, int mode, void *vp),
      PERMUT_ROUT R);
   Purpose: Same as pmr_blpCheck() for routings produced by
            PxBlp_rout(). Will be documented in a later release.

Routines of category 3:
----------------------
void permut(void *arr, int blksiz, int nblk, int direction, PERMUT_ROUT R);
   Purpose: Execute a permutation of an array arr, where each element in
            the array occupies blksiz*nblk bytes of space.
   Input:   arr       : array to be permuted.
            blksiz    : size of blocks sent by the router in the
                        permutation. This number must be 1, 2, 4 or 8.
            nblk      : the number of blocks in a matrix element.
            direction : +1 for forward permutation. -1 for inverse.
   Note:    Highest speed is achieved when blksiz is as large as
            possible.
   Example: If arr is declared as:
               plural float arr[arrsize];
            then the following calls are equivalent:
               permut(arr,1,4,1,R);
            and
               permut(arr,4,1,1,R);
            although the latter is *considerably* faster. In the latter
            case, the routine assumes that the starting address is
            properly aligned for reading 'plural long' data. Generally
            this should not cause problems. The former call reads
            'plural char' from memory, and is thus valid for all
            alignments.

void permut32(void *arr, int dir, PERMUT_ROUT R);
void permut64(void *arr, int dir, PERMUT_ROUT R);
   Purpose: Special versions of permut for permuting 32bit and 64bit
            data.
   Example: If arr is declared as:
               plural double arr[arrsize];
            then the following calls are equivalent:
               permut(arr,8,1,1,R);
            and
               permut64(arr,1,R);
            The latter is *slightly* faster, but the difference is in
            practice negligible, thus the routines permut32() and
            permut64() can in practice always be replaced by permut().

Calling p3pack from Fortran (HPF)
---------------------------------
Although the p3pack subroutines are written in MPL, an interface to
Fortran is provided. A Fortran 90 or High Performance Fortran compiler
is required, i.e. either MasPar Fortran or DECmpp HPF. Some of the MPL
routines described above cannot be called from Fortran simply because
that would not be useful (please contact the author if you disagree).
Since the calling sequences and parameter types are slightly different
from the MPL version, a complete description for Fortran use is included
below.

Compiler directives:
   All p3pack subroutines that are called from Fortran must have been
   declared as MPL subroutines by the compiler directive
      CMPF MPL subroutine1 subroutine2 ...

Compiling and linking:
   mpfortran [options] sourceCode -lp3pack

Description of the routines:
---------------------------
The routines are in 3 categories.
1) Routines working symbolically on bit-linear-permutations (blp).
   These are generally fast, and work on singular objects in the ACU.
2) Routines creating routing information (permRout) from blp.
   Routing information is stored in the DPU as plural data. These
   routines are the slowest routines, and should be used as seldom as
   possible, i.e. save and re-use routing information whenever possible.
3) Routines applying permutation information on a dataset to accomplish
   a permutation. These routines are as fast as the hardware allows.

Routines of category 1:
----------------------
subroutine blp_identNew(blptr, nbits)
integer blptr, nbits
   Purpose: Create the identity btprm.
   Inputs:  nbits : Size of matrix.
   Output:  blptr : a pointer to a new identity blp.

subroutine blp_btprmNew(blptr, btprm, nbits)
integer blptr, nbits
integer btprm(nbits)
   Purpose: Create a new blp from a btprm.
   Inputs:  btprm : integer vector defining the bit-permutation.
            nbits : number of bits in the vector.
   Output:  blptr : a pointer to a new blp.
   Note:    The vector btprm *must* be stored on the front end. You may
            want to use the compiler directive
               CMPF ONFE btprm
            to ensure this.

subroutine blp_iMatNew(blptr, A, b, nbits)
integer blptr, nbits
integer A(nbits,nbits), b(nbits)
   Purpose: Create a new blp from an integer matrix and integer vector.
   Inputs:  A     : integer matrix of size nbits x nbits.
                    The matrix should only contain 0's and 1's.
            b     : right hand side vector of size nbits.
                    The vector should only contain 0's and 1's.
            nbits : Size of the matrix and the vector.
   Output:  blptr : a pointer to a new blp.
   Note:    The arrays A and b *must* be stored on the front end. You
            may want to use the compiler directive
               CMPF ONFE A, b
            to ensure this.

subroutine blp_copyNew(blptr, p)
integer blptr, p
   Purpose: Create a new blp as an exact copy of another blp.
   Input:   p     : pointer to the blp to be copied.
   Output:  blptr : pointer to the copy.

subroutine blp_free(p)
integer p
   Purpose: Release the memory associated with a blp.
   Input:   p : pointer to the struct to be deleted.

subroutine blp_copy(A, B)
integer A, B
   Purpose: Copy the contents of a blp to an existing blp of the same
            size.
   Input:   B : blp to be copied.
   Update:  A is changed to a copy of B.
   Limitations: Error if A and B are of different size, or if A is not
            created before the call to the routine.

subroutine blp_print(P)
integer P
   Purpose: Print the matrix and right vector in the blp in a nice
            format.
   Input:   P blp.
   Output:  Prints the contents of P to stdout.

subroutine blp_rMult(A, B)
integer A, B
   Purpose: Compute the product (i.e. the composition) of two blp's.
            If A and B are two blp's then AB is the blp obtained by
            FIRST executing B, and AFTERWARDS executing A.
   Input:   A and B, blp.
   Update:  A blp.
            A is updated to the new value AB.

subroutine blp_lMult(A, B)
integer A, B
   Purpose: The same as blp_rMult, but the multiplication is in the
            opposite order, i.e. B is multiplied from LEFT instead of
            right.
   Input:   A and B, blp.
   Update:  A blp. A is updated to the new value BA.

subroutine blp_inv(P)
integer P
   Purpose: Compute the inverse of a blp.
   Input:   P blp.
   Update:  P blp, P is updated to the inverse permutation.

subroutine blp_sDestAd(dstadr, P, srcadr)
integer P
integer dstadr, srcadr
   Purpose: Compute the destination address, given the source address.
   Input:   srcadr : source address.
            P      : blp.
   Output:  dstadr : destination address.

subroutine blp_pDestAd(dstadr, P, srcadr)
integer P
integer dstadr(nproc), srcadr(nproc)
   Purpose: Compute a set of destination addresses, given the source
            addresses.
   Input:   srcadr : source addresses.
            P      : blp.
   Output:  dstadr : destination addresses.
   Note:    The addresses dstadr and srcadr are arrays that *must* be
            stored on the DPU. You may want to use the compiler
            directive
               CMPF ONDPU dstadr, srcadr
            to ensure this.

Routines of category 2:
----------------------
subroutine blp_rout(R, A)
integer R, A
   Purpose: Create routing information from a blp.
   Input:   A blp.
   Output:  Creates a struct containing the routing information, and
            returns R, a pointer to the struct.
   Limitations: Error message when the input matrix is singular.

subroutine pmr_free(R)
integer R
   Purpose: Release the memory associated with a permRout.
   Input:   R pointer to the struct to be deleted.

subroutine pmr_blpCheck(err, P, R)
   Purpose: Check a routing by comparing the result of a permutation
            with computed destination addresses.
   Input:   P : blp, R : permRout.
   Output:  err : 0 if success, 1 if failure.
   Note:    This routine is intended for debugging. It is also useful
            for detecting hardware errors; make a very large
            permutation, and check if it works correctly. Memory or
            router errors should show up.

Routines of category 3:
----------------------
subroutine permut(arr, blksiz, nblk, dir, R)
integer blksiz, nblk, dir, R
<any type> arr(*)
   Purpose: Execute a permutation of an array arr, where each element in
            the array occupies blksiz*nblk bytes of space.
   Input:   arr    : array to be permuted.
            blksiz : size of blocks sent by the router in the
                     permutation. This number must be 1, 2, 4 or 8.
            nblk   : the number of blocks in a matrix element.
            dir    : +1 for forward permutation. -1 for inverse.
   Note:    The array arr *must* be stored on the DPU. You may want to
            use the compiler directive
               CMPF ONDPU arr
            to ensure this.
   Comment: Highest speed is achieved when blksiz is as large as
            possible.
   Example: If arr is declared as:
               real arr(arrsize)
            then the following calls are equivalent:
               call permut(arr,1,4,1,R)
            and
               call permut(arr,4,1,1,R)
            although the latter is *considerably* faster. In the latter
            case, the routine assumes that the starting address is
            properly aligned for reading 'plural long' data. Generally
            this should not cause problems. The former call reads
            'plural char' from memory, and is thus valid for all
            alignments.

subroutine permut32(arr, dir, R)
integer dir, R
<any type> arr(*)

subroutine permut64(arr, dir, R)
integer dir, R
<any type> arr(*)
   Purpose: Special versions of permut for permuting 32bit and 64bit
            data.
   Example: If arr is declared as:
               double precision arr(arrsize)
            then the following calls are equivalent:
               call permut(arr,8,1,1,R)
            and
               call permut64(arr,1,R)
            The latter is *slightly* faster, but the difference is in
            practice negligible, thus the routines permut32() and
            permut64() can in practice always be replaced by permut().
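
The meaning of the forward and inverse directions of permut() can be
pictured with a small host-side model in plain C. This is a sketch only,
resting on two assumptions that should be checked against the library:
that a forward permutation moves the element at address a to its
destination address (as in the definitions earlier in this document),
and that the inverse direction undoes exactly that move. The name
model_permut and the dest table are hypothetical, and unlike p3pack the
sketch uses a temporary buffer instead of working in-place.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical host-side model of the permut() semantics; not p3pack
   code. dest[a] is the destination address of the element at address a.
     dir = +1 : the element at a moves to dest[a].
     dir = -1 : the element at dest[a] moves back to a.
   Returns 0 on success, 1 on a bad argument or allocation failure. */
int model_permut(void *arr, size_t elsize, size_t n,
                 const unsigned *dest, int dir)
{
    unsigned char *a = arr;
    unsigned char *tmp;
    size_t i;

    if (dir != 1 && dir != -1)
        return 1;
    if (n == 0 || elsize == 0)
        return 0;                      /* nothing to move */
    tmp = malloc(n * elsize);
    if (tmp == NULL)
        return 1;
    for (i = 0; i < n; i++) {
        size_t from = (dir == 1) ? i : dest[i];
        size_t to   = (dir == 1) ? dest[i] : i;
        memcpy(tmp + to * elsize, a + from * elsize, elsize);
    }
    memcpy(a, tmp, n * elsize);        /* copy the permuted data back */
    free(tmp);
    return 0;
}
```

For a 3-bit cyclic bit shift, dest = {0, 2, 4, 6, 1, 3, 5, 7}. A forward
call followed by an inverse call with the same table leaves the array
unchanged in this model.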