p3pack

p3pack - PERMUTATION LIBRARY FOR MASPAR
---------------------------------------
Version 1.0, March 29-1993
--------------------------
Author:
-------
Hans Munthe-Kaas,
Department of computer science, University of Bergen, N-5020 Bergen.
Email: hans@ii.uib.no
Conditions for use:
-------------------
**************************************************************************
* For non-commercial use and research, the codes are available free of   *
* charge. The author would appreciate hearing from users of the code.    *
* It is not allowed to use the code in commercial software without       *
* written permission from the author.                                    *
* The author's name should never be removed from the source code.        *
**************************************************************************
Reference:
----------
Munthe-Kaas H.: "Practical Parallel Permutation Procedures", to appear.
Preprint available from the author.
Purpose:
--------
The library is a collection of useful permutation procedures for MasPar.
P3 is an abbreviation for "Parallel Permutation Procedures".
The current version of the library is based on so-called
bit-linear-permutations. These can be used for such things as:
- bit reversal
- matrix transpositions
- perfect shuffles
- bit conjugations
- matrix reshaping
- flipping vectors and matrices
- exchanging memory and processor bits
All permutations are in-place, i.e. no extra work space is needed. The
permutations are routed nearly optimally, i.e. after some time spent
computing a routing, the permutation executes at (close to) the highest
possible speed. Computing the routing is not very expensive for the
permutations in the present version of p3pack, but if the permutation
is to be performed fewer than about 3-5 times, the routing time will
dominate.
It is possible to compute products and inverses of permutations
symbolically, i.e. if two permutations are to be executed
consecutively, they can be symbolically merged into one and executed at
the same speed as one.
For some problems it is not necessary to perform the routings, but
rather to compute destination addresses for given permutations.
Routines for this are supplied.
The intention is to extend the library with other types of permutations
in the future, such as general permutations, upper- and
lower-triangular admissible permutations, and permutations connected
with Abelian groups and rings (i.e. grid-shifts and grid-dilations).
The library may be moved to other hardware platforms, and thus serve
as a tool for writing portable data-parallel code.
The current version of the library is callable from MasPar Fortran and
MPL.
Limitations:
------------
The current version of the library is only suited for permutations and
reshapes of arrays where all dimensions are powers of 2.
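Before building a permutation it can be useful to check this restriction on the host side. The following is a minimal sketch in plain C, not part of p3pack: the helper names `log2_exact` and `total_address_bits` are hypothetical, and the second one simply adds up the bit counts of all dimensions, which under the power-of-2 assumption gives the number of address bits of the whole array.

```c
#include <assert.h>

/* Hypothetical helper (not part of p3pack): returns log2(n) when n is a
 * power of two, and -1 otherwise. */
int log2_exact(unsigned n)
{
    int bits = 0;
    if (n == 0 || (n & (n - 1u)) != 0)
        return -1;                 /* zero, or more than one bit set */
    while (n > 1u) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Sum of log2 over all array dimensions, i.e. the number of address bits
 * needed to index the whole array; -1 if a dimension is not a power of 2. */
int total_address_bits(const unsigned *dims, int ndims)
{
    int i, total = 0;
    assert(dims != 0 && ndims > 0);
    for (i = 0; i < ndims; i++) {
        int b = log2_exact(dims[i]);
        if (b < 0)
            return -1;
        total += b;
    }
    return total;
}
```

For a 16 x 1024 array the second helper returns 14; for 16 x 1000 it returns -1.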
Definitions:
------------
MasPar memory space:
--------------------
In this library the memory occupied by a plural array is regarded as a
contiguous address space of k+lnproc bits, where 2^k is the number of
data items per processor, and 2^lnproc is the number of processors. The
bits are always ranked in the following order:
    (m_k-1, m_k-2, ..., m_0, p_lnproc-1, ..., p_0)
     |---- memory bits ----|  |- processor bits -|
where the Most Significant Bit is to the left. The address may
also be thought of as a binary column vector of dimension k+lnproc,
with the most significant bit on the bottom:
    |    p_0     |
    |     .      |
    |     .      |
    | p_lnproc-1 |
    |    m_0     |
    |     .      |
    |     .      |
    |   m_k-1    |
When computing destinations, addresses are represented by
unsigned integers (i.e. 32 bits).
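The address layout above can be illustrated with a small host-side sketch in plain C. The helper names `make_address`, `address_proc` and `address_mem` are hypothetical and do not exist in p3pack; the sketch only assumes what the text states, namely that the processor bits are the low-order bits and the memory bits the high-order bits of an unsigned integer.

```c
#include <assert.h>

/* Hypothetical helpers, not part of p3pack.
 * mem    : index of the data item within one processor (memory bits)
 * proc   : processor number (processor bits)
 * lnproc : log2 of the number of processors */
unsigned make_address(unsigned mem, unsigned proc, int lnproc)
{
    assert(lnproc >= 0 && lnproc < 32);
    assert(proc < (1u << lnproc));
    return (mem << lnproc) | proc;      /* memory bits are most significant */
}

/* Extract the processor number from an address. */
unsigned address_proc(unsigned addr, int lnproc)
{
    assert(lnproc >= 0 && lnproc < 32);
    return addr & ((1u << lnproc) - 1u);
}

/* Extract the per-processor memory index from an address. */
unsigned address_mem(unsigned addr, int lnproc)
{
    assert(lnproc >= 0 && lnproc < 32);
    return addr >> lnproc;
}
```

With 1024 processors (lnproc = 10), item 3 on processor 5 gets address 3*1024 + 5 = 3077.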
Bit--Permutation (btprm):
-------------------------
A bit-permutation is a permutation moving data items from address
(g_r-1, g_r-2, ..., g_0)
to an address
(s(g_r-1), s(g_r-2), ..., s(g_0))
where s() is a permutation of the bit-order. A bit permutation is
specified by listing the destination of each bit, e.g.:
(g_3, g_2, g_1, g_0) -> (g_2, g_1, g_0, g_3)
is specified by an integer vector:
btp[0] = 1; btp[1] = 2; btp[2] = 3; btp[3] = 0;
(This is called a perfect shuffle).
Many important permutations are bit-permutations,
e.g. matrix transposition:
(g_3, g_2, g_1, g_0) -> (g_1, g_0, g_3, g_2)
and bit reversal:
(g_3, g_2, g_1, g_0) -> (g_0, g_1, g_2, g_3).
Another example is permutations changing arrays from cut-and-stack to
hierarchical memory mappings.
Example: On a 1024 proc MasPar, declare an array as
plural int arr[16];
In this case the permutation:
(g_13,g_12,g_11,g_10,g_9,...,g_0) ->
(g_3,g_2,g_1,g_0,g_13,...,g_4)
represents the permutation from cut-and-stack to hierarchical.
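The destination-vector convention (btp[i] is the position that bit i moves to) can be checked with a short host-side sketch in plain C. The function `apply_btprm` is hypothetical and is not the library's implementation; it only computes, for one source address, the destination address implied by the definition above.

```c
#include <assert.h>

/* Hypothetical reference routine, not part of p3pack: moves bit i of the
 * source address to bit position btp[i] of the destination address. */
unsigned apply_btprm(const int *btp, int nbits, unsigned src)
{
    unsigned dst = 0;
    int i;
    assert(btp != 0 && nbits > 0 && nbits <= 32);
    for (i = 0; i < nbits; i++) {
        assert(btp[i] >= 0 && btp[i] < nbits);
        dst |= ((src >> i) & 1u) << btp[i];
    }
    return dst;
}
```

With btp = {1, 2, 3, 0} (the perfect shuffle) the address 1011 maps to 0111. With btp = {2, 3, 0, 1} (transposition of a 4 x 4 matrix) address 4*1+2 = 6 maps to 4*2+1 = 9. With btp = {3, 2, 1, 0} (bit reversal) address 0011 maps to 1100.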
Bit conjugation:
----------------
A conjugation of a bit is a permutation changing the value of a
bit, i.e. cjug(0) = 1, cjug(1) = 0. A conjugation of all bits
represents the flipping:
arr(i) <-> arr(2^k-1-i)
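As a small illustration, conjugating every address bit is the same as XOR-ing the address with an all-ones mask, which reverses the order of the elements. The sketch below is plain C with a hypothetical helper name (`conjugate_all`), and assumes 32-bit unsigned integers as stated for addresses above; nbits denotes the total number of address bits.

```c
#include <assert.h>

/* Hypothetical helper, not part of p3pack: conjugates (flips) all nbits
 * address bits, so that address i maps to 2^nbits - 1 - i. */
unsigned conjugate_all(unsigned addr, int nbits)
{
    unsigned mask;
    assert(nbits > 0 && nbits <= 32);
    mask = (nbits == 32) ? 0xFFFFFFFFu : ((1u << nbits) - 1u);
    assert((addr & ~mask) == 0);        /* address must fit in nbits */
    return addr ^ mask;
}
```

For nbits = 4, address 5 (0101) maps to 10 (1010) = 15 - 5.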
Bit--Linear--Permutation (blp):
-------------------------------
A permutation is bit-linear if it can be written as:
    Matrix x Addressvector + conjugvector
where matrix-vector products and vector additions are over GF(2),
i.e. additions and multiplications modulo 2.
Note that both bit-permutations and conjugations can be written as
blp's, e.g. the perfect shuffle above can be represented with the
product:
    | 0 0 0 1 |   | g_0 |   | 0 |   | g_3 |
    | 1 0 0 0 | x | g_1 | + | 0 | = | g_0 |
    | 0 1 0 0 |   | g_2 |   | 0 |   | g_1 |
    | 0 0 1 0 |   | g_3 |   | 0 |   | g_2 |
Bit conjugations can be specified by the right vector.
An important property of the blp's is that they form a group, i.e.
the composition of two blp's is a new blp, and the inverse of a
blp is also a blp.
Another example of a blp is the so-called binary reflected Gray-code
mapping, which can be written as the product:
    | 1 1 0 0 |   | g_0 |   | 0 |
    | 0 1 1 0 | x | g_1 | + | 0 |
    | 0 0 1 1 |   | g_2 |   | 0 |
    | 0 0 0 1 |   | g_3 |   | 0 |
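To make the GF(2) arithmetic concrete, here is a minimal host-side sketch in plain C that evaluates "Matrix x Addressvector + conjugvector" for one address. It is not the library's implementation, and the storage format is an assumption made for the sketch: row i of the matrix is held in rows[i], with bit j of that word holding the entry in column j. The function names `parity` and `apply_blp` are hypothetical.

```c
#include <assert.h>

/* Sum modulo 2 of the bits of x. */
static unsigned parity(unsigned x)
{
    unsigned p = 0;
    while (x) {
        p ^= 1u;
        x &= x - 1u;                /* clear the lowest set bit */
    }
    return p;
}

/* Hypothetical reference routine: dest = M * src + c over GF(2).
 * rows[i] holds row i of M (bit j = column j), c is the conjugation
 * vector (bit i = component i). */
unsigned apply_blp(const unsigned *rows, unsigned c, int nbits, unsigned src)
{
    unsigned dst = 0;
    int i;
    assert(rows != 0 && nbits > 0 && nbits <= 32);
    for (i = 0; i < nbits; i++)
        dst |= parity(rows[i] & src) << i;   /* row i dotted with src */
    return dst ^ c;                          /* addition modulo 2 */
}
```

In this format the Gray-code matrix above is rows = {0x3, 0x6, 0xC, 0x8}, and the result agrees with the usual expression src ^ (src >> 1). The perfect shuffle matrix is rows = {0x8, 0x1, 0x2, 0x4}.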
Datastructures:
---------------
The following datastructures are used in the library:
BL_PERMUT   : Pointer to a structure containing a bit-linear
              permutation.
PERMUT_ROUT : Pointer to a structure containing information about
              routing of the permutation.
NOTE: Users should access and modify these structs via the routines
provided in the library. The internal format of these structs may
change in the future, and direct access to these structs may cause
problems with future versions of the library.
In the current version, these structs are defined as:
typedef struct {plural unsigned short *mfch,*psto,*eswp; int nmem;}
*PERMUT_ROUT, PERMUT_ROUT_STRUCT;
and
typedef struct {int nbits; unsigned *pmat;} *BL_PERMUT,
BL_PERMUT_STRUCT;
Usage:
------
The routines and the datastructures are declared in the include file
"p3pack.h" found in the 'include' directory. After running "make" the
compiled library is found in "libp3pack.a" in the 'lib' directory.
Synopsis:
    #include "p3pack.h"
    mpl sourceCode -lp3pack
Description of the routines:
----------------------------
Here follows a description of the routines as seen from MPL. The
Fortran callable routines are described later.
The routines are in 3 categories.
1) Routines working symbolically on bit-linear-permutations (blp).
   These are generally fast, and work on singular objects in the ACU.
2) Routines creating routing information (permRout) from blp. Routing
   information is stored in the DPU as plural data. These routines
   are the slowest routines, and should be used as seldom as possible,
   i.e. save and re-use routing information whenever possible.
3) Routines applying permutation information on a dataset to
   accomplish a permutation. These routines are as fast as the
   hardware allows.
Routines of category 1:
-----------------------
BL_PERMUT
blp_identNew(int nbits);
Purpose: Create the identity blp.
Inputs:  nbits : Size of matrix.
Output:  Creates a new identity blp and returns a pointer to it.

BL_PERMUT
blp_btprmNew(int *btprm, int nbits);
Purpose: Create a new blp from a btprm.
Inputs:  btprm : integer vector defining the bit-permutation.
         nbits : number of bits in the vector.
Output:  Creates a new blp and returns a pointer to it.

BL_PERMUT
blp_iMatNew(int *A, int *b, int nbits);
Purpose: Create a new blp from an integer matrix and integer vector.
Inputs:  A     : integer matrix of size nbits x nbits. The matrix should
                 only contain 0's and 1's.
         b     : right hand side vector of size nbits. The vector should
                 only contain 0's and 1's.
         nbits : Size of the matrix and the vector.
Output:  Creates a new blp and returns a pointer to it.

BL_PERMUT
blp_cMatNew(char *A, char *b, int nbits);
Purpose: Same as blp_iMatNew(). The only difference is that the input
         matrix and vector are of type (char *).

BL_PERMUT
blp_copyNew(BL_PERMUT P);
Purpose: Create a new blp as an exact copy of another blp.
Input:   P : blp to be copied.
Output:  Pointer to a copy of P.

void
blp_free(BL_PERMUT P);
Purpose: Release the memory associated with a blp.
Input:   P : pointer to the struct to be deleted.

void
blp_copy(BL_PERMUT A, BL_PERMUT B);
Purpose: Copy the contents of a blp to an existing blp of the same
         size.
Input:   B : blp to be copied.
Update:  A is changed to a copy of B.
Limitations: Error if A and B are of different size, or if A is not
         created before the call to the routine.

void
blp_print(BL_PERMUT P);
Purpose: Print the matrix and right vector in the blp in a nice
         format.
Input:   P : pointer to the blp to be printed.
Output:  Prints the contents of P to stdout.

void
blp_rMult(BL_PERMUT A, BL_PERMUT B);
Purpose: Compute the product (i.e. the composition) of two blp's.
         If A and B are two blp's then AB is the blp obtained by
         FIRST executing B, and AFTERWARDS executing A.
Input:   B : blp.
Update:  A : blp. A is updated to the new value AB.

void
blp_lMult(BL_PERMUT A, BL_PERMUT B);
Purpose: The same as blp_rMult, but the multiplication is in the
         opposite order, i.e. B is multiplied from the LEFT instead
         of the right.
Input:   B : blp.
Update:  A : blp. A is updated to the new value BA.
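The symbolic product can be written out explicitly: if A maps x to A x + a and B maps x to B x + b, then "first B, then A" maps x to (A B) x + (A b + a), all over GF(2). The following host-side sketch in plain C illustrates this for the AB case. It is not the library's code: the type `Gf2Map`, the row-bitmask storage (bit j of rows[i] is the entry in row i, column j) and the function names are assumptions made for the illustration.

```c
#include <assert.h>

#define GF2_MAXBITS 32

/* Hypothetical stand-in for a blp: x -> rows * x + c over GF(2). */
typedef struct {
    int nbits;
    unsigned rows[GF2_MAXBITS];   /* rows[i] bit j = matrix entry (i, j) */
    unsigned c;                   /* conjugation vector */
} Gf2Map;

/* Sum modulo 2 of the bits of x. */
static unsigned parity(unsigned x)
{
    unsigned p = 0;
    while (x) {
        p ^= 1u;
        x &= x - 1u;
    }
    return p;
}

/* Destination address of src under the map p. */
unsigned gf2_apply(const Gf2Map *p, unsigned src)
{
    unsigned dst = 0;
    int i;
    for (i = 0; i < p->nbits; i++)
        dst |= parity(p->rows[i] & src) << i;
    return dst ^ p->c;
}

/* a <- a * b, i.e. the map "first b, then a". */
void gf2_compose(Gf2Map *a, const Gf2Map *b)
{
    unsigned rows[GF2_MAXBITS];
    unsigned c = a->c;
    int i, j, n = a->nbits;
    assert(n == b->nbits && n > 0 && n <= GF2_MAXBITS);
    for (i = 0; i < n; i++) {
        rows[i] = 0;
        for (j = 0; j < n; j++)
            if ((a->rows[i] >> j) & 1u)
                rows[i] ^= b->rows[j];            /* row i of the product */
        c ^= parity(a->rows[i] & b->c) << i;      /* component i of A*b */
    }
    for (i = 0; i < n; i++)
        a->rows[i] = rows[i];
    a->c = c;
}
```

Applying the composed map once gives the same destination as applying b and then a, which is why two consecutive permutations can be routed as one. The BA case follows by swapping the roles of the two arguments.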
void
blp_inv(BL_PERMUT P);
Purpose: Compute the inverse of a blp.
Input:   P : blp.
Update:  P : blp. P is updated to the inverse permutation.
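One standard way to invert a map of the form y = A x + c over GF(2) is Gauss-Jordan elimination: the inverse is x = A^-1 y + A^-1 c, and it exists only if the matrix is non-singular. The sketch below shows that technique in plain C on the host. Whether p3pack uses the same method is not stated in this document; the type `Gf2Map`, its row-bitmask storage and the function names are hypothetical.

```c
#include <assert.h>

#define GF2_MAXBITS 32

/* Hypothetical stand-in for a blp: x -> rows * x + c over GF(2). */
typedef struct {
    int nbits;
    unsigned rows[GF2_MAXBITS];   /* rows[i] bit j = matrix entry (i, j) */
    unsigned c;
} Gf2Map;

static unsigned parity(unsigned x)
{
    unsigned p = 0;
    while (x) {
        p ^= 1u;
        x &= x - 1u;
    }
    return p;
}

unsigned gf2_apply(const Gf2Map *p, unsigned src)
{
    unsigned dst = 0;
    int i;
    for (i = 0; i < p->nbits; i++)
        dst |= parity(p->rows[i] & src) << i;
    return dst ^ p->c;
}

/* Replaces p by its inverse. Returns 0 on success, or -1 (leaving p
 * unchanged) when the matrix is singular. */
int gf2_invert(Gf2Map *p)
{
    unsigned a[GF2_MAXBITS], inv[GF2_MAXBITS], c = 0, t;
    int n = p->nbits, i, col, r;
    assert(n > 0 && n <= GF2_MAXBITS);
    for (i = 0; i < n; i++) {
        a[i] = p->rows[i];
        inv[i] = 1u << i;                     /* start from the identity */
    }
    for (col = 0; col < n; col++) {
        for (r = col; r < n && !((a[r] >> col) & 1u); r++)
            ;                                 /* search for a pivot row */
        if (r == n)
            return -1;                        /* no pivot: singular */
        t = a[r];   a[r] = a[col];     a[col] = t;
        t = inv[r]; inv[r] = inv[col]; inv[col] = t;
        for (i = 0; i < n; i++)
            if (i != col && ((a[i] >> col) & 1u)) {
                a[i] ^= a[col];               /* eliminate column col */
                inv[i] ^= inv[col];
            }
    }
    for (i = 0; i < n; i++)
        c |= parity(inv[i] & p->c) << i;      /* new vector is A^-1 * c */
    for (i = 0; i < n; i++)
        p->rows[i] = inv[i];
    p->c = c;
    return 0;
}
```

Applying a map and then its inverse returns every address to itself. A matrix with two identical rows is reported as singular.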
plural unsigned*
blp_DestAdMat(BL_PERMUT P);
Purpose: Compute a matrix containing the destination address of each
         data object if it is permuted by blp P.
Input:   P : blp.
Output:  The routine creates a plural unsigned matrix, fills it with
         destination addresses, and returns a pointer to it.

unsigned
blp_sDestAd(BL_PERMUT P, unsigned srcadr);
Purpose: Compute the destination address, given the source address.
Input:   srcadr : source address.
         P      : blp.
Output:  Returns destination address.

plural unsigned
blp_pDestAd(BL_PERMUT P, plural unsigned srcadr);
Purpose: Plural version of blp_sDestAd.
Input:   srcadr : source address.
         P      : blp.
Output:  Returns destination address.

/* inline */
blp_DestAd(dest,P,src)
Purpose: This is an 'inline' version of the routines blp_sDestAd() and
         blp_pDestAd(). It accepts both singular and plural src and
         dest. It is defined as a macro in the file "p3pack.h".
Input:   src  : source address of type 'unsigned' or 'plural unsigned'.
         P    : blp.
Output:  dest : destination address of same type as src.
Note:    This routine is about twice as fast as its cousins above.
         We recommend using this instead of blp_sDestAd() and
         blp_pDestAd().
Routines of category 2:
-----------------------
PERMUT_ROUT
blp_rout(BL_PERMUT P);
Purpose: Create routing information from a blp.
Input:   P : blp.
Output:  Creates a struct containing the routing information, and
         returns a pointer to the struct.
Limitations: Error message when the input matrix is singular.

PERMUT_ROUT
PxBlp_rout(BL_PERMUT P,
           plural unsigned prm(plural unsigned tadr, int mode,
                               void *vp));
Purpose: Compute the routing of the product of a blp and an
         "admissible lower triangular" permutation. This routine will
         be documented in a later release of the software.

void
pmr_free(PERMUT_ROUT R);
Purpose: Release the memory associated with a permRout.
Input:   R : pointer to the struct to be deleted.

int
pmr_blpCheck(BL_PERMUT P, PERMUT_ROUT R);
Purpose: Check a routing by comparing the result of a permutation
         with the computed destination addresses.
Input:   P : blp, R : permRout.
Output:  0 if success, 1 if failure.
Note:    This routine is intended for debugging. It is also
         useful for detecting hardware errors: make a very large
         permutation, and check that it works correctly. Memory or
         router errors should show up.

int
pmr_PxBlpCheck(BL_PERMUT P,
               plural unsigned prm(plural unsigned tadr, int mode,
                                   void *vp),
               PERMUT_ROUT R);
Purpose: Same as pmr_blpCheck() for routings produced by
         PxBlp_rout(). Will be documented in a later release.
Routines of category 3:
-----------------------
void
permut(void *arr, int blksiz, int nblk, int direction, PERMUT_ROUT R);
Purpose: Execute a permutation of an array arr, where each element in
         the array occupies blksiz*nblk bytes of space.
Input:   arr       : array to be permuted.
         blksiz    : size of blocks sent by the router in the
                     permutation. This number must be 1, 2, 4 or 8.
         nblk      : the number of blocks in a matrix element.
         direction : +1 for forward permutation, -1 for inverse.
Note:    Highest speed is achieved when blksiz is as large as
         possible.
Example: If arr is declared as:
             plural float arr[arrsize];
         then the following calls are equivalent:
             permut(arr,1,4,1,R);
         and
             permut(arr,4,1,1,R);
         although the latter is *considerably* faster.
         In the latter case, the routine assumes that the starting
         address is properly aligned for reading 'plural long' data.
         Generally this should not cause problems. The former call
         reads 'plural char' from memory, and is thus valid for all
         alignments.
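The meaning of the direction argument can be illustrated with a sequential reference model in plain C. This is only a sketch for checking results on a host: it is not how permut() works (the library permutes in place through the router, whereas this model uses a temporary buffer), and the names `permute_reference`, `DestFn` and `bit_reverse4` are hypothetical. It assumes, following the definition of a permutation given earlier, that the forward direction moves the item at address a to address dest(a) and the inverse direction moves it back.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical callback: destination address of a source address. */
typedef unsigned (*DestFn)(unsigned src);

/* Example destination function: bit reversal on 4 address bits. */
unsigned bit_reverse4(unsigned src)
{
    unsigned dst = 0;
    int i;
    for (i = 0; i < 4; i++)
        dst |= ((src >> i) & 1u) << (3 - i);
    return dst;
}

/* Sequential reference model of a permutation of n elements of elsize
 * bytes each. direction = +1 moves element a to dest(a); direction = -1
 * moves element dest(a) back to a. Returns 0 on success, -1 if the
 * temporary buffer cannot be allocated. */
int permute_reference(void *arr, size_t elsize, unsigned n,
                      int direction, DestFn dest)
{
    unsigned char *a = (unsigned char *)arr;
    unsigned char *tmp;
    unsigned i;
    assert(arr != 0 && elsize > 0 && n > 0);
    assert(direction == 1 || direction == -1);
    tmp = (unsigned char *)malloc(elsize * n);
    if (tmp == 0)
        return -1;
    for (i = 0; i < n; i++) {
        unsigned d = dest(i);
        assert(d < n);
        if (direction == 1)
            memcpy(tmp + (size_t)d * elsize, a + (size_t)i * elsize, elsize);
        else
            memcpy(tmp + (size_t)i * elsize, a + (size_t)d * elsize, elsize);
    }
    memcpy(a, tmp, elsize * n);
    free(tmp);
    return 0;
}
```

A forward call followed by an inverse call with the same destination function restores the original array, independently of the element size.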

void
permut32(void *arr, int dir, PERMUT_ROUT R);
void
permut64(void *arr, int dir, PERMUT_ROUT R);
Purpose: Special versions of permut for permuting 32-bit and 64-bit
         data.
Example: If arr is declared as:
             plural double arr[arrsize];
         then the following calls are equivalent:
             permut(arr,8,1,1,R);
         and
             permut64(arr,1,R);
         The latter is *slightly* faster, but the difference is in
         practice negligible, thus the routines permut32() and
         permut64() can in practice always be replaced by permut().
Calling p3pack from Fortran (HPF)
---------------------------------
Although the p3pack subroutines are written in MPL, an interface to
Fortran is provided. A Fortran 90 or High Performance Fortran compiler
is required, i.e. either MasPar Fortran or DECmpp HPF.
Some of the MPL routines described above cannot be called from Fortran
simply because that would not be useful (please contact the author if you
disagree). Since the calling sequences and parameter types are slightly
different from the MPL version, a complete description for Fortran use
is included below.
Compiler directives:
All p3pack subroutines that are called from Fortran must have been
declared as MPL subroutines by the compiler directive
    CMPF MPL subroutine1 subroutine2 ...
Compiling and linking:
    mpfortran [options] sourceCode -lp3pack
Description of the routines:
----------------------------
The routines are in 3 categories.
1) Routines working symbolically on bit-linear-permutations (blp).
   These are generally fast, and work on singular objects in the ACU.
2) Routines creating routing information (permRout) from blp. Routing
   information is stored in the DPU as plural data. These routines
   are the slowest routines, and should be used as seldom as possible,
   i.e. save and re-use routing information whenever possible.
3) Routines applying permutation information on a dataset to
   accomplish a permutation. These routines are as fast as the
   hardware allows.
Routines of category 1:
-----------------------
subroutine blp_identNew(blptr, nbits)
integer blptr, nbits
Purpose: Create the identity blp.
Inputs:  nbits : Size of matrix.
Output:  blptr : a pointer to a new identity blp.

subroutine blp_btprmNew(blptr, btprm, nbits)
integer blptr, nbits
integer btprm(nbits)
Purpose: Create a new blp from a btprm.
Inputs:  btprm : integer vector defining the bit-permutation.
         nbits : number of bits in the vector.
Output:  blptr : a pointer to a new blp.
Note:    The vector btprm *must* be stored on the front end.
         You may want to use the compiler directive
             CMPF ONFE btprm
         to ensure this.

subroutine blp_iMatNew(blptr, A, b, nbits)
integer blptr, nbits
integer A(nbits,nbits), b(nbits)
Purpose: Create a new blp from an integer matrix and integer vector.
Inputs:  A     : integer matrix of size nbits x nbits. The matrix should
                 only contain 0's and 1's.
         b     : right hand side vector of size nbits. The vector should
                 only contain 0's and 1's.
         nbits : Size of the matrix and the vector.
Output:  blptr : a pointer to a new blp.
Note:    The arrays A and b *must* be stored on the front end.
         You may want to use the compiler directive
             CMPF ONFE A, b
         to ensure this.

subroutine blp_copyNew(blptr, p)
integer blptr, p
Purpose: Create a new blp as an exact copy of another blp.
Input:   p     : pointer to the blp to be copied.
Output:  blptr : pointer to the copy.

subroutine blp_free(p)
integer p
Purpose: Release the memory associated with a blp.
Input:   p : pointer to the struct to be deleted.

subroutine blp_copy(A, B)
integer A, B
Purpose: Copy the contents of a blp to an existing blp of the same
         size.
Input:   B : blp to be copied.
Update:  A is changed to a copy of B.
Limitations: Error if A and B are of different size, or if A is not
         created before the call to the routine.

subroutine blp_print(P)
integer P
Purpose: Print the matrix and right vector in the blp in a nice
         format.
Input:   P : blp.
Output:  Prints the contents of P to stdout.

subroutine blp_rMult(A, B)
integer A, B
Purpose: Compute the product (i.e. the composition) of two blp's.
         If A and B are two blp's then AB is the blp obtained by
         FIRST executing B, and AFTERWARDS executing A.
Input:   A and B : blp.
Update:  A : blp. A is updated to the new value AB.

subroutine blp_lMult(A, B)
integer A, B
Purpose: The same as blp_rMult, but the multiplication is in the
         opposite order, i.e. B is multiplied from the LEFT instead
         of the right.
Input:   A and B : blp.
Update:  A : blp. A is updated to the new value BA.

subroutine blp_inv(P)
integer P
Purpose: Compute the inverse of a blp.
Input:   P : blp.
Update:  P : blp. P is updated to the inverse permutation.

subroutine blp_sDestAd(dstadr, P, srcadr)
integer P
integer dstadr, srcadr
Purpose: Compute the destination address, given the source address.
Input:   srcadr : source address.
         P      : blp.
Output:  dstadr : destination address.

subroutine blp_pDestAd(dstadr, P, srcadr)
integer P
integer dstadr(nproc), srcadr(nproc)
Purpose: Compute a set of destination addresses, given the source
         addresses.
Input:   srcadr : source addresses.
         P      : blp.
Output:  dstadr : destination addresses.
Note:    The addresses dstadr and srcadr are arrays that *must* be
         stored on the DPU. You may want to use the compiler directive
             CMPF ONDPU dstadr, srcadr
         to ensure this.
Routines of category 2:
-----------------------
subroutine blp_rout(R, A)
integer R, A
Purpose: Create routing information from a blp.
Input:   A : blp.
Output:  Creates a struct containing the routing information, and
         returns R, a pointer to the struct.
Limitations: Error message when the input matrix is singular.

subroutine pmr_free(R)
integer R
Purpose: Release the memory associated with a permRout.
Input:   R : pointer to the struct to be deleted.

subroutine pmr_blpCheck(err, P, R)
integer err, P, R
Purpose: Check a routing by comparing the result of a permutation
         with the computed destination addresses.
Input:   P : blp, R : permRout.
Output:  err : 0 if success, 1 if failure.
Note:    This routine is intended for debugging. It is also
         useful for detecting hardware errors: make a very large
         permutation, and check that it works correctly. Memory or
         router errors should show up.
Routines of category 3:
----------------------
subroutine permut(arr, blksiz, nblk, dir, R)
integer blksiz, nblk, dir, R
<any type> arr(*)
Purpose: Execute a permutation of an array arr, where each element in
         the array occupies blksiz*nblk bytes of space.
Input:   arr    : array to be permuted.
         blksiz : size of blocks sent by the router in the permutation.
                  This number must be 1, 2, 4 or 8.
         nblk   : the number of blocks in a matrix element.
         dir    : +1 for forward permutation, -1 for inverse.
Note:    The array arr *must* be stored on the DPU. You may want
         to use the compiler directive
             CMPF ONDPU arr
         to ensure this.
Comment: Highest speed is achieved when blksiz is as large as
         possible.
Example: If arr is declared as:
             real arr(arrsize)
         then the following calls are equivalent:
             call permut(arr,1,4,1,R)
         and
             call permut(arr,4,1,1,R)
         although the latter is *considerably* faster.
         In the latter case, the routine assumes that the starting
         address is properly aligned for reading 'plural long' data.
         Generally this should not cause problems. The former call
         reads 'plural char' from memory, and is thus valid for all
         alignments.

subroutine permut32(arr, dir, R)
integer dir, R
<any type> arr(*)
subroutine permut64(arr, dir, R)
integer dir, R
<any type> arr(*)
Purpose: Special versions of permut for permuting 32-bit and 64-bit
         data.
Example: If arr is declared as:
             double precision arr(arrsize)
         then the following calls are equivalent:
             call permut(arr,8,1,1,R)
         and
             call permut64(arr,1,R)
         The latter is *slightly* faster, but the difference is in
         practice negligible, thus the routines permut32() and
         permut64() can in practice always be replaced by permut().