Dynamic Partitioning of the Divide-and-Conquer
Scheme with Migration in PVM Environment
Pawel Czarnul , Karen Tomko , and Henryk Krawczyk
Electrical Engineering and Computer Science, University of Michigan, U.S.A.
pczarnul@eecs.umich.edu,
Dept. of Electrical and Computer Engineering and Computer Science
University of Cincinnati, U.S.A., ktomko@ececs.uc.edu
Faculty of Electronics, Telecommunications and Informatics
Technical University of Gdansk, Poland, hkrawk@pg.gda.pl
Abstract. We present a new C++ framework which makes it very easy to write divide-and-conquer (DaC) applications, which are then parallelized automatically by dynamic partitioning of the DaC tree and by process migration. The solution is based on DAMPVM, an extension of PVM. The proposed system handles irregular applications and dynamically adapts the allocation to minimize execution time, which is shown for numerical adaptive quadrature integration of two different functions.
1 Introduction
The divide-and-conquer scheme is widely used; it underlies many algorithms, e.g. sorting, integration and n-body simulation ([1]). Mapping a DaC tree onto a distributed network is not an easy task since the tree may be significantly unbalanced and may change unpredictably at runtime as intermediate results are obtained. Moreover, the available hardware may be heterogeneous with respect to system architecture, topology and processor speeds. We have developed an easy-to-use, object-oriented, C++-based DaC framework which is adaptively mapped to a multi-user distributed-memory system at runtime. At first, top-level branches of the DaC tree are executed as separate processes to keep all the processors busy. Then the framework is automatically parallelized further by DAMPVM ([2], [3], [4]), on which it is based: it partitions the DaC tree dynamically if some processors become underutilized and migrates tasks from overloaded processors to achieve the lowest possible execution time. The only development a user has to do is to derive their own C++ class from the supplied DAC template class and override a few virtual methods. The framework does not require any parallel-programming-specific code and is very general, and thus very easy to use. There are existing systems which facilitate parallel implementation of DaC-based algorithms. APERITIF
(Automatic Parallelization of Divide and Conquer Algorithms, formerly APRIL, [5]) translates C programs to be run on parallel computers with the use of PVM ([6]). (This research was supported in part by the Army Research Office/CECOM under the project "Efficient Numerical Solutions to Large Scale Tactical Communications Problems", DAAD19-00-1-0173.) REAPAR (REcursive programs Automatically PARallelized, [7], [8]) derives thread-based
parallel programs from C recursive code to be executed on SMP machines. Cilk ([9]) is
a similar thread-based approach. An extension of this language towards a more global
domain with the use of the Java technology is presented in the ATLAS ([10]) system.
Satin ([11]) is another Java-based approach targeted for distributed memory machines.
Other framework-based approaches are Frames ([12]) and the object-oriented Beeblebrox ([13]) system. An algebraic DaC model is described in [14]. The main contribution
of this work is its capability of using heterogeneous process migration (Section 2.2) in
mapping work to a system in addition to dynamic partitioning. Migration can tune the assignment at runtime and balance work without spawning more tasks if enough tasks are already in the system. This makes it possible to handle unbalanced DaC applications (in contrast to PVM-based APRIL, where subtrees should be of the same size to achieve good performance) in a multi-user environment. Other works describe mapping of the
DaC scheme to various system topologies ([15], [16]). Architecture-cognizant analysis
in which different variants may be chosen at different levels is presented in [17].
2 Divide-and-Conquer Framework
Figure 1 presents the general DaC paradigm (as considered by all the other approaches) as pseudocode, with some extensions provided by the proposed framework. It is assumed that each node in the DaC tree receives a data vector delimited by left and right pointers Object *vector_l and Object *vector_r. Object is a template parameter for the abstract class DAC, and a class the user derives from DAC should instantiate Object with a class/type suitable for its needs, e.g. double for sorting vectors of double numbers. In general, the vector (vector_l, vector_r) is either a terminal node (if the user-defined method DaC_Terminate(vector_l, vector_r) returns true) or is divided further into some number of subvectors by method DaC_Divide(vector_l, vector_r), which returns a list of left and right pointers to the subvectors. In the first case method DaC_LeafComputations(vector_l, vector_r) should provide the leaf computations; in the latter, method DaC_PreComputations(vector_l, vector_r) is executed and then the recursive call takes place. The procedure is repeated at deeper recursion levels. Upon return, method DaC_PostComputations(new_vectors, vector_l, vector_r) may provide code which merges the subvectors into the parent vector. This scheme is general and allows different numbers of subvectors at each node and different depths (unbalanced trees) depending on an application's needs (e.g. computation accuracy may determine the depth).
2.1 Dynamic Partitioning
The above well-known ([1]) scheme has been extended with dynamic partitioning of the DaC tree as well as with migration provided by DAMPVM. The main idea is that if the tree is very unbalanced then the initial partitioning may utilize some processors very poorly, resulting in low speed-up and poor scalability. Since computation times are often not known in advance, static partitioning may not give good scalability
template <class Object>                          // main recursive method
void DAC<Object>::DaC(Object *&vector_l, Object *&vector_r) {
  if (DaC has been requested for higher level than current) {
    spawn children; send data;
  }
  if (the highest depth) {
    receive data from parent;
    DaC_Initialize(vector_l, vector_r);
  }
  if (DaC_Terminate(vector_l, vector_r))
    DaC_LeafComputations(vector_l, vector_r);
  else {
    DaC_PreComputations(vector_l, vector_r);
    nHowManyNewVectors = Dac_HowManyNodes(vector_l, vector_r);
    new_vectors = DaC_Divide(vector_l, vector_r);
    if (more tasks needed) {
      spawn tasks; send data;
      inform DAMPVM that my size = DaC_VectorSize(new_vectors[0], new_vectors[1]);
      DaC(new_vectors[0], new_vectors[1]);
    } else {
      if (no tasks have been spawned)
        enable migration for this process;
      for (int nTask = 0; nTask < nHowManyNodesExecutedByThisTask; nTask++)
        DaC(new_vectors[2 * nTask], new_vectors[2 * nTask + 1]);
    }
    DaC_PostComputations(new_vectors, vector_l, vector_r);
    if (the highest level and I am not the root) send data to parent;
  }
}

template <class Object>
void DAC<Object>::Run(void) {                    // initialization method
  Object *vector_l, *vector_r;
  if ((PC_Parent() == PCNoParent) || (PC_HowStarted() == migrated)) {
    InitializeData(vector_l, vector_r);          // root process or migrated one
    DaC_Initialize(vector_l, vector_r);          // executes this
  }
  if (data has been sent to children) {
    inform DAMPVM I may be idle waiting;
    receive data;                                // (children read data inside)
  }
  DaC(vector_l, vector_r);                       // activate the recursive code
  if (PC_Parent() == PCNoParent) MasterReport(vector_l, vector_r);
}

Figure 1: DAC Recursive Code and Initialization Method
and dynamic reassignment is necessary. Dynamic partitioning of the DaC tree is shown in Figure 2. Initially the whole tree shown is supposed to be executed by one process. Each process keeps the variable nCurrentPartioningLevel, which indicates the highest level (higher levels are closer to the root and are denoted by lower numbers) at which the tree may be partitioned. There are as many DAMPVM schedulers as nodes, each running on a different host. When a scheduler detects a load below a certain threshold on its machine, it requests dynamic partitioning of the largest processes on more loaded neighboring nodes; the neighbor graph may be freely defined ([2]). New tasks are always spawned on underloaded nearest neighbors, and migration is used to tune the allocation ([3]). A user-defined function DaC_VectorSize(vector_l, vector_r) returns the predicted amount of work (in some units) for the given vector. Sometimes this can be determined from the complexity function of the algorithm with some coefficients; sometimes, however, it is not known precisely and can only be estimated in advance. The purpose of this method is to prompt DAMPVM as to which processes should be dynamically partitioned and which migrated. When a process with the tree shown in Figure 2 receives dynamic partitioning request 0, it partitions the tree at the nCurrentPartioningLevel level. This means that if there are more iterations of the for loop in Figure 1 left at the nCurrentPartioningLevel level, they are assigned to other processes which are created automatically at runtime, and the corresponding data is forwarded to the new processes. The process continues to work on its part of the tree and then receives results from the dynamically spawned processes at the nCurrentPartioningLevel level. Moreover, the proposed scheme allows many requests to be received by a process, and thus multiple partitioning. If the same process receives request 1, it does not partition itself at level 2 since no more iterations are available at this level for other processes. However, it partitions itself at level 3, at which there are 2 more iterations; these are assigned to processes 2 and 3 respectively and the corresponding tasks are spawned. Again, at level 3 data is collected from the spawned processes.
2.2 Process Migration
DAMPVM provides the ability to move a running process from one machine (stopping it there) to another (restarting it), which we refer to as process migration. The state of a process is transferred at the code level (at the expense of additional programming effort), not at the system level, but this still provides the same functionality. For spawn/migration details see [3] and [4]. As shown in Figure 1, migration is enabled for the current process ([2]) if the system does not need more tasks to be spawned (because all the processors are already busy) and no tasks have been spawned before by the process. As described in [2] and [3], migration can be triggered by dynamic process size changes, including spawns/terminations (irregular applications), and by other users' time-consuming processes. If a DAMPVM scheduler wants to migrate a task, its execution is interrupted by calling a PVM message handler which activates the function PackState(). This function packs all the data which describes the current process state. A new copy is spawned in a special mode on another machine and unpacks its state in the function UnPackState(). Both of these need to be supplied by the programmer. Moreover, some programs require a special programming style to make the process state recoverable. In return, the user is provided with very fast, flexible and heterogeneous migration.
3 Numerical Adaptive Quadrature Integration Example
As an example we have implemented a DaC-based numerical integration program which integrates any given function; the idea is proposed in [1]. In general, a function f and a range [a, b] are given. As shown in Figure 3a, if area C (the region between the single trapezoid over [a, b] and the two trapezoids obtained by splitting [a, b] at a pivot point) is small enough, then we can terminate the DaC strategy and compute the integral over [a, b] as the sum of areas A and B. Otherwise the range [a, b] is divided into two halves and the operation is repeated. To prevent the algorithm from terminating prematurely for functions like the one shown in Figure 3b, we pick ten different pivot points inside the range instead of only one and perform the area-C check ten times before going deeper. Intuitively, such an algorithm will give similar execution times for same-size subranges for some classes of functions, e.g. periodic functions with a period much smaller than the initial range. However, execution times may vary greatly for irregular functions. Since the algorithm does not know the function in advance, it can only assume that two ranges [a, b] and [c, d] take the same time if b − a = d − c. If only static partitioning is used, this will result in some processors finishing their work much sooner than others. The proposed scheme is to create a sufficient number of processes at runtime thanks to dynamic partitioning, and to balance them using migration.
[Figure 2: Dynamic Partitioning of DAC Tree — a four-level tree showing which subtrees are handed to newly spawned processes on dynamic partitioning requests 0 and 1]

[Figure 3: Integration Algorithm — a. areas A, B and C over a range [a, b]; b. a case where area C = 0 at the pivot]
The above example has an almost trivial implementation in the proposed DaC scheme, presented in Figure 4. In this implementation a vector always has three elements: the left and right ones contain the extreme left and right coordinates of the range, and the middle one contains 0 when the integration of the range starts and the computed value upon return. Thus the functions for migration (Figure 4, [2] and [3]) need contain only these three values. If a process computes a vector, it always does so from left to right, as implied by the DaC scheme. If its execution is interrupted, i.e. process migration occurs, it simply delivers the value for the already computed left subrange as the initial value, together with the not-yet-computed right range, to the migrated process (its new copy).
4 Experimental Results
The initial implementation of the DaC scheme described above has been tested on Linux/Sun workstations. The first four nodes run Linux and the fifth is a Sun workstation; the Linux workstations have a relative performance of 25, the Sun of 16.1. The experiments were performed on two different integration examples; the functions are characterized below:

1. f(x): a periodic function on the range [0, 400]; execution times for same-size ranges should therefore be similar.

2. g(x): a nonperiodic function defined piecewise on [0, 400].

Notice that in the latter case the range [200, 400] will be integrated almost immediately, since there area C will always be 0 regardless of the pivot point chosen. On the other hand, the integration of [0, 200] will take a long time. If there is only static partitioning and
double fInitialLeft, fInitialRight; // top level range for this process
double fRight, fVal = 0;            // current right coordinate and value
                                    // for range [fInitialLeft, fRight]
double fIntegrationRange[3];        // initial left coordinate, value, right coordinate

void PackState() { // pack the values (left coordinate, computed value and the
                   // right coordinate) for the uncomputed range [fRight, fInitialRight]
  PC_PkDouble(&fRight); PC_PkDouble(&fVal); PC_PkDouble(&fInitialRight);
}

void UnPackState() { // unpack the coordinates and initial value
  PC_UPkDouble(&(fIntegrationRange[0])); PC_UPkDouble(&(fIntegrationRange[1]));
  PC_UPkDouble(&(fIntegrationRange[2])); fVal = fIntegrationRange[1];
}

int MyDAC::Dac_HowManyNodes(double *vector_l, double *vector_r) { return 2; }

long MyDAC::DaC_VectorSize(double *vector_l, double *vector_r) {
  return (long)(100 * (*vector_r - *vector_l)); // predicted process size
}

void MyDAC::DaC_Initialize(double *vector_l, double *vector_r) {
  // remember top level range and set the initial right coordinate
  fInitialLeft = *vector_l; fInitialRight = *vector_r; fRight = fInitialLeft;
}

double **MyDAC::DaC_Divide(double *tab_l, double *tab_r) {
  double fPivot = (*tab_l + *tab_r) / 2;
  // these are new subvectors -- allocate memory for them
  new_vectors[0] = *tab_l; new_vectors[1] = 0;      new_vectors[2] = fPivot;
  new_vectors[3] = fPivot; new_vectors[4] = 0;      new_vectors[5] = *tab_r;
  // return pointers to subvectors' coordinates
  tab_pointers[0] = new_vectors;     tab_pointers[1] = new_vectors + 2;
  tab_pointers[2] = new_vectors + 3; tab_pointers[3] = new_vectors + 5;
  return tab_pointers;
}

void MyDAC::DaC_PostComputations(double **new_tab, double *&tab_l, double *&tab_r) {
  *(tab_l + 1) += (*new_tab)[1] + (*new_tab)[4]; // add results from children
  // deallocate memory associated with the subvectors
}

bool MyDAC::DaC_Terminate(double *tab_l, double *tab_r) {
  if (ComputeCArea(tab_l, tab_r, &fPivot) < 0.000000001 /* for 10 pivots */)
    return true;
  else return false;
}

void MyDAC::DaC_LeafComputations(double *tab_l, double *tab_r) {
  float fArea = ComputeTrapezoidArea(tab_l, tab_r); // add (A+B) areas and remember
  *(tab_l + 1) += fArea; fVal += fArea; fRight = *tab_r; // the current right coordinate
}

void MyDAC::InitializeData(double *&tab_l, double *&tab_r) {
  // executed for every new process -- initialize data
  tab_l = (double *)fIntegrationRange; tab_r = ((double *)fIntegrationRange + 2);
  fInitialRight = *tab_r; fInitialLeft = *tab_l; fRight = fInitialLeft;
}

void MyDAC::MasterReport(double *vector_l, double *vector_r) {
  cout << *(vector_l + 1); // print the final result
}

int main(int argc, char **argv) {
  DaC_Init(&argc, &argv); // DaC initialization
  if (PC_HowStarted() != migrated) { // if this is not a migrated task
    // set initial left coordinate, result and right coordinate
    fIntegrationRange[0] = 0; fIntegrationRange[1] = 0; fIntegrationRange[2] = 400;
  }
  MyDAC mdcIntDAC(&PC_PkDouble, &PC_UPkDouble); // create a DaC object, pass data
  // packing and unpacking functions for the double type and activate the object
  mdcIntDAC.Run(); // simply run the object and wait for results
  DaC_Finish();    // DaC termination
}

Figure 4: Numerical Adaptive Quadrature Integration – Complete Source Code
two available nodes, two processes would be attached to these processors to integrate the ranges [0, 200] and [200, 400]. After a while the second node becomes idle, which makes the total execution time practically the same as for one processor. DAMPVM, on the other hand, detects the load imbalance and activates our dynamic DaC scheme, which partitions the range [0, 200] into [0, 100] and [100, 200] and places one process on the idle processor, resulting in almost the best speed-up. Obviously, the periodic function f(x) should give better performance, as there are no such dynamic load imbalances.
The execution times and speed-ups for short runs are shown in Figures 5 and 6. Heterogeneous migration between the Linux and Solaris workstations is extremely fast for this example. Benefits from migration over dynamic DaC alone were observed for configurations in which other users disturb the load balance: migration can then rebalance the load before dynamic partitioning is invoked, provided enough processes are available.
[Figure 5: Execution Time — execution time in seconds vs. number of processors (1–5) for the periodic function f(x) and the nonperiodic function g(x)]

[Figure 6: Speed-up — short-run speed-up vs. number of processors (1–5) for f(x), g(x) and the optimal case]
5 Conclusions and Future Work
We presented a dynamic divide-and-conquer scheme which aims at partitioning load dynamically and at dynamic load balancing with the use of the migration procedures supported by DAMPVM. The proposed implementation is able to detect load imbalance in a parallel environment at runtime, partition data to keep all the processors busy, and balance their workloads. Our experiments show that such a scheme can partition and map a binary tree onto a 3-processor system quite well. The proposed DaC software will be available at the DAMPVM Web site ([4]). Future work will focus on closer integration of the DaC and migration schemes. A better load balancing algorithm ([18]) is currently being incorporated into the code, which will give better performance for larger networks. We plan to implement many different examples, possibly enhance the proposed scheme for efficient parallel execution of different applications, and test it on larger heterogeneous LANs. A direct performance comparison with other existing approaches, including the Java-based ones, will also be made.
References
1. B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.
2. P. Czarnul and H. Krawczyk, “Dynamic Assignment with Process Migration in Distributed
Environments,” in Recent Advances in Parallel Virtual Machine and Message Passing Interface, Vol. 1697 of Lecture Notes in Computer Science, pp. 509–516, 1999.
3. P. Czarnul and H. Krawczyk, “Parallel Program Execution with Process Migration,” in Proceedings of the International Conference on Parallel Computing in Electrical Engineering,
(Trois-Rivieres, Canada), IEEE Computer Society, August 2000.
4. DAMPVM Web Site: http://www.ask.eti.pg.gda.pl/~pczarnul/DAMPVM.html.
5. T. Erlebach, APRIL 1.0 User Manual, Automatic Parallelization of Divide and Conquer Algorithms. Technische Universität München, Germany, http://wwwmayr.informatik.tu-muenchen.de/personen/erlebach/aperitif.html, 1995.
6. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, “PVM
3 user’s guide and reference manual,” Tech. Rep. ORNL/TM-12187, Oak Ridge National
Laboratory, May 1993. http://www.epm.ornl.gov/pvm.
7. L. Prechelt and S. Hänßgen, "Efficient parallel execution of irregular recursive programs," submitted to IEEE Transactions on Parallel and Distributed Systems, December 2000. http://wwwipd.ira.uka.de/~prechelt/Biblio/Biblio/reapar_tpds2001.ps.gz.
8. S. Hänßgen, "REAPAR User Manual and Reference: Automatic Parallelization of Irregular Recursive Programs," Tech. Rep. 8/98, Universität Karlsruhe, 1998. http://wwwipd.ira.uka.de/~haensgen/reapar/reapar.html.
9. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An efficient multithreaded runtime system," in Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 207–216, July 1995.
10. J. Baldeschwieler, R. Blumofe, and E. Brewer, "ATLAS: An Infrastructure for Global Computing," in Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.
11. R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal, "Satin: Efficient Parallel Divide-and-Conquer in Java," in Euro-Par 2000 Parallel Processing, Proceedings of the 6th International Euro-Par Conference, No. 1900 in LNCS, pp. 690–699, 2000.
12. J. P. i Silvestre and T. Romke, “Programming Frames for the efficient use of parallel systems,” Tech. Rep. TR-183-97, Paderborn Center for Parallel Computing, January 1997.
13. A. J. Piper and R. W. Prager, "Generalized Parallel Programming with Divide-and-Conquer: The Beeblebrox System," Tech. Rep. CUED/F-INFENG/TR132, Cambridge University Engineering Department, 1993. ftp://svr-ftp.eng.cam.ac.uk/pub/reports/piper_tr132.ps.Z.
14. Z. G. Mou and P. Hudak, “An algebraic model for divide-and-conquer and its parallelism,”
The Journal of Supercomputing, Vol. 2, pp. 257–278, Nov. 1988.
15. V. Lo, S. Rajopadhye, J. Telle, and X. Zhong, “Parallel Divide and Conquer on Meshes,”
IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 10, pp. 1049–1057,
1996.
16. I. Wu, “Efficient parallel divide-and-conquer for a class of interconnection topologies,” in
Proceedings of the 2nd International Symposium on Algorithms, No. 557 in Lecture Notes in
Computer Science, (Taipei, Republic of China), pp. 229–240, Springer-Verlag, Dec. 1991.
17. K. S. Gatlin and L. Carter, “Architecture-Cognizant Divide and Conquer Algorithms,” SuperComputing ’99, Nov. 1999.
18. P. Czarnul, K. Tomko, and H. Krawczyk, "A Heuristic Dynamic Load Balancing Algorithm for Meshes," (Anaheim, CA, USA), 2001. Accepted for presentation at PDCS 2001.