Dynamic Partitioning of the Divide-and-Conquer Scheme with Migration in PVM Environment

Pawel Czarnul (Electrical Engineering and Computer Science, University of Michigan, U.S.A., pczarnul@eecs.umich.edu), Karen Tomko (Dept. of Electrical and Computer Engineering and Computer Science, University of Cincinnati, U.S.A., ktomko@ececs.uc.edu), and Henryk Krawczyk (Faculty of Electronics, Telecommunications and Informatics, Technical University of Gdansk, Poland, hkrawk@pg.gda.pl)

This research was supported in part by the Army Research Office/CECOM under the project "Efficient Numerical Solutions to Large Scale Tactical Communications Problems" (DAAD19-00-1-0173).

Abstract

We present a new C++ framework which makes it very easy to write divide-and-conquer (DaC) applications that are then parallelized automatically by dynamic partitioning of the DaC tree and by process migration. The solution is based on DAMPVM, an extension of PVM. The proposed system handles irregular applications and adapts the allocation dynamically to minimize execution time, which is shown for a numerical adaptive quadrature integration example with two different functions.

1 Introduction

The divide-and-conquer scheme is widely used since it underlies many algorithms, e.g. sorting, integration and n-body simulation ([1]). Mapping a DaC tree onto a distributed network is not an easy task, since the tree may be significantly unbalanced and may change unpredictably at runtime as intermediate results are obtained. Moreover, the available hardware may be heterogeneous with respect to system architecture, topology and processor speeds. We have developed an easy-to-use, object-oriented, C++-based DaC framework which is adaptively mapped to a multi-user distributed-memory system at runtime. At first, top-level branches of the DaC tree are executed as separate processes to keep all the processors busy. Then the framework is automatically parallelized further by DAMPVM ([2], [3], [4]), on which it is based: it partitions the DaC tree dynamically if some processors become underutilized and migrates tasks from more loaded processors to achieve the lowest possible execution time. The only development a user has to do is to derive their own C++ class from the supplied DAC template class and override a few virtual methods. The framework does not require any parallel-programming-specific code; it is very general and thus very easy to use.

There are existing systems which facilitate parallel implementation of DaC-based algorithms. APERITIF (Automatic Parallelization of Divide and Conquer Algorithms, formerly APRIL, [5]) translates C programs to be run on parallel computers with the use of PVM ([6]). REAPAR (REcursive programs Automatically PARallelized, [7], [8]) derives thread-based parallel programs from recursive C code to be executed on SMP machines. Cilk ([9]) is a similar thread-based approach. An extension of this language towards a more global domain with the use of Java technology is presented in the ATLAS ([10]) system. Satin ([11]) is another Java-based approach, targeted at distributed-memory machines. Other framework-based approaches are Frames ([12]) and the object-oriented Beeblebrox ([13]) system. An algebraic DaC model is described in [14]. The main contribution of this work is its capability of using heterogeneous process migration (Section 2.2) in mapping work to a system, in addition to dynamic partitioning.
Migration can tune the assignment at runtime and balance work without spawning more tasks if enough tasks are already in the system. This makes it possible to handle unbalanced DaC applications in a multi-user environment (in contrast to the PVM-based APRIL, where subtrees should be of similar size to achieve good performance). Other works describe mapping of the DaC scheme to various system topologies ([15], [16]). An architecture-cognizant analysis, in which different variants may be chosen at different levels, is presented in [17].

2 Divide-and-Conquer Framework

Figure 1 presents the general DaC paradigm (as considered by all the other approaches) as pseudocode, together with some extensions provided by the proposed framework. It is assumed that each node in the DaC tree receives a data vector delimited by left and right pointers Object *vector_l and Object *vector_r. Object is a template parameter of the abstract class DAC, and a user class derived from DAC should instantiate Object with a class/type suitable for its needs, e.g. double for sorting vectors of double numbers. In general, vector (vector_l, vector_r) is either a terminal node (if the user-defined method DaC_Terminate(vector_l, vector_r) returns true) or is divided further into some number of subvectors by method DaC_Divide(vector_l, vector_r), which returns a list of left and right pointers to the subvectors. In the first case method DaC_LeafComputations(vector_l, vector_r) should provide the leaf computations; in the latter, method DaC_PreComputations(vector_l, vector_r) is executed and then the recursive call takes place. The procedure is repeated at deeper recursion levels. Upon return, method DaC_PostComputations(new_vectors, vector_l, vector_r) may provide code which merges the subvectors into the parent vector. This scheme is general and allows different numbers of subvectors at each node as well as different depths (unbalanced trees), depending on an application's needs (e.g. the required computation accuracy may determine the depth).
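To illustrate the resulting programming interface, a minimal skeleton of such a user-derived class is sketched below; the exact signatures are assumptions reconstructed from the description above and from Figure 4, so it should be read as an outline rather than the precise DAC declaration.

    // Minimal skeleton of a user-derived class; the signatures are assumptions
    // reconstructed from the method descriptions above and from Figure 4
    class MyDAC : public DAC<double> {
    public:
      bool DaC_Terminate(double *vector_l, double *vector_r);        // terminal node?
      double **DaC_Divide(double *vector_l, double *vector_r);       // split into subvectors
      void DaC_LeafComputations(double *vector_l, double *vector_r); // work at a leaf
      void DaC_PreComputations(double *vector_l, double *vector_r);  // before recursion
      void DaC_PostComputations(double **new_vectors,                // merge the children's
                                double *&vector_l, double *&vector_r); // results
      long DaC_VectorSize(double *vector_l, double *vector_r);       // predicted work
    };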
2.1 Dynamic Partitioning

The well-known scheme above ([1]) has been extended with dynamic partitioning of the DaC tree as well as migration, both provided by DAMPVM. The main idea is that if the tree is very unbalanced then the initial partitioning may give very poor utilization of some processors, resulting in low speed-up values and poor scalability. Since computation times are often not known in advance, static partitioning may not give good scalability and dynamic reassignment is necessary.

     1  template <class Object>                        // main recursive method
     2  void DAC<Object>::DaC(Object *&vector_l, Object *&vector_r) {
     3    if (DaC has been requested for a higher level than the current one) {
     4      spawn children; send data; }
     5    if (the highest depth) receive data from parent;
     6    DaC_Initialize(vector_l, vector_r);
     7    if (DaC_Terminate(vector_l, vector_r))
     8      DaC_LeafComputations(vector_l, vector_r);
     9    else {
    10      DaC_PreComputations(vector_l, vector_r);
    11      nHowManyNewVectors = Dac_HowManyNodes(vector_l, vector_r);
    12      new_vectors = DaC_Divide(vector_l, vector_r);
    13      if (more tasks needed) {
    14        spawn tasks; send data;
    15        inform DAMPVM that my size =
    16          DaC_VectorSize(new_vectors[0], new_vectors[1]);
    17        DaC(new_vectors[0], new_vectors[1]);
    18      } else {
    19        if (no tasks have been spawned) enable migration for this process;
    20        for (int nTask = 0; nTask < nHowManyNodesExecutedByThisTask; nTask++)
    21          DaC(new_vectors[2*nTask], new_vectors[2*nTask+1]);
    22      }
    23      DaC_PostComputations(new_vectors, vector_l, vector_r);
    24    }
    25    if (the highest level and I am not the root) send data to parent;
    26  }
    27
    28  template <class Object>                        // initialization method
    29  void DAC<Object>::Run(void) {
    30    Object *vector_l, *vector_r;
    31    if ((PC_Parent() == PCNoParent) || (PC_HowStarted() == migrated))
    32      InitializeData(vector_l, vector_r);        // root process or migrated one
    33    DaC_Initialize(vector_l, vector_r);
    34    if (data has been sent to children) {
    35      inform DAMPVM that I may be idle waiting; receive data; }
    36    DaC(vector_l, vector_r);                     // activate the recursive code
    37                                                 // (children read data inside)
    38    if (PC_Parent() == PCNoParent)
    39      MasterReport(vector_l, vector_r);
    40  }

Figure 1: DAC Recursive Code and Initialization Method

Dynamic partitioning of the DaC tree is shown in Figure 2. Initially the whole tree shown is supposed to be executed by one process. Each process keeps a variable nCurrentPartioningLevel which indicates the highest level (higher levels are closer to the root and are denoted by lower numbers) at which the tree may be partitioned. There are as many DAMPVM schedulers as there are nodes, each running on a different host. When a scheduler detects a load below a certain threshold on its machine, it requests dynamic partitioning of the largest processes on more loaded neighboring nodes; a neighbor graph may be freely defined ([2]). New tasks are always spawned on underloaded nearest neighbors, and migration is used to tune the allocation ([3]). A user-defined function DaC_VectorSize(vector_l, vector_r) returns the predicted amount of work (in some units) for the given vector. Sometimes this may be determined by the complexity function of the algorithm with some coefficients. Sometimes, however, it may not be known precisely, only estimated in advance.
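As an illustration, a hypothetical sorting application (not part of the integration example of Section 3) could base this estimate on the complexity function of the algorithm, e.g. n log n for mergesort. A sketch, assuming a class MySortDAC derived from DAC<double> as outlined earlier:

    #include <cmath>

    // Sketch: complexity-based work estimate for a hypothetical sorting
    // application (the class MySortDAC is an assumption for illustration);
    // the predicted work for n elements is taken as proportional to n*log(n)
    long MySortDAC::DaC_VectorSize(double *vector_l, double *vector_r) {
      long n = (long)(vector_r - vector_l) + 1;    // number of elements in the vector
      return (long)(n * std::log((double)n + 1));  // predicted work, abstract units
    }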
The purpose of this method is to prompt DAMPVM which processes should be dynamically partitioned and which migrated. When a process with the tree shown in Figure 2 receives dynamic partitioning request 0, it partitions the tree at level nCurrentPartioningLevel. This means that if there are more iterations of the loop (line 20 in Figure 1) left at the nCurrentPartioningLevel level, they are assigned to other processes which are automatically created at runtime, and the corresponding data is forwarded to the new processes. The process continues to work on its own part of the tree and then receives results from the dynamically spawned processes at the nCurrentPartioningLevel level. Moreover, the proposed scheme allows many requests to be received by a process and thus multiple partitioning. If the same process receives request 1, it does not partition itself at level 2, since there are no more iterations at this level available for other processes. It partitions itself at level 3 instead, at which there are two more iterations; these are assigned to processes 2 and 3 respectively and the corresponding tasks are spawned. Again, at level 3 data is collected from the spawned processes.

2.2 Process Migration

DAMPVM provides the ability to move a running process from one machine (stopping it there) to another (restarting it), which we refer to as process migration. The state of a process is transferred at the code level (at the expense of additional programming effort), not at the system level, but this still provides the same functionality. For spawn/migration details see [3] and [4]. As shown in Figure 1, migration is enabled for the current process ([2]) if the system does not need more tasks to be spawned (because all the processors are already busy) and no tasks have been spawned before by the process. As described in [2] and [3], migration can be triggered by dynamic process size changes, including spawns/terminations (irregular applications), and by other users' time-consuming processes. If a DAMPVM scheduler wants to migrate a task, the task's execution is interrupted by calling a PVM message handler which activates function PackState(). This function packs all the data necessary to describe the current process state. A new copy is spawned in a special mode on another machine and unpacks its state in function UnPackState(). Both functions need to be supplied by the programmer. Moreover, some programs require a special programming style so that the process state can be recovered. In return, a user is provided with very fast, flexible and heterogeneous migration.

3 Numerical Adaptive Quadrature Integration Example

As an example we have implemented a DaC-based numerical integration application which integrates any given function. The idea is proposed in [1]. In general, a function and a range are given. As shown in Figure 3a, if area C is small enough then we can terminate the DaC strategy and compute the integral over [a, b] as the sum of areas A and B. Otherwise range [a, b] is divided into two ranges [a, (a+b)/2] and [(a+b)/2, b] and the operation is repeated. To prevent the algorithm from terminating prematurely for functions such as the one shown in Figure 3b, we pick ten different points inside the range instead of only one and perform the area-C check ten times before going deeper. Intuitively, such an algorithm will give similar execution times for same-size subranges for some classes of functions, e.g. periodic functions with a period much smaller than the initial range. However, execution times may vary greatly for irregular functions.
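To make the area check concrete, the sketch below shows one possible shape of the helper functions named in Figure 4 (ComputeTrapezoidArea and ComputeCArea); the plain-double signatures, the sample integrand and the pivot placement are simplifying assumptions, not the exact code of the framework.

    #include <cmath>

    // Sample integrand; an assumption for illustration only
    double f(double x) { return std::sin(x); }

    // Trapezoid approximation of the integral of f over [l, r]
    double ComputeTrapezoidArea(double l, double r) {
      return (f(l) + f(r)) * (r - l) / 2;
    }

    // Area C of Figure 3a for pivot p: the difference between one trapezoid
    // over [l, r] and the two trapezoids over [l, p] and [p, r] (areas A and B)
    double ComputeCArea(double l, double r, double p) {
      return std::fabs(ComputeTrapezoidArea(l, r)
                       - ComputeTrapezoidArea(l, p)
                       - ComputeTrapezoidArea(p, r));
    }

    // Termination test in the spirit of DaC_Terminate in Figure 4: the range
    // is accepted only if area C stays below the threshold for ten pivots
    bool CAreaSmallForTenPivots(double l, double r, double eps) {
      for (int i = 1; i <= 10; i++) {
        double p = l + i * (r - l) / 11;   // ten equally spaced interior points
        if (ComputeCArea(l, r, p) >= eps)
          return false;
      }
      return true;
    }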
Since the algorithm does not know the function in advance, it can assume that two ranges [a, b] and [c, d] take the same time to integrate if b - a = d - c. If only static partitioning is used, this results in some processors finishing their work much sooner than others. The proposed scheme is to create a sufficient number of processes at runtime thanks to dynamic partitioning and to balance them using migration.

Figure 2: Dynamic Partitioning of DAC Tree (a process tree with levels 1-4; dynamic partitioning requests cause new processes to be spawned)

Figure 3: Integration Algorithm (a. the area check over range [a, b] with trapezoid areas A and B and area C; b. a function for which a single pivot gives C = 0)

The above example has an almost trivial implementation in the proposed DaC scheme, presented in Figure 4. In this implementation a vector always has three elements: the left and right ones contain the extreme left and right coordinates of the range, and the middle one contains 0 when the integration of the range starts and the computed value upon return. Thus the functions for migration (Figure 4, [2] and [3]) need to contain only these three values. If a process computes a vector, it always does so from left to right, as implied by the DaC scheme. If its execution is interrupted, i.e. process migration occurs, it simply delivers the value for the left, already computed subrange as the initial value, and the right, not yet computed range to the migrated process (its new copy).

4 Experimental Results

The initial implementation of the DaC scheme described above has been tested on Linux/Sun workstations. The first four nodes run Linux and the fifth is a Sun workstation. The Linux workstations have a relative performance of 25, the Sun of 16.1. The experiments were performed on two different integration examples; the functions are given below:

1. a periodic function f(x), a sine whose period is much smaller than the initial range [0, 400]; the execution times for same-size ranges should therefore be similar,
2. a non-periodic function g(x), defined piecewise: a rapidly oscillating sine-based expression for 0 ≤ x ≤ 200 and a linear one for 200 ≤ x ≤ 400.

Notice that in the latter case range [200, 400] will be integrated almost immediately, since area C will always be 0 there regardless of the pivot point chosen. On the other hand, the integration of [0, 200] will take a long time.
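A function of this shape is, for example,

    g(x) = sin(x^2)   for 0 ≤ x ≤ 200,
    g(x) = 0          for 200 < x ≤ 400,

for which the trapezoid check succeeds immediately everywhere in [200, 400], while the fast oscillation forces deep recursion in [0, 200]; this particular formula is given only as an illustration of the stated properties.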
    double fInitialLeft, fInitialRight; // top level range for this process
    double fRight, fVal = 0;            // current right coordinate and value
                                        // for range [fInitialLeft, fRight]
    double fIntegrationRange[3];        // initial left coordinate, value,
                                        // right coordinate

    void PackState() {
      // pack the values (left coordinate, computed value and the right
      // coordinate) for the uncomputed range [fRight, fInitialRight]
      PC_PkDouble(&fRight); PC_PkDouble(&fVal); PC_PkDouble(&fInitialRight);
    }

    void UnPackState() { // unpack the coordinates and the initial value
      PC_UPkDouble(&(fIntegrationRange[0]));
      PC_UPkDouble(&(fIntegrationRange[1]));
      PC_UPkDouble(&(fIntegrationRange[2]));
      fVal = fIntegrationRange[1];
    }

    int MyDAC::Dac_HowManyNodes(double *vector_l, double *vector_r) {
      return 2;
    }

    long MyDAC::DaC_VectorSize(double *vector_l, double *vector_r) {
      return (long)(100 * (*vector_r - *vector_l)); // predicted process size
    }

    void MyDAC::DaC_Initialize(double *vector_l, double *vector_r) {
      // remember the top level range and set the initial right coordinate
      fInitialLeft = *vector_l; fInitialRight = *vector_r;
      fRight = fInitialLeft;
    }

    double **MyDAC::DaC_Divide(double *tab_l, double *tab_r) {
      double fPivot = (*tab_l + *tab_r) / 2;
      // these are the new subvectors -- allocate memory for them
      // (new_vectors and tab_pointers are member arrays)
      new_vectors[0] = *tab_l;  new_vectors[1] = 0; new_vectors[2] = fPivot;
      new_vectors[3] = fPivot;  new_vectors[4] = 0; new_vectors[5] = *tab_r;
      // return pointers to the subvectors' coordinates
      tab_pointers[0] = new_vectors;     tab_pointers[1] = new_vectors + 2;
      tab_pointers[2] = new_vectors + 3; tab_pointers[3] = new_vectors + 5;
      return tab_pointers;
    }

    void MyDAC::DaC_PostComputations(double **new_tab,
                                     double *&tab_l, double *&tab_r) {
      *(tab_l + 1) += (*new_tab)[1] + (*new_tab)[4]; // add results from children
      // deallocate memory associated with the subvectors
    }

    bool MyDAC::DaC_Terminate(double *tab_l, double *tab_r) {
      if (ComputeCArea(tab_l, tab_r, &fPivot) <
          0.000000001 for 10 pivots)  // the check is repeated for ten pivots
        return true;
      else
        return false;
    }

    void MyDAC::DaC_LeafComputations(double *tab_l, double *tab_r) {
      float fArea = ComputeTrapezoidArea(tab_l, tab_r);
      *(tab_l + 1) += fArea;          // add (A+B) areas and remember
      fVal += fArea;
      fRight = *tab_r;                // the current right coordinate
    }

    void MyDAC::InitializeData(double *&tab_l, double *&tab_r) {
      // executed for every new process -- initialize data
      tab_l = (double *)fIntegrationRange;
      tab_r = ((double *)fIntegrationRange + 2);
      fInitialRight = *tab_r; fInitialLeft = *tab_l;
      fRight = fInitialLeft;
    }

    void MyDAC::MasterReport(double *vector_l, double *vector_r) {
      cout << *(vector_l + 1);        // print the final result
    }

    main(int argc, char **argv) {
      DaC_Init(&argc, &argv);         // DaC initialization
      if (PC_HowStarted() != migrated) {
        // if this is not a migrated task, set the initial left coordinate,
        // the result and the right coordinate
        fIntegrationRange[0] = 0; fIntegrationRange[1] = 0;
        fIntegrationRange[2] = 400;
      }
      MyDAC mdcIntDAC(&PC_PkDouble, &PC_UPkDouble);
      // create a DaC object, passing the packing and unpacking functions
      // for the double type, then activate the object
      mdcIntDAC.Run();                // simply run the object and wait for results
      DaC_Finish();                   // DaC termination
    }

Figure 4: Numerical Adaptive Quadrature Integration – Complete Source Code

If there is only static partitioning and two available nodes, two processes would be attached to these processors to integrate ranges [0, 200] and [200, 400]. After a while the second node becomes idle, which results in a total execution time practically the same as for one processor. DAMPVM, on the other hand, detects the load imbalance and activates our dynamic DaC scheme, which partitions range [0, 200] into two ranges [0, 100] and [100, 200] and places one process on the idle processor, which gives almost the best possible speed-up. Obviously, the periodic f(x) case should give better performance, as there are no such dynamic load imbalances there. The execution times and speed-ups for short runs are shown in Figures 5 and 6. Heterogeneous migration between the Linux and Solaris workstations is extremely fast for this example. Benefits from migration compared to the dynamic DaC scheme alone were observed for configurations in which other users disturb the load balance; migration can then balance the load before dynamic partitioning is invoked, provided enough processes are available.

Figure 5: Execution Time (execution time [s] vs. the number of processors, 1-5, for the periodic function f(x) and the nonperiodic function g(x))

Figure 6: Speed-up (short-run speed-up vs. the number of processors, 1-5, for f(x), g(x) and the optimal speed-up)

5 Conclusions and Future Work

We presented a dynamic divide-and-conquer scheme which aims at partitioning load dynamically and at dynamic load balancing with the use of the migration procedures supported by DAMPVM.
The proposed implementation is able to detect load imbalance in a parallel environment at runtime, partition data to keep all the processors busy and balance their workloads. Such a scheme can partition and map a binary tree onto a 3-processor system quite well, as shown in our experiments. The proposed DaC software will be available at the DAMPVM Web site ([4]). Future work will focus on closer integration of the DaC and migration schemes. A better load balancing algorithm ([18]) is currently being incorporated into the code, which will give better performance for larger networks. We plan to implement many different examples, possibly enhance the proposed scheme for efficient parallel execution of other applications, and test it on larger heterogeneous LANs. A direct performance comparison with other existing approaches, including the Java-based ones, will also be made.

References

1. B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.
2. P. Czarnul and H. Krawczyk, "Dynamic Assignment with Process Migration in Distributed Environments," in Recent Advances in Parallel Virtual Machine and Message Passing Interface, Vol. 1697 of Lecture Notes in Computer Science, pp. 509–516, 1999.
3. P. Czarnul and H. Krawczyk, "Parallel Program Execution with Process Migration," in Proceedings of the International Conference on Parallel Computing in Electrical Engineering, (Trois-Rivieres, Canada), IEEE Computer Society, August 2000.
4. DAMPVM Web Site: http://www.ask.eti.pg.gda.pl/~pczarnul/DAMPVM.html.
5. T. Erlebach, APRIL 1.0 User Manual, Automatic Parallelization of Divide and Conquer Algorithms. Technische Universität München, Germany, http://wwwmayr.informatik.tu-muenchen.de/personen/erlebach/aperitif.html, 1995.
6. A. Geist, A. Beguelin, J. J. Dongarra, W. Jiang, R. Manchek, and V. S. Sunderam, "PVM 3 User's Guide and Reference Manual," Tech. Rep. ORNL/TM-12187, Oak Ridge National Laboratory, May 1993. http://www.epm.ornl.gov/pvm.
7. L. Prechelt and S. Hänsgen, "Efficient parallel execution of irregular recursive programs," submitted to IEEE Transactions on Parallel and Distributed Systems, December 2000. http://wwwipd.ira.uka.de/~prechelt/Biblio/Biblio/reapar_tpds2001.ps.gz.
8. S. Hänsgen, "REAPAR User Manual and Reference: Automatic Parallelization of Irregular Recursive Programs," Tech. Rep. 8/98, Universität Karlsruhe, 1998. http://wwwipd.ira.uka.de/~haensgen/reapar/reapar.html.
9. R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, "Cilk: An Efficient Multithreaded Runtime System," in Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 207–216, July 1995.
10. J. Baldeschwieler, R. Blumofe, and E. Brewer, "ATLAS: An Infrastructure for Global Computing," in Proceedings of the Seventh ACM SIGOPS European Workshop on System Support for Worldwide Applications, 1996.
11. R. V. van Nieuwpoort, T. Kielmann, and H. E. Bal, "Satin: Efficient Parallel Divide-and-Conquer in Java," in Euro-Par 2000 Parallel Processing, Proceedings of the 6th International Euro-Par Conference, No. 1900 in LNCS, pp. 690–699, 2000.
12. J. P. i Silvestre and T. Römke, "Programming Frames for the efficient use of parallel systems," Tech. Rep. TR-183-97, Paderborn Center for Parallel Computing, January 1997.
13. A. J. Piper and R. W. Prager, "Generalized Parallel Programming with Divide-and-Conquer: The Beeblebrox System," Tech. Rep.
CUED/F-INFENG/TR132, Cambridge University Engineering Department, 1993. ftp://svr-ftp.eng.cam.ac.uk/pub/reports/piper_tr132.ps.Z.
14. Z. G. Mou and P. Hudak, "An algebraic model for divide-and-conquer and its parallelism," The Journal of Supercomputing, Vol. 2, pp. 257–278, Nov. 1988.
15. V. Lo, S. Rajopadhye, J. Telle, and X. Zhong, "Parallel Divide and Conquer on Meshes," IEEE Transactions on Parallel and Distributed Systems, Vol. 7, No. 10, pp. 1049–1057, 1996.
16. I. Wu, "Efficient parallel divide-and-conquer for a class of interconnection topologies," in Proceedings of the 2nd International Symposium on Algorithms, No. 557 in Lecture Notes in Computer Science, (Taipei, Republic of China), pp. 229–240, Springer-Verlag, Dec. 1991.
17. K. S. Gatlin and L. Carter, "Architecture-Cognizant Divide and Conquer Algorithms," in Proceedings of SuperComputing '99, Nov. 1999.
18. P. Czarnul, K. Tomko, and H. Krawczyk, "A Heuristic Dynamic Load Balancing Algorithm for Meshes," accepted for presentation at PDCS'2001, (Anaheim, CA, USA), 2001.