PRP - Parallel Recursive Procedures

Arne Maus, Dept. of Informatics, University of Oslo, email: arnem@ifi.uio.no
Torfinn Aas, Norges Bank, P.O. Box 1179, 0107 Oslo, email: torfinna@ifi.uio.no

October 17, 1995

Abstract

It is believed that writing parallel programs is hard. Therefore a paradigm for writing parallel programs that is not much harder than writing a sequential program is in demand. This paper describes a general method for parallelizing programs based on the idea of using recursive procedures as the grain of parallelism (PRP). This paradigm only adds a few new constructs to an ordinary sequential program with recursive procedures and makes the underlying hardware topology, and in most cases also the number of independent processing nodes, transparent to the programmer. The implementation reported in this paper uses workstations and servers connected to an Ethernet. The efficiency achieved by the PRP concept must be said to be good, with more than 50% processor utilization on the two problems reported in this paper. Other implementations of the PRP paradigm are also reported.

Index Terms: Parallel programming, recursive procedures, programming paradigms, distributed computing.

1 Introduction

Writing parallel programs adds levels of complexity on top of solving the same problem on a sequential uniprocessor. This stems mainly from:

1. The need for learning and using additional language constructs - i.e. learning and using a programming paradigm for writing parallel programs.

2. The problem of identifying and breaking up the original problem into 'independent' subtasks to make it executable on more than one processor. Maybe the most difficult problem is the communication and synchronization between these subtasks.

3. Often the parallel solution has to take strongly into account the topology of the underlying hardware, i.e. the number of processors and their interconnect.

4. Parallel machines are few and far between. Hence the operating system support and programming development tools are definitely not as good as those found on more conventional platforms.

As a consequence of these and other problems, developing solutions on parallel machines is more error-prone, costs more and takes longer. Since, for an end-user, the only thing that distinguishes these solutions from sequential solutions is their speed, their late and costly development often makes them compete badly with less expensive and more reliably developed solutions using newer and faster generations of single-CPU computers.

Improvements are needed in all these problem areas. No standard solution to the 'parallel programming problem' has emerged, and new and more developer-friendly paradigms for writing such programs are needed. This paper presents a new parallel programming paradigm that tries to simplify the programmer's task for the first three areas for some fairly large classes of problems. This is a report from a master thesis [Aas 94] at the University of Oslo and partly from follow-up projects.

2 Parallelizing of programs

The efficient development of parallel programs has been attacked along many lines, and good overviews can be found in standard textbooks such as [Hennessy and Patterson 94] and [Foster 94] and in survey papers like [Andrews 91] and [Carriero 89].
We only note here that the lines of attack have been diverse:

- To build special-purpose machines, like vector processors, data-flow machines, hypercubes, and more recently symmetric multiprocessors (SMP), and then make programming paradigms and tools to suit these machines.

- To start from the process concept with message passing and/or synchronizing primitives like semaphores and monitors from operating system theory, and often include possibilities for such parallelism in programming languages like Ada and Occam.

- To make the process almost automatic by making minor changes to existing programs, typically in FORTRAN. This is often done by calling special libraries for performing parallelized matrix operations or by changing the user program by parallelizing loops etc. This direction, which definitely has the highest number of users, is now standardizing a new HPF (High Performance FORTRAN) [HPFF 94].

- To make a completely new programming paradigm like LINDA [Carriero 89b].

- To hide the underlying hardware: here we see standard libraries like PVM [Geist et al. 94] and MPI [Pacheco 95] used to get a clear split between program and implementation - a hardware abstraction layer. Existing programs can then more easily be ported to new computers and new topologies.

The idea presented in this paper uses elements from the last four approaches.

[Figure 1: The PRP distribution of recursive procedure instances to the set of available processors. Fig. 1(a): the recursion tree from the sequential algorithm. Fig. 1(b): the recursion trees when using the PRP concept; some procedure instances are performed in their own process with PRP, the rest are always performed as ordinary procedures.]

3 The PRP concept and related work

The idea of using recursive procedures as the unit for parallelizing was put forward in [Maus 78], but not implemented before 1994 in a master thesis [Aas 94]. The idea is basically that we have a number of processors connected by some kind of communication channel and a sequential recursive solution that dynamically generates a number of procedure instances. In a PRP system, each time a new procedure instance is generated, if there is some idle processor in the system, the procedure instance turns into a process that migrates to that idle processor. If all available processors are busy, the procedure instance is created as an ordinary procedure instance on the same processor where the recursive procedure call was made and is executed with the same efficiency as an ordinary recursive procedure.

The above picture sets out the general idea. How we administer the program and the pool of idle processors might, as we will see later, slightly modify this picture. This is illustrated in fig. 1. In fig. 1a we see an ordinary sequential recursion tree performed on a single processor. In fig. 1b we assume, as an illustration, 5 interconnected processors, and see how the top 5 procedure instances are each executed in their own process on one of the 5 processors and, most importantly, how the recursion tree and thereby the total work is split. We note that the more time-consuming activities (process generation and communication on the net) are only performed for a limited number of procedure instances at the top of the recursion tree. The rest of the recursive calls are made at full speed in each processor.
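The essence of this dispatch decision can be pictured with a small stand-alone sketch. This is a conceptual illustration only, not a description of the actual PRP runtime system: the function names, the fixed pool size and the printout below are all invented for illustration.

#include <stdio.h>

/* Conceptual sketch of the decision made for each recursive call in PRP.
   All names here are invented for this sketch and are not part of the
   PRP system described in this paper. */

static int free_machines = 4;                    /* stand-in for the pool of idle machines */

static void start_as_process(int instance)      /* stand-in for migrating the instance    */
{
    free_machines--;
    printf("instance %d starts as a process on an idle machine\n", instance);
}

static void run_as_ordinary_call(int instance)  /* stand-in for a plain recursive call    */
{
    printf("instance %d runs as an ordinary procedure call\n", instance);
}

static void dispatch(int instance)
{
    if (free_machines > 0)
        start_as_process(instance);        /* idle processor available: migrate       */
    else
        run_as_ordinary_call(instance);    /* all busy: full-speed local execution    */
}

int main(void)
{
    int i;
    for (i = 0; i < 6; i++)                /* the first 4 instances migrate,           */
        dispatch(i);                       /* the remaining 2 run locally              */
    return 0;
}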
The idea of using recursive procedures as the unit of parallelism for imperative programming languages has been found in one other paper [Horowitz and Alessandro 83]. This is a theoretical paper with no implementation. Their concept also differs from the PRP concept in that all the recursive procedure instances are transformed into processes if the subproblem to be solved by a procedure instance is judged not to be 'small' (by some user-defined test). As described above, in the PRP concept there exists only one recursive procedure instance transformed into a process in each processor.

Parallel recursive procedures in functional languages have been described by Hudak [Hudak 86], [Hudak 91]. Both papers present the 'Para-Functional Programming' style. This concept differs from PRP in many respects. First, the programmer has to explicitly name the processors that the various functions shall be performed upon, either by absolute number or by relative address (left, right, up, down, etc.) each time a procedure migrates as a process. In this Para-Functional Programming style, the parallel execution of functions also relies heavily on the fact that no side-effects occur in functional languages. Both the ParAlfl and Haskell languages presented in these papers have been implemented, but no performance figures are presented.

3.1 'Breadth-first' versus 'depth-first'

Ordinary recursive procedure calls progress from left to right and depth-first. That is not always the case in the PRP concept. In order to start the parallelism in PRP, all procedure instances called from one procedure that are transformed into processes on new processors start logically at the same time. That is, part of the recursion tree must be called and started breadth-first. In fig. 1, procedure 'a' must call and start 'b', 'c' and 'd' in parallel, breadth-first. Then 'b' in turn must start 'e' and then perform the rest of its recursion depth-first. We see a mixture of first 'breadth-first' (the processes) and afterwards 'depth-first' (the rest of the recursive instances) traversal of the recursion tree in the PRP concept. The returns from these 'breadth-first' procedure calls come in no particular order to their caller process.

Since the PRP concept can hide the actual number of processors available, the programmer does not necessarily know which procedure is performed 'breadth-first' and which is done 'depth-first'. The consequence for the programmer is that she or he cannot use the return value from one recursive call when determining the parameters for the next recursive call on that level - i.e. when 'a' calls 'c', the return value from the call to 'b' has with high probability not yet been received by 'a' and hence cannot be used in any way when 'a' calls 'c' (or 'd', ...). Only after waiting for the return of these processes are their return values available. For some (most?) algorithms this arbitrary shift from 'breadth-first' to 'depth-first' does not matter; for some it does. PRP is only suitable for the first class of problems.

3.2 Fan-out and the administration of processors

The fanout for a recursive procedure is how many times it calls itself on the next recursion level. For some problems that might be fixed (i.e. Quicksort has a fanout of 2). Other problems have a variable fanout that is determined by the problem (i.e. when traversing a graph, a recursive procedure will typically call itself for each edge going out from a node). Other problems again can be suited to a predetermined fanout.
One example is a set of 13 machines where we want to use PRP to find the prime numbers among, say, the first 24 million integers. The parameters to the recursive calls to the 'findprimenumber' procedure can then be set such that the first machine searches for prime numbers among the first 2 million numbers, the second machine between 2 and 4 million, ... etc. We have then partitioned our problem into the same number of subproblems (minus one, the root) as we have machines available. We do a full fanout on the problem from the root and on the second recursion level solve the resulting subproblems, maybe more efficiently with iteration. We can sometimes speed up the calculations if we signal the fanout to our PRP runtime system before making the recursive calls. That has to do with effective load balancing (see next paragraph).

Even though we often want to hide the number of processors available and their topology, we must have a policy for the administration of these machines. The policy chosen in this implementation is that the top procedure instance owns all the machines. When the PRP runtime system is told the fanout before the recursive calls are made (by setting the system variable prp_fanout), all the free machines owned by the calling machine are divided as equally as possible among the called machines. If a called procedure in another machine then receives the ownership of more than one machine (itself included), this is repeated on the next recursion level (as with 'b' calling 'e' and itself in fig. 1) until all free machines are allocated to a procedure instance transformed into a process.

We said in the general description of the PRP concept that a procedure instance migrates to an idle processor. What actually happens is that the program is duplicated in all the participating machines before startup. When a procedure instance migrates to an idle processor, a process in the idle processor that waits for a message with the parameters to the procedure instance is called. This process in turn calls the recursive procedure with these parameters.

3.3 Return values

We noted in subsection 3.1 that the procedures do not necessarily return in the same order as they are called. The return values are stored upon return in a system array prp_return[]. When a procedure calls itself recursively, then when the i'th such call returns, the return values are placed in prp_return[i]. (The implementation reported in this paper only supports the return of a single integer value; later implementations have lifted this restriction.)

A special primitive must be introduced for waiting for the return of the 'breadth-first' procedure calls, prp_wait. There are three conditions that an algorithm can specify as parameter to prp_wait: WAIT_ALL, WAIT_FIRST or 'Condition'. 'Condition' is a boolean expression that might involve the values of the results from the returning procedures (typically waiting for some return value to exceed a certain value). The other wait parameters are straightforward: WAIT_ALL blocks until all procedures that have been called from this procedure have returned, and WAIT_FIRST blocks the calling procedure until the first called procedure returns. All this waiting is a dummy operation when the procedure calls are performed as ordinary recursive procedures 'depth-first'.
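As an illustration of these constructs, the prime-number partitioning described in subsection 3.2 might be sketched roughly as follows. This is only a sketch under the assumptions above, not code from the PRP system: the procedure name find_primes, the sequential helper count_primes_in_range() and the small-problem cutoff are invented for illustration, and since the implementation reported here returns a single integer, the sketch counts the primes rather than listing them.

/* Sketch: count the primes in [from, to) by dividing the range among
   the free machines. count_primes_in_range() is a hypothetical,
   ordinary sequential helper. */
int prp_proc find_primes(int from, int to)
{
    int i, parts, len, sum = 0;

    parts = free_servers();                 /* one subproblem per free machine      */
    if (parts < 2 || to - from < 1000)      /* small problem or no free machines:   */
        return count_primes_in_range(from, to);   /* solve it directly (iteration) */

    prp_fanout = parts;                     /* tell the runtime the fanout          */
    len = (to - from) / parts;
    for (i = 0; i < parts; i++)             /* breadth-first: start all subcalls    */
        prp_call find_primes(from + i*len,
                             (i == parts - 1) ? to : from + (i + 1)*len);

    prp_wait(WAIT_ALL);                     /* block until every subcall returns    */
    for (i = 0; i < parts; i++)
        sum += prp_return[i];               /* combine the partial counts           */
    return sum;
}

Note that the calls are made without using the return value from any earlier call on the same level, as required by the 'breadth-first' start described in subsection 3.1.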
3.4 Additional concepts and reserved words

When programming using the PRP concept, the programmer starts with an ordinary recursive algorithm. In the main part of the program she or he initiates the PRP system by calling prp_init(), which takes the names of the available machines as parameters. The programmer then calls the procedure in question with the prefix prp_call, i.e.

result = prp_call hamiltonian(...);

Finally the programmer shuts down the system by calling prp_terminate(). This disconnects and frees the resources used in the communication with the machines indicated in the prp_init() call.

The recursive procedure also has to undergo some changes before it can be used. First, the procedure must be prefixed with prp_proc. (There can be only one prp_proc procedure in the current implementation.) Then the procedure must decide how many calls it is going to make in parallel. Often this will be determined by the algorithm or the data at hand. If the fanout can be set more or less as we want, it is often best to make as many parallel calls as possible, that is, one call for each available machine (minus one - the root machine). The system call free_servers() can be used to find how many available machines there are at this point. The system variable prp_fanout is then set to the desired fanout. It is important to set this variable, because the run-time system will use this value when it divides the ownership of the free machines among the calls - otherwise only one processor is handed out for each call.

Finally, after the calls have been made, the program has to wait for the calls to return. As mentioned earlier, the program can wait for the first call to return, it can wait for all the calls to return, or it can wait for a call which makes a condition true. In the first and last cases, the system variable prp_returned can be used to find which calls have returned. The return values can be found in the system variable prp_return. The algorithm can then use these values in whatever calculation and finally return a value to the caller. In case several processors return before a condition is satisfied, the table prp_returned can be checked to find out who has returned.

4 An example program - the Hamiltonian Path in a Graph

The problem solved in this example is to determine whether or not there exists a Hamiltonian Path in a graph G - that is, a path from node to node along the edges, visiting each node once and only once. This is an NP-complete problem. The graph is represented with an integer array G, where G[i][j] = 1 if node i has an edge to node j, and 0 otherwise. As we can see in the program, an extra iteration at the start of the recursive procedure 'hamiltonian' is used to count the number of not yet investigated ('unseen') neighbour nodes. This is the fanout for this procedure instance, and the system variable prp_fanout is set accordingly. Then each 'unseen' neighbouring node is called recursively, and the highest return value, i.e. the longest possible distance from the current node through its 'unseen' neighbours, is picked up and a distance one longer than that is returned.

The following program example was run on a graph containing 12 nodes, resulting in a recursion tree with 191,989 procedure instances, and on a graph containing 14 nodes with more than 21 million procedure instances. The program text demonstrates the additions made to an ordinary recursive solution, transforming it into a PRP solution.
/* hp.prp   PRP program to determine if a Hamiltonian Path exists in the graph G */

int G[NODES][NODES];    /* G[i][j] == 1 if edge from node i to node j in G, == 0 otherwise */

int prp_proc hamiltonian(int node, int seen[])
{
    int i, call_nr = 0, max_num = 0, fo = 0;

    seen[node] = TRUE;

    /* count the 'unseen' neighbours - the fanout of this instance */
    for (i = 0; i < NODES; i++)
        if (G[node][i] == 1 && seen[i] == FALSE) fo++;
    prp_fanout = fo;

    /* call each 'unseen' neighbour - in parallel when free machines exist */
    for (i = 0; i < NODES; i++)
        if (G[node][i] == 1 && seen[i] == FALSE) {
            prp_call hamiltonian(i, seen);
            call_nr++;
        }
    prp_wait(WAIT_ALL);

    /* pick the longest path through any 'unseen' neighbour */
    for (i = call_nr - 1; i >= 0; i--)
        if (prp_return[i] > max_num) max_num = prp_return[i];

    seen[node] = FALSE;
    return (max_num + 1);
}

int main()
{
    int n, number;
    int seen[NODES];

    /* Init. table */
    for (n = 0; n < NODES; n++) seen[n] = FALSE;

    prp_init(4, "machineX", "machineY", "machineZ", "machineW");
    number = prp_call hamiltonian(0, seen);
    prp_terminate();

    if (number == NODES) printf("Graph G has a Hamiltonian Path");
}

5 Test results

Figure 2 and figure 3 report the results of the test runs. The timings could not be made accurate because they were performed on machines with a varying load due to other users logged on to these machines. We see a better than 50% CPU utilization, a speedup factor proportional to the number of processors, and, roughly speaking, a loss of less than 40% of the theoretical peak performance. The slight drop in efficiency observed from 9 to 11 machines in these figures can be explained by the addition of 2 less powerful machines (Sparc 10) to the pool of available processors (Sparc 20).

[Figure 2: Execution times (seconds of CPU execution time, 14-node graph) as a function of the number of machines used running the Hamiltonian Path program, compared with the ideal solution.]

[Figure 3: Mean CPU utilization (percent) as a function of the number of machines used when solving the 12-node and the 14-node Hamiltonian Path problems.]

These results must be judged very satisfactory, since Hennessy and Patterson [Hennessy and Patterson 94] report from 67% down to 1% of the theoretical peak performance (with a mean of approximately 20%) on the 7 reported test runs on five different parallel machines.

6 Strengths and weaknesses of PRP

Because the responsibility for communicating and scheduling the workload is distributed, there is no reason to get a drop in processor efficiency as additional processors are added. While other implementations we have studied often show a clearly decreasing processor efficiency when adding new processors, this could not be observed in the PRP implementations (at least not within the limited number of machines we had available).

6.1 Global variables and pruning of the search tree

PRP does not support global variables. This of course is a disadvantage, because some algorithms (i.e. the Traveling Salesman problem) use global variables for pruning the search tree when searching for an optimal solution. The reason for not having global variables is the loosely coupled, distributed architecture PRP can be implemented on, with each processor having a copy of the program with all its variables. Alternatively, using RPC calls to read and set global variables is thought to be too inefficient, and a common resource responsible for the globals would soon become a bottleneck. A problem with the implementation of PRP today is that it can lead the programmer to believe that she or he can have global variables - i.e. the array 'G' in the program example.
These variables, however, will become separate instances for each processing node (i.e. one for each subtree in fig. 1); thus an assignment to the 'global' variable in one processor would not influence the 'same' variable in some other processor! Each processing node would have its own local 'global' variable. The effect of many copies of global variables, if they are used for optimization (but not for the correctness of the algorithm!), is an open research question. Our guess is that in most cases the effect of each subtree in each processor having its own local optimum in a local 'global' variable, and then later combining these optima upon return into a true global optimum in the root, would be almost as effective as having one shared global variable throughout the whole computation. This reduces message passing traffic to a minimum. In an Ethernet implementation of PRP, this would in real time be a much faster solution. But again, this is an open research question we are now starting to look into.

6.2 Implementation, the pre-processor and the PRP runtime system

The process of transforming the sequential, recursive solution into a PRP solution is illustrated in fig. 4. As seen from this figure and the program example, the target and implementation language is 'C'. The user starts with a single-processor, recursive solution ('appl.c'). The user then manually makes the changes necessary to include prp functionality (prp_call, prp_wait, ...). The resulting file ('appl.prp') is then sent to the PRP preprocessor which, without any user interference, does the following:

1. The reserved prp words and the prp recursive procedure are recognized. Two programs are constructed: an 'admin.c' for running with user I/O on the machine that starts the execution (the root), and a 'server.c' for the other machines. The prp procedure is also made in two versions in these two programs, one for acting as a process and one for acting as an ordinary recursive procedure stripped of all its prp functionality. The parameters to the recursive procedure are also recognized, and a parameter struct is made for packing/unpacking and sending the packet of parameters by RPC calls on the Ethernet.

2. The two programs 'admin.c' and 'server.c' are sent through the C compiler and linked with the prp runtime system, producing the two executable files 'admin' and 'server'.

[Figure 4: From ordinary sequential recursive solution to PRP solution. The user changes the original one-processor solution 'appl.c' into 'appl.prp' and starts the PRP preprocessor; the preprocessor makes the two files 'admin.c' and 'server.c', links in the runtime system, and the C compiler generates the two executable files 'admin' and 'server'.]

Finally, the user starts the parallel computation by running 'admin'. As an initiation task, 'admin' copies 'server' to all the other machines participating in this computation. The 'server' programs then start as processes on these machines, waiting for an RPC call. 'Admin' finally starts the recursive computation as specified by the user.

7 Further work

The PRP concept has been implemented on the Intel Paragon machine and on a four-CPU SMP machine (SGI Challenge) with good performance results. Preliminary test runs show a better than 50% CPU utilization on the Traveling Salesman problem, where only local copies in each processing node, as described above, of the shortest-distance-so-far have been used for pruning the computation.
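A fragment of such a solution might look roughly as follows. This is a sketch under the assumptions of sections 6.1 and 7, not the actual program used in the test runs: the distance matrix D, the variable best_so_far and the parameter bookkeeping are invented for illustration. Being an ordinary 'global' variable, best_so_far becomes a separate copy on each processing node and therefore only prunes the subtree handled by that node; the subtree optima are combined into the true optimum upon return towards the root.

/* Sketch: branch-and-bound step for the Traveling Salesman problem.
   D[i][j] is a distance matrix; INT_MAX is from <limits.h>.
   best_so_far exists as a separate copy on every processing node
   and prunes only that node's subtree. */
int D[NODES][NODES];
int best_so_far = INT_MAX;                   /* shortest complete tour seen locally */

int prp_proc tsp(int city, int dist, int visited[], int n_visited)
{
    int i, fo = 0, calls = 0, best = INT_MAX;

    if (dist >= best_so_far) return INT_MAX;      /* prune against the local bound  */
    if (n_visited == NODES) {                     /* complete tour: return home     */
        if (dist + D[city][0] < best_so_far) best_so_far = dist + D[city][0];
        return dist + D[city][0];
    }

    visited[city] = TRUE;
    for (i = 0; i < NODES; i++)                   /* fanout = number of unvisited cities */
        if (visited[i] == FALSE) fo++;
    prp_fanout = fo;

    for (i = 0; i < NODES; i++)
        if (visited[i] == FALSE) {
            prp_call tsp(i, dist + D[city][i], visited, n_visited + 1);
            calls++;
        }
    prp_wait(WAIT_ALL);

    for (i = 0; i < calls; i++)                   /* combine the subtree optima     */
        if (prp_return[i] < best) best = prp_return[i];
    if (best < best_so_far) best_so_far = best;

    visited[city] = FALSE;
    return best;
}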
Two new master of science theses on refinements to the original solution have been started, one on porting to a heterogeneous net of computers and the other on porting PRP on top of the PVM library.

8 Conclusions

A paradigm, PRP, for writing parallel programs using recursive procedures in an imperative programming language has been introduced, and an implementation and some test results have been reported. Since PRP is an addition to a well known programming paradigm, we claim that it is relatively easy to use, mainly due to the lack of need for difficult synchronization. We also claim that it scales well with the number of processing nodes available, that it hides the underlying topology of the processor interconnect from the user, and that it is easy to achieve a satisfying speedup compared with other parallel programming paradigms.

PRP is however not the solution to every parallel programming problem. We first demand that the problem has an effective recursive solution with a fanout greater than one, and also that such a solution does not rely on a strict depth-first order of the sequential recursive procedure calls. The use of global variables is also not supported directly, but as indicated in this paper, copies of the global variables in each machine might in some cases be almost as good as true global variables. To summarize, we believe that it is easy to formulate effective parallel solutions in the PRP paradigm to a wide class of problems such as search problems, graph problems and, among these, also the class of NP problems.

9 Acknowledgment

We would like to thank Stein Jørgen Ryan, Yan Xu and Arne Høstmark for their work on the PRP concept and for their valuable comments on earlier drafts of this paper.

References

[Carriero 89] Carriero N. and Gelernter D. How to write Parallel Programs: A Guide to the Perplexed. ACM Computing Surveys, Vol. 21, No. 3, Sept. 1989, pages 323-357.

[Carriero 89b] Carriero N. and Gelernter D. Linda in Context. Communications of the ACM, Vol. 32, April 1989.

[Foster 94] Foster Ian T. Designing and Building Parallel Programs. Addison-Wesley, Reading, Mass., 1994.

[Geist et al. 94] Geist A., Beguelin A., Dongarra J., Jiang W., Manchek R. and Sunderam V. PVM: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.

[Hennessy and Patterson 94] Hennessy J.L. and Patterson D.A. Computer Organization and Design, p. 639-640. Morgan Kaufmann Publishers, San Mateo, Calif., 1994.

[HPFF 94] High Performance Fortran Forum. High Performance Fortran Language Specification, ver. 1.1. Rice University, Houston, Texas, November 10, 1994.

[Horowitz and Alessandro 83] Horowitz E. and Alessandro Z. Divide-and-Conquer for Parallel Processing. IEEE Transactions on Computers, Vol. C-32, No. 6, June 1983, p. 582-585.

[Hudak 86] Hudak P. Para-Functional Programming. IEEE Computer, August 1986, p. 60-70.

[Hudak 91] Hudak P. Para-Functional Programming in Haskell. In Szymanski B. K. (ed.), Parallel Functional Languages and Compilers. ACM Press, New York, 1991.

[Maus 78] Maus A. Språkprimitiver og en logisk trestrukturert multiprosessor - Idenotat nr. 1 og 2 (in Norwegian). Norsk Regnesentral, 1978.

[Pacheco 95] Pacheco P. S. Programming Parallel Processors Using MPI. Morgan Kaufmann, San Francisco, CA (to appear), 1995.

[Aas 94] Aas T. Parallelle Rekursive Prosedyrer. Master Thesis (in Norwegian), Dept. of Informatics, Univ. of Oslo, 1994.