Parallel Block LU Factorization – An Experience in Stateless Parallel Processing

Weijun He (weijunhe@temple.edu)
Yuan Shi (shi@temple.edu, faculty supervisor)
CIS Department, Temple University, Philadelphia, PA 19122

Abstract. This paper presents an experience in parallelizing a Block LU factorization algorithm using a Stateless Parallel Processing (SPP) framework. Compared to message-passing systems such as PVM/MPI, SPP allows a simpler dataflow computation model that enables highly effective implicit parallel processing. This brings a range of benefits that are not possible in direct message-passing systems: programming ease, automated fault tolerance, and load balancing with easy granularity control. The SPP architecture also promises high performance, featuring automatic formation of SIMD, MIMD and pipelined processor clusters at runtime. This paper uses an LU factorization algorithm to illustrate the SPP programming, debugging and execution methods. We use timing models to evaluate and justify parallel performance. Timing models are also used to project the scalability of the application in various processing environments in order to establish relevance. The reported experiments were conducted using Synergy V3.0 on a cluster of 4 Linux machines running RedHat 9.

1 Introduction

Difficulties in parallel programming are rooted in two areas: a) inter-process communication and synchronization; and b) performance optimization. For multiprocessor systems without a shared memory (such as clusters of COTS processors), the communication and synchronization programming difficulties increase dramatically. Identifying the "best" processing granularity for optimized performance is another challenge: large performance losses result whether the granularity is too large or too small. Unfortunately, in direct message-passing systems the processing granularity is interwoven with the task design and cannot be easily adjusted. In addition to programming difficulties, multiprocessor systems also face a serious availability challenge: higher processor counts adversely impact application availability, since any single processor crash can halt the entire parallel application.

Stateless Parallel Processing (SPP) concerns a processor design and a parallel programming discipline that facilitate automated fault tolerance and load balancing. In particular, SPP eliminates the need for a single uniform memory space while allowing automatic program/data partitioning with high performance. An SPP system can automatically form SIMD, MIMD and pipelined processors at runtime, and it is free from the difficult cache coherence problems. SPP is therefore ideally suited for using multiple COTS processors.

Implementation of an SPP system requires a data access model using tuple space [4]. The tuple space data access model is in theory identical to the dataflow model. It allows implicit parallel processing and thus simplifies programming. There have been numerous efforts exploiting the tuple space programming paradigm and its implementation techniques. In this paper, we focus on a server-centric implementation (Synergy) that offers resilience to multiple worker faults but is known to add inter-process communication overhead when compared to direct message-passing systems. A full SPP implementation would eliminate this overhead and offer resilience to multiple processor and network failures.
In this paper, we use a Block LU factorization algorithm to illustrate the programming and performance tuning features of SPP. We use timing models to guide the program design and to predict the achievable timing results, use the measured timing results to validate the models, and finally examine the scalability of this application under various processing environments.

2 Stateless Parallel Processing (SPP)

For parallel programming, SPP requires a separation between functionality programming and performance programming (including fault tolerance). An SPP application consists of an interconnected network of stateless parallel programs. A stateless parallel program is any program that does not contain or generate persistent global state information. For example, a program that uses a specific IP address and port number to communicate with another program is NOT stateless. Similarly, a program that spawns programs on multiple processors is also NOT stateless, for it creates a persistent parent-child relationship. Typically, it is performance-programming practice that produces non-stateless code.

An SPP application contains only two types of programs: master and worker. A master is any program that distributes work assignments and assembles the results. A worker program reads a work assignment and returns the completed result. Using the tuple space data access model, each SPP implementation represents a complete dataflow computer. An SPP runtime system exploits the data dependencies implied in the parallel programs. It optimizes processing efficiency by minimizing communication overhead and by automatically forming SIMD, MIMD and pipelined processor clusters at runtime. The user can further optimize performance by tuning the processing granularity, leveraging the stateless coding style.

3 Synergy Parallel Programming and Processing System

Synergy V3.0 is a research parallel programming and processing system that has been in use for graduate and undergraduate computer science education at Temple University since 1990. It was also used successfully in a number of research projects at the IBM T.J. Watson Research Center, the Wharton School of the University of Pennsylvania, and the Center for Advanced Computing and Communications at Temple University.

Synergy uses a server-centric tuple space mechanism [5] for inter-parallel-program communication. Unlike Linda's compiler-based implementation [4], Synergy's tuple space is implemented by servers created at runtime. Synergy programs can be written in any sequential programming language and compiled with a link to the Synergy runtime library. A Synergy parallel application consists of a set of SPP programs and a configuration specification that links the programs into a logical parallel application. The Synergy runtime system interprets the configuration specification, activates the specified components and provides functions for monitoring and control. It contains debugging support using concurrent activation of multiple gdb debuggers. It also contains automatic fault tolerance supporting multiple worker processor faults. Currently, Synergy V3.0 supports RedHat Linux 9 and SunOS 5.9.

4 Synergy Application Programming Interface (SAPI)

The Synergy runtime library contains the following calls:

tsd = cnf_open("TupleSpaceName") – opens a named tuple space object and returns a handle.
cnf_term() – terminates a running program.
cnf_tsput(tsd, TupleName, Buffer) – inserts a named tuple into an opened space object.
cnf_tsget(tsd, TupleName, &Buffer) – fetches a matching tuple and removes it from the given space object.
cnf_tsread(tsd, TupleName, &Buffer) – reads the contents of a matching tuple in the given space object.

There are three environment variable fetching calls:

cnf_gett() – gets the value defined by the "threshold=" clause in the application specification.
cnf_getf() – gets the value defined by the "factor=" clause in the application specification.
cnf_getP() – gets the number of processors.

These values facilitate grain size adjustments. Every Synergy program is separately compiled using a host language compiler and the Synergy runtime library, and must use SAPI in order to obtain parallel processing benefits.
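These grain-size calls are typically used by the master to decide how many work tuples to generate. The fragment below is a minimal sketch, assuming C as the host language and the call forms shown later in Section 7; the return types, the hypothetical task payload and the factor x P chunking heuristic are our own illustrative assumptions, not part of the Synergy documentation.

#include <stdio.h>

/* Assumed SAPI prototypes (see Sections 4 and 7). */
int  cnf_open(char *space_name, int size);
int  cnf_tsput(int tsd, char *tuple_name, char *buf, int len);
int  cnf_getf(void);     /* "factor=" value from the application specification */
int  cnf_getP(void);     /* number of processors in the current pool           */
void cnf_term(void);

#define TOTAL_WORK 1000                      /* total work units (illustrative) */

int main(void)
{
    int tsd    = cnf_open("problem", 0);     /* open the problem space          */
    int p      = cnf_getP();                 /* processors available            */
    int factor = cnf_getf();                 /* user-tunable granularity knob   */

    /* One common heuristic: create factor * p chunks so that each processor
     * receives several chunks, which smooths out load imbalance.              */
    int chunks = (factor > 0 ? factor : 1) * p;
    int chunk  = (TOTAL_WORK + chunks - 1) / chunks;

    char name[32];
    for (int i = 0; i < chunks; i++) {
        int start = i * chunk;               /* payload: [start, start + chunk) */
        snprintf(name, sizeof(name), "task%d", i);
        cnf_tsput(tsd, name, (char *)&start, sizeof(start));
    }
    /* (collecting results from the "result" space is omitted in this sketch)  */
    cnf_term();
    return 0;
}

Because the chunk count is read from the application specification rather than hard-coded, the granularity can be re-tuned without touching the stateless worker code.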
5 Synergy Parallel Application Configuration Specification (SPACS)

The user specifies a parallel application by linking multiple compiled parallel programs. The following is the specification for the Block LU factorization application (BLU.csl):

configuration: BLU;
m: master = BLUMaster (threshold = 1 debug = 0)
-> f: problem (type = TS)
-> m: worker = BLUWorker (type = slave)
-> f: result (type = TS)
-> m: master;

The space names ("problem" and "result") in this specification correspond to the cnf_open statements in the parallel programs. The "=" signs in the m-clauses equate the logical components to the physical binary file names in the $HOME/bin directory. This specification is processed by the Synergy runtime system to generate dynamic tuple space daemons and a customized application control console (DAC, Distributed Application Controller) responsible for monitoring and controlling the entire application.

6 Synergy Runtime Environment (SRE)

Synergy was designed to exploit resource sharing as opposed to cycle stealing: all users should share the same pool of resources in any way legally permitted. To this end, Synergy does not use SSH (Secure Shell) to activate user programs. Instead, every user runs a login-based command interpreter daemon (CID) to gain CPU access to a remote machine. These daemons inherit all legal restrictions of the login ID and thus avoid security-related concerns. A port mapping daemon (PMD) manages the dynamic ports for all users on a single host. Under NFS (Network File System), a single shared copy of the binaries (and link library) is sufficient for all cluster users. On multiple distributed clusters without a shared file system, multiple copies of the Synergy binaries and consistent path naming are necessary.

Each user maintains a host list as the resource pool. The state of this pool defines resource availability for the next parallel run. Activating a CID will automatically launch the PMD if it is not already running. Multiple activations of the PMD on the same node do not affect the correctness of operation. The following commands manage the host list states in SRE:

cds [-a] [-f] – Checks daemon states. The default is to check only active hosts. "-a" lists inactive hosts as well. "-f" saves the output to a file named .snghosts.
sds – Starts CIDs on all currently active hosts.
kds – Kills all daemons on currently active hosts.
addhost – Adds a host to the host list.
delhost – Deletes a host from the host list.
chosts – Chooses hosts to activate or deactivate.

Multiple cluster users can share the same processor pool at the same time. The following commands execute, debug and monitor a Synergy application:

prun [-debug] SCF_name [-ft] – Activates the parallel application named SCF_name.
The "-debug" option prepares a virtual dataflow computer ready for debugging using multiple gdb sessions. The "-ft" option activates worker processor fault tolerance; the default is no fault tolerance, in order to facilitate debugging.

pcheck – Monitors all applications started by the same login ID. It monitors at both the application level and the process level, and can also be used to terminate either a single process or an entire application.

7 Synergy Programming and Communication

Direct message-passing systems allow the best possible communication performance deliverable by the low-level networking systems [3]. However, for load balancing and fault tolerance, indirect communication is necessary, since the intended recipient may not be present or appropriate for work due to dynamic status changes. Synergy uses indirect communication in which each tuple transmission must go through three phases: sending, matching and receiving. The inevitable overhead can be offset by the load balancing, automatic SIMD/MIMD/pipeline formation and fault tolerance benefits [5].

Similar to message-passing systems, Synergy programs contain initialization, communication and termination library calls:

tsd1 = cnf_open("problem", 0);                // Initializing the "problem" space
tsd2 = cnf_open("result", 0);                 // Initializing the "result" space

strcpy(tpname, "A0");
cnf_tsput(tsd1, tpname, buffer, buf_size);    // Communication: put a tuple "A0"

strcpy(tpname, "B*");
len = cnf_tsread(tsd2, tpname, buffer2, 0);   // Communication: read a tuple "B*"
...
cnf_term();                                   // Termination

In a debugging session, a virtual dataflow computer is created by launching the appropriate daemons. This allows the user to step through the programs instruction by instruction using tools like gdb.

8 LU Factorization

Dense linear systems are commonly solved using LU factorization. Block methods in matrix computations are widely recognized as being able to achieve high performance on parallel computers [1]. In solving a dense linear system A.X = B, we first factorize A into L.U, where L is a lower triangular unit matrix and U is an upper triangular matrix. There are four major types of computations (the permutation matrix P is omitted for simplicity):

- LU factorization of the diagonal sub-matrices
- Solving lower triangular unit sub-systems of the form L.Z = subA
- Solving upper triangular sub-systems of the form W.U = subA
- Matrix products for updates of the form subA = subA - W.Z

Figure 1. The four task types of Block LU factorization within the matrix.

9 Parallel LU Factorization

The parallel LU factorization application contains one master and one worker program. The master maintains a task roadmap according to the sequential dependencies. The worker performs a given task (there are four types) and returns the corresponding result. The worker is activated on multiple hosts, including the host that runs the master. Assuming the total number of tasks is N, the parallel framework is as follows:

Master:
    Initialization: put the ready-to-compute tasks into the task pool
    For I = 1 to N
        Receive a completed task from a worker
        Refresh the task roadmap
        While a new task is ready-to-compute
            Put the task into the task pool

Workers (repeated until the termination tuple is received):
    Get any ready-to-compute task from the pool
    Do the computation
    Return the completed task to the master
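The worker side of this framework maps directly onto the SAPI calls of Sections 4 and 7. The following is a minimal sketch in C of such a worker loop (not the actual BLUWorker source): the wildcard tuple name, the buffer sizes, the compute_task() stub and the exact SAPI prototypes are illustrative assumptions based on the call forms shown in Section 7, and termination is detected by tuple length, as in the worker flowchart of Section 12.

#include <string.h>

/* Assumed SAPI prototypes, following the call forms shown in Section 7. */
int  cnf_open(char *space_name, int size);
int  cnf_tsget(int tsd, char *tuple_name, char *buf, int len);
int  cnf_tsput(int tsd, char *tuple_name, char *buf, int len);
void cnf_term(void);

#define MAX_TASK (1 << 20)                    /* illustrative maximum tuple size */

/* Stand-in for the real task handler (dispatch on msg.type: LU/LZ/WU/MatMul). */
static int compute_task(char *task, int task_len, char *result)
{
    memcpy(result, task, (size_t)task_len);   /* placeholder: echo the task back */
    return task_len;
}

int main(void)
{
    static char task[MAX_TASK], result[MAX_TASK];
    char tpname[64];

    int ps = cnf_open("problem", 0);          /* task pool    */
    int rs = cnf_open("result", 0);           /* result pool  */

    for (;;) {
        strcpy(tpname, "*");                            /* match any ready task       */
        int len = cnf_tsget(ps, tpname, task, 0);       /* fetch and remove the tuple */

        if (len == 4) {                                 /* termination tuple (Recv_length == 4) */
            cnf_tsput(ps, tpname, task, len);           /* put it back for the other workers    */
            break;
        }
        int rlen = compute_task(task, len, result);     /* do the work                          */
        cnf_tsput(rs, tpname, result, rlen);            /* return the completed task; the reply
                                                           naming scheme is an assumption       */
    }
    cnf_term();
    return 0;
}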
10 Optimization for the Parallel Block LU Algorithm

We denote the sub-blocks as submat[i][j][status], where status is O (Original), U (Updated) or F (Finished). The dependency graph of the sub-tasks for a 3 x 3 block factorization is shown in Figure 2 (for the first block row and column, submat[1][i][O] = submat[1][i][U] and submat[i][1][O] = submat[i][1][U]).

Figure 2. Dependency graph for Block LU factorization (3 x 3 blocks). The nodes are the block states [i][j][O/U/F]; the edges are the data dependencies among the LU, LZ, WU and update tasks.

Each task involves either a matrix-matrix product or an upper (lower) triangular solver. To optimize parallel execution time, the task distribution should minimize the waiting time between tasks. Figure 2 shows that this application does not lend itself easily to parallel processing: many paths are sequential in essence. However, it contains SIMD (multiple matrix-matrix products, or L or U solvers, can run in parallel), MIMD (matrix-matrix products, L solvers and U solvers can run at the same time) and pipelined (the wave-front dependency) processor forms.

11 Synergy Implementation

Roadmap Status Matrix

We use a K x K matrix to record the update history of the K x K blocks. Initially each cell is assigned the value 0. Each time the corresponding sub-matrix is updated, the value increases by 1. When the factorization work on cell (i,j) is done, value(i,j) = min(i,j) + 1. For a 3 x 3 grid, a serialized execution path through the dependency graph is the following (many of these steps can execute in parallel):

LU(0,0), LZ(0,1), LZ(0,2), WU(1,0), WU(2,0), Update(1,1), Update(1,2), Update(2,1), Update(2,2), LU(1,1), LZ(1,2), WU(2,1), Update(2,2), LU(2,2)

Figure 5. Typical serialization path for 3 x 3 blocks (the original figure shows the 3 x 3 status matrix after each step).

The task tuple and the corresponding result tuple use the following structures (M is the maximum block dimension):

struct msg {               /* task tuple */
    int type;
    int x;
    int y;
    int row;
    int col;
    float mat1[M][M];
    float mat2[M][M];
};

struct res_msg {           /* result tuple */
    int type;
    int x;
    int y;
    int row;
    int col;
    float res_mat[M][M];
};

Figure 6. Task and corresponding result structures

where
x, y: the coordinates at which the result sub-matrix begins;
row, col: the size of the result sub-matrix;
type = 0: LU factorization of the diagonal sub-matrix;
type = 1: solve the lower triangular unit sub-system L.Z = subA (mat1: input L; mat2: input subA);
type = 2: solve the upper triangular sub-system W.U = subA (mat1: input U; mat2: input subA);
type = 3: matrix-matrix product, result = mat1 * mat2;
type = 4: termination of all tasks.

12 Flow Charts

Putting all of this together, the following flowcharts show how the implementation minimizes synchronization. The roadmap matrix is PStatus.

Figure 7. Master flowchart. The master first solves LU(0,0), updates the status matrix, and puts LZ(0, 1:NBLOCKS-1) and WU(1:NBLOCKS-1, 0) into the problem space (PS). It then loops on the result space: receive a result tuple for block (i,j), update the corresponding sub-matrix and PStatus(i,j), and, depending on the result type, scan k = i+1 : NBLOCKS-1 and put every newly ready task into PS: a finished LU(i,i) releases the LZ(i,k) and WU(k,i) tasks whose blocks are fully updated, a finished LZ or WU releases the corresponding MatMul updates, and a MatMul result that brings PStatus(i,j) up to min(i,j) releases the next LU(i,i), LZ(i,j) or WU(i,j) on that block. When the last diagonal block (NBLOCKS-1, NBLOCKS-1) becomes ready, the master solves it directly, puts the termination tuple into PS, and exits.
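The roadmap logic of Figure 7 can be expressed compactly in C. The sketch below is our own reconstruction, not the BLUMaster source: the readiness tests are derived from the counting rule above (block (i,j) is finished when its status reaches min(i,j)+1) rather than copied from the flowchart, enqueue() stands in for building a struct msg and cnf_tsput-ing it into the problem space, and, unlike the paper's master, it enqueues LU(0,0) and the last diagonal factorization instead of solving them locally.

#include <stdio.h>

#define NBLOCKS 3

enum task_type { T_LU = 0, T_LZ = 1, T_WU = 2, T_MATMUL = 3 };

static int PStatus[NBLOCKS][NBLOCKS];            /* roadmap: operations completed per block   */
static int issued[NBLOCKS][NBLOCKS][NBLOCKS];    /* operations already put into the task pool */

static int min2(int a, int b) { return a < b ? a : b; }

/* Stand-in for building a struct msg and cnf_tsput-ing it into the problem space. */
static void enqueue(enum task_type t, int i, int j, int s)
{
    static const char *name[] = { "LU", "LZ", "WU", "MatMul" };
    printf("  put %s(%d,%d) step %d\n", name[t], i, j, s);
}

/* Scan the roadmap and issue every task that has just become ready. */
static void dispatch_ready(void)
{
    for (int i = 0; i < NBLOCKS; i++)
        for (int j = 0; j < NBLOCKS; j++) {
            int d = min2(i, j);
            /* step-s updates: block(i,j) -= W(i,s) * Z(s,j), for s < min(i,j) */
            for (int s = 0; s < d; s++)
                if (!issued[i][j][s] && PStatus[i][s] == s + 1 && PStatus[s][j] == s + 1) {
                    issued[i][j][s] = 1;
                    enqueue(T_MATMUL, i, j, s);
                }
            /* final operation on block (i,j), ready once all d updates are applied */
            if (issued[i][j][d] || PStatus[i][j] != d) continue;
            if (i == j)                               { issued[i][j][d] = 1; enqueue(T_LU, i, j, d); }
            else if (i < j && PStatus[i][i] == i + 1) { issued[i][j][d] = 1; enqueue(T_LZ, i, j, i); }
            else if (i > j && PStatus[j][j] == j + 1) { issued[i][j][d] = 1; enqueue(T_WU, i, j, j); }
        }
}

/* Master's reaction to a completed result tuple for block (i,j). */
static void on_result(const char *what, int i, int j)
{
    printf("completed %s(%d,%d)\n", what, i, j);
    PStatus[i][j]++;                                  /* finished block => min(i,j)+1 */
    dispatch_ready();
}

int main(void)
{
    dispatch_ready();                                 /* issues LU(0,0) */
    /* Replay the serialization path of Figure 5. */
    on_result("LU", 0, 0);
    on_result("LZ", 0, 1);      on_result("LZ", 0, 2);
    on_result("WU", 1, 0);      on_result("WU", 2, 0);
    on_result("Update", 1, 1);  on_result("Update", 1, 2);
    on_result("Update", 2, 1);  on_result("Update", 2, 2);
    on_result("LU", 1, 1);
    on_result("LZ", 1, 2);      on_result("WU", 2, 1);
    on_result("Update", 2, 2);
    on_result("LU", 2, 2);
    printf("done: PStatus(2,2) = %d = NBLOCKS\n", PStatus[2][2]);
    return 0;
}

Replaying the serialization path of Figure 5 through on_result() reproduces the release order implied by the dependency graph of Figure 2.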
Figure 8. Worker flowchart. Each worker repeatedly receives a tuple from the problem space. If the received tuple is the termination tuple (recognized by its length, Recv_length == 4), the worker puts it back for the other workers and exits; otherwise it performs the task indicated by the type field (solve LU(i,i), solve LZ(i,j), solve WU(i,j), or compute MatMul((i,k),(k,j))) and puts the result into the result space (RS).

The parallel system above contains two overlapping parts: computation and communication. With high-performance network switches, it is possible to carry out multiple communications concurrently.

Figure 9. Overlapping computation and communication. The master (holding the tuple spaces, the data and the roadmap) and the workers (Worker0 to Worker3) communicate through a network switch.

13 Performance Analysis

In this section, we use timing models to evaluate the potential of the designed parallel application. Timing models are a performance analysis methodology that uses holistic time complexity models and itemized instrumentation to reveal a parallel application's performance characteristics under different processing environments [2].

Let:
n: size of the matrix
m: number of blocks per dimension
k: block size (n = m.k)
w: CPU power (algorithmic steps per second)
u: network speed (bytes per second)
a: number of bytes in each word
p: number of processors

In our program design the synchronization is mostly eliminated and computation and communication overlap most of the time, so the parallel execution time is determined by the larger of the two components, T_comp and T_comm:

T_par ~ max(T_comp, T_comm), with T_comp = 2n^3 / (3wp)

and T_comm = A_comm * a / u, where

A_comm = sum_{q=1}^{m} [6(m-q) + 3(m-q)^2] k^2 = sum_{i=1}^{m-1} (6i + 3i^2) k^2 = m^3 k^2 + o(m^3 k^2) = m n^2 + o(m n^2).

Therefore T_comm = m n^2 a / u + o(m n^2 a / u), and

T_par ~ max( 2n^3 / (3wp), m n^2 a / u ) = max( 2n^3 / (3wp), n^3 a / (k u) ).

With T_seq = 2n^3 / (3w), the speedup is

Speedup = T_seq / T_par = 1 / max( 1/p, 3wa / (2ku) ) = min( p, 2ku / (3wa) ).

This formula shows that the speedup depends on p, w, k, u and a: the larger p, k and u, the better the speedup; the smaller w and a, the better the speedup. The best block size is k = 3pwa / (2u).

Suppose that w = 10^7, u = 10^6, k = 500, p = 4, m = 5 and a = 4. We can then vary each parameter separately to see its influence on the speedup.

Figure 10. Block LU factorization scalability analysis: predicted speedup as a function of network speed (u, x10^6), node computing power (w, x10^6), block size k, and number of processors p, varying one parameter at a time.

14 Experimental Results

Figure 11 shows the measured results.

matrix size  block size   seq time(s)  seq Mflops   #CPU   par time(s)  par Mflops   Speedup
     500        200           0.66       125.62       3        0.69       120.61       0.96
     800        300           3.75        91.14       3        2.62       130.10       1.43
    1200        400          15.03        76.65       3        7.63       150.94       1.97
    1500        300          32.21        69.85       3       12.24       183.80       2.63
    1600        400          33.02        82.71       3       15.07       181.25       2.19
    2000        500          72.70        73.36       3       34.95       152.61       2.08
    1000        250           6.00       111.20       4        3.24       205.96       1.85
    1000        300           7.91        84.30       4        4.77       139.86       1.66
    1200        300          13.39        86.03       4        5.89       195.66       2.27
    1600        400          33.02        82.71       4       13.73       198.94       2.41

Figure 11. Computation timing results (sequential vs. parallel runs)

We observe that the best speedup (2.63, about 184 MFLOPS) is achieved for N = 1500 using three processors. The timing model also indicates that if w improves substantially while the network speed stays the same, there will be no room for speedup.
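The timing model above is easy to evaluate numerically. The short C program below (our own illustration, not part of the original experiments) computes T_comp, T_comm, the predicted speedup min(p, 2ku/(3wa)) and the best block size 3pwa/(2u) for the parameter values used in Figure 10.

#include <stdio.h>

int main(void)
{
    /* Parameter values from Section 13 (Figure 10). */
    double w  = 1e7;    /* CPU power: algorithmic steps per second */
    double u  = 1e6;    /* network speed: bytes per second         */
    double a  = 4.0;    /* bytes per word                          */
    double p  = 4.0;    /* number of processors                    */
    double k  = 500.0;  /* block size                              */
    double m  = 5.0;    /* number of blocks per dimension          */
    double n  = m * k;  /* matrix size                             */

    double t_seq  = 2.0 * n * n * n / (3.0 * w);        /* sequential time   */
    double t_comp = t_seq / p;                          /* 2n^3 / (3wp)      */
    double t_comm = m * n * n * a / u;                  /* m n^2 a / u       */
    double t_par  = t_comp > t_comm ? t_comp : t_comm;  /* max of the two    */

    double speedup = t_seq / t_par;                     /* equals min(p, 2ku/(3wa)) */
    double best_k  = 3.0 * p * w * a / (2.0 * u);       /* balance point            */

    printf("T_seq  = %8.2f s\n", t_seq);
    printf("T_comp = %8.2f s   T_comm = %8.2f s   T_par = %8.2f s\n",
           t_comp, t_comm, t_par);
    printf("predicted speedup = %.2f   best block size k = %.0f\n",
           speedup, best_k);
    return 0;
}

For these parameters the model predicts that computation dominates (T_comp > T_comm), so the speedup saturates at p; shrinking k or u below the balance point makes communication the bottleneck instead.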
15 Conclusion

This report documents computational experiments conducted using Synergy V3.0 for a Block LU factorization algorithm. It shows that the SPP approach is easier to program than direct message-passing systems, and that positive speedups can be achieved over slow networks when the processor power is modest. Timing model analysis reveals that this positive result can disappear if BLAS-optimized library codes are used in the parallel core, because BLAS library codes improve the w measure roughly ten-fold.

References

1) Gene H. Golub and Charles F. Van Loan, Matrix Computations, 3rd Edition, The Johns Hopkins University Press.
2) Yuan Shi, "Parallel Program Scalability Analysis," International Conference on Parallel and Distributed Computing, October 1997, pp. 451-456.
3) The Message Passing Interface (MPI) Standard. http://www-unix.mcs.anl.gov/mpi/
4) S. Ahuja, N. Carriero and D. Gelernter, "Linda and friends," IEEE Computer, pp. 26-32, August 1986.
5) Yuan Shi, Getting Started in Synergy, online documentation: http://spartan.cis.temple.edu/shi/synergy.