Super Linear Speedup for Parallel Givens Algorithm on a Network of Workstations (NoW)

M. J. Zemerly, K.M. S. Abdulla and M.H. Shariff
Department of Computer Engineering
Etisalat College of Engineering
PO Box: 980, Sharjah
UNITED ARAB EMIRATES
http://www.ece.ac.ae

Abstract: - This paper describes a parallel direct linear solver algorithm based on Givens rotations, run on a network of workstations (12 Sun Ultra 10 machines). PVM (Parallel Virtual Machine) was used as the communication platform. The parallel algorithm uses data decomposition and pipeline techniques. Results obtained for large dense matrices (sizes 1024 and 2048) showed superlinear speedup, and the superlinear speedup increases as the size of the matrix increases. This was attributed to the cache effect, as the algorithm itself is not well balanced and has a sequential part.

Key-Words: - Cluster computing, PVM, parallel Givens linear solver, superlinear speedup.

1. Introduction

Direct linear solvers for dense matrices have been used in many applications. These include computer simulation, electronic circuit analysis, structural analysis, chemical engineering and economic modelling, to name but a few. Most applications involve solving an extremely large number of linear equations, which can take a considerable amount of time even on the most powerful sequential computer available to date. One way to reduce the time required to solve these equations is to use parallel processing, where the calculations are divided between the processors. Since each processor now has less work to do, the task can be finished more quickly. This is similar to a supermarket having more than one checkout counter: each cashier has fewer customers to serve, so the checkout time is much shorter.

Recently, beyond their low cost and availability, the power of clusters of workstations has become evident as advances in chip technology have dramatically improved the speed of a single processor.
Communication libraries such as PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) facilitate the use of clusters of workstations.

The selection of the parallelisation technique is very important and is the first step in the parallelisation process. As a result of this step, the data space and the algorithm are partitioned. Although partitioning reduces the execution time, it introduces communication and synchronization overheads [1,8]. Because of these overheads and the possibility of a sequential part in the parallel algorithm, it is very difficult to obtain good speedups from parallel algorithms. The computation time can be reduced by selecting an appropriate parallelisation strategy which yields good load balancing. An efficient technique for parallelisation is functional pipelining, which partitions the control flow into stages that share the same data flow. This is most effective if all the stages perform a similar amount of computation, so that all stages of the pipeline are kept busy.

In this paper a linear solver based on Givens rotations is selected and parallelised using a normal row data decomposition and a pipeline technique which was previously used on a transputer-based machine. The results obtained from the parallel algorithm show a superlinear speedup effect for large matrix sizes.

2. Direct Givens linear solver

Linear solvers are essential in many applications in fields such as engineering, design and simulation. They belong to the computationally intensive class of algorithms, and the problem they solve can be written in the form $Ax = b$, where $A$ is a nonsingular square matrix, $b$ is the right-hand-side vector and $x$ is the vector of unknowns. The Givens transformation is defined by a 2×2 rotation matrix:

$$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}$$

where $c^2 + s^2 = 1$. A Givens rotation is used to eliminate the elements of a vector or a matrix as follows:

$$\begin{pmatrix} c & s \\ -s & c \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix}$$

where $c = a/\sqrt{a^2 + b^2}$ and $s = b/\sqrt{a^2 + b^2}$.

Givens does not require pivoting, which is difficult to parallelise, and is known to be numerically stable and inherently more accurate than other methods such as Gaussian elimination [1,5,6]. Givens does, however, require more computation than Gaussian elimination, with complexities of $(4/3)N^3 + O(N^2)$ and $(1/3)N^3 + O(N^2)$ respectively.

The Givens algorithm for solving a linear system with $N$ equations can be decomposed into two computational stages. The first is the triangulation of the initial matrix; this stage is represented by the execution of the elimination block $N$ times. The second stage is the back-substitution block, which solves the triangular system. The complexity of back-substitution is $O(N^2)$, compared with $O(N^3)$ for the triangulation phase. In this paper we concentrate on parallelising only the triangulation stage, described in Section 3, since back-substitution is far less computationally demanding.

3. Parallel Givens method using pipeline

Initially, block-row data decomposition is used, which divides the matrix horizontally and assigns adjacent blocks of rows to neighbouring processors. First, all processors eliminate the required columns for their own rows (except the first row on each processor) in parallel. The processor holding the row for the column to be eliminated then passes that row to its neighbour, which uses it to eliminate its own column; the row is then passed on in pipeline fashion through the remaining neighbours until all the columns are eliminated (see Figure 1). The last processor keeps the rows in a full matrix, which it later uses on its own in the sequential back-substitution stage.

Figure 1: Data decomposition and pipelining for parallel Givens

As can be seen, the parallel algorithm is not well load balanced: the last processor works the hardest and the first processor works the least, since it finishes execution and becomes idle as soon as it has eliminated its $N/p$ columns.
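For reference, the elimination step that each processor applies to its rows, and the back-substitution left to the last processor, can be sketched sequentially. The following is a minimal single-process Python sketch of the Givens solver from Section 2, not the PVM implementation used in this paper; the function names `givens` and `givens_solve` and the augmented-matrix layout are assumptions made for illustration.

```python
import math


def givens(a, b):
    """Return (c, s) with c*c + s*s = 1 such that
    [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        # Nothing to eliminate: use the identity rotation.
        return 1.0, 0.0
    return a / r, b / r


def givens_solve(A, b):
    """Solve A x = b for a nonsingular square A (list of rows).

    Stage 1: triangulation by Givens rotations (no pivoting).
    Stage 2: back-substitution on the upper-triangular system.
    """
    n = len(A)
    # Work on an augmented copy [A | b] so the inputs are left untouched.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]

    # Triangulation: zero every entry below the diagonal, column by column.
    for j in range(n - 1):
        for i in range(j + 1, n):
            if M[i][j] == 0.0:
                continue  # already eliminated
            c, s = givens(M[j][j], M[i][j])
            # Rotate rows j and i; columns left of j are already zero.
            for k in range(j, n + 1):
                top, low = M[j][k], M[i][k]
                M[j][k] = c * top + s * low
                M[i][k] = -s * top + c * low

    # Back-substitution (assumes A is nonsingular, so M[i][i] != 0).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / M[i][i]
    return x
```

As a usage example, `givens_solve([[0, 1], [1, 0]], [2, 3])` yields x = (3, 2) even though the leading diagonal entry is zero, which illustrates why no pivoting is needed. In the parallel version, the inner rotation loop is the work that is distributed over the row blocks.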
Note that sending in PVM is asynchronous and receiving is synchronous. Hence, after finishing the elimination of the first column, the last processor has to wait for the first row to propagate through all the processors; this is the time required to fill the pipeline. After that, all the workers operate in pipeline fashion until they finish one after the other and become idle, leaving the last processor working on its own to eliminate the remaining columns. The last processor also finds the solution vector using back-substitution.

Figures 2 and 3 show screen dumps from XPVM of the space-time and utilization graphs of the algorithm for 6 processors and a matrix size of 512. It can be seen from the figures that the load balancing is not good and has a stepping effect. This may be overcome by distributing more matrix rows to the first processors and giving fewer rows to the last processor.

Figure 2: XPVM space-time graph of the algorithm

Figure 3: XPVM utilization graph of the algorithm

4. Results

The results presented here are based on three performance measures: execution time, speedup and efficiency.

4.1 Execution time

The results have been obtained for two large matrix sizes (1024 and 2048). Table 1 shows the execution times (the minimum of 10 runs was taken) for the two matrix sizes, and Figure 4 shows the results in Table 1 graphically.

4.2 Speedup

The speedups ($S_p = T_s/T_p$, where $T_s$ is the sequential time and $T_p$ is the parallel time on $p$ processors) obtained for both matrix sizes and for up to 12 processors are shown in Table 2. Figure 5 shows the speedups graphically.
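The speedup definition above, together with the efficiency $E_p = S_p/p$, can be evaluated directly from the measured execution times reported in Table 1. The following Python sketch does this; the helper names `metrics` and `superlinear_counts` are illustrative assumptions, and the times are copied from Table 1.

```python
# Execution times in seconds from Table 1; list index 0 is 1 processor,
# index 11 is 12 processors.
TIMES = {
    1024: [110.53, 34.03, 25.62, 21.3, 18.89, 17.21,
           16.09, 15.08, 14.41, 13.94, 14.19, 13.83],
    2048: [902.98, 262.3, 192.47, 154.95, 131.94, 116.42,
           104.83, 96.48, 88.7, 83.74, 78.88, 74.74],
}


def metrics(times):
    """Return a list of (p, speedup, efficiency) tuples.

    Speedup is S_p = T_s / T_p, with T_s taken as the 1-processor time,
    and efficiency is E_p = S_p / p.
    """
    t_seq = times[0]
    rows = []
    for p, t_par in enumerate(times, start=1):
        s_p = t_seq / t_par
        rows.append((p, s_p, s_p / p))
    return rows


def superlinear_counts(times):
    """Processor counts whose efficiency exceeds 1 (superlinear speedup)."""
    return [p for p, _, e_p in metrics(times) if e_p > 1.0]


if __name__ == "__main__":
    for size, times in TIMES.items():
        print(f"matrix size {size}")
        for p, s_p, e_p in metrics(times):
            print(f"  p={p:2d}  speedup={s_p:6.2f}  efficiency={e_p:5.2f}")
        print("  superlinear for p =", superlinear_counts(times))
```

Rounded to two decimal places, the computed values correspond to the entries of Tables 2 and 3.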
Table 1: Execution times (secs)

| Processors | 1024 | 2048 |
|---|---|---|
| 1 | 110.53 | 902.98 |
| 2 | 34.03 | 262.3 |
| 3 | 25.62 | 192.47 |
| 4 | 21.3 | 154.95 |
| 5 | 18.89 | 131.94 |
| 6 | 17.21 | 116.42 |
| 7 | 16.09 | 104.83 |
| 8 | 15.08 | 96.48 |
| 9 | 14.41 | 88.7 |
| 10 | 13.94 | 83.74 |
| 11 | 14.19 | 78.88 |
| 12 | 13.83 | 74.74 |

Figure 4: Execution times representation (execution time in secs against No. of processors, for matrix sizes 1024 and 2048)

Table 2: Speedup results

| Processors | 1024 | 2048 |
|---|---|---|
| 2 | 3.25 | 3.44 |
| 3 | 4.31 | 4.69 |
| 4 | 5.19 | 5.83 |
| 5 | 5.85 | 6.84 |
| 6 | 6.42 | 7.76 |
| 7 | 6.87 | 8.61 |
| 8 | 7.33 | 9.36 |
| 9 | 7.67 | 10.18 |
| 10 | 7.93 | 10.78 |
| 11 | 7.79 | 11.45 |
| 12 | 7.99 | 12.08 |

Figure 5: Speedups graphic representation (speedup against No. of processors, for matrix sizes 1024 and 2048)

4.3 Efficiency

The efficiency ($E_p = S_p/p$) for both matrix sizes is given in Table 3 and shown graphically in Figure 6. As can be seen from this figure, superlinear speedup occurs at all the data points where the efficiency is greater than 1. According to Table 3, that is for matrix size 1024 up to 6 processors, and for matrix size 2048 at every processor count measured (up to 12). It can also be seen that as the number of processors increases, the effect of the communication and synchronization overheads increases and the performance of the parallel algorithm decreases.

Table 3: Efficiency results

| Processors | 1024 | 2048 |
|---|---|---|
| 2 | 1.62 | 1.72 |
| 3 | 1.44 | 1.56 |
| 4 | 1.30 | 1.46 |
| 5 | 1.17 | 1.37 |
| 6 | 1.07 | 1.29 |
| 7 | 0.98 | 1.23 |
| 8 | 0.92 | 1.17 |
| 9 | 0.85 | 1.13 |
| 10 | 0.79 | 1.08 |
| 11 | 0.71 | 1.04 |
| 12 | 0.67 | 1.01 |

Figure 6: Efficiency graph (efficiency against No. of processors, for matrix sizes 1024 and 2048)

5. Discussion

Two effects can be clearly observed in the results of the parallel Givens algorithm presented above. The first is that, for any given number of processors applied to the problem, the speedup increases with the size of the problem. For smaller problems the processors spend a large proportion of the time communicating with other processors in order to share relevant information.
However, as the problem size gets larger, the communication time (although it also increases with the problem size) becomes relatively less significant compared to the time required to do the computation. Speedup is then no longer bound by communication but by computation, and begins to approach the ideal linear speedup (a speedup factor equal to the number of processors used).

The second effect is that speedup varies with the number of processors for a given problem size. As the number of processors increases, so does the speedup, but beyond a certain number of processors the speedup saturates and does not increase further (as can be seen from Figure 5 for matrix size 1024; it is less clear for size 2048). This remains true irrespective of the number of extra processors that are applied to the problem.

It can be seen from the speedup and efficiency figures that both matrix sizes exhibit superlinear speedup (a speedup factor greater than the number of processors used). The cache effect is a common cause of superlinear speedups. The cache is used to accelerate access to frequently used data. When the processor needs some data, it first checks whether it is available in the cache. This can avoid having to retrieve the data from a slower source such as the primary memory (RAM). In a computer with 2 processors, the amount of cache is doubled because each processor includes its own cache. This allows a larger amount of data to be quickly available, and the speed increases by more than expected. Note that the superlinear effect occurs despite the fact that, as can be seen from the utilization graph (Figure 3), the algorithm is not well balanced and has a stepping effect. The improvement in the cache hit ratio for the parallel algorithm had a greater impact on performance than the communication overheads and the load imbalance.

6. Conclusion

A pipelined parallel direct linear solver algorithm based on Givens rotations was presented, together with results showing superlinear speedup on a network of 12 Sun Ultra-10 workstations (333 MHz, 512 Mbytes RAM). The superlinear speedup is attributed to the cache effect: the combined cache capacity of all the processors is used rather than a single cache (as on a sequential processor), and this affects the number of cache misses and hits that occur during processing. The parallel algorithm itself is not well balanced, although the pipeline effect hides the communication and synchronization overheads in the system once the pipeline is full. The load balancing of the algorithm may be improved by increasing the load of the first processor and reducing the load of the last one in the pipeline. However, when the number of processors is low (2 or 3), care must be taken not to overload the first processor and degrade the performance of its cache.

References

[1] I. S. Duff, “Direct Methods”, Technical Report, TR/PA/98/28, 1998. Available from: http://www.cerfacs.fr/algor/reports/Dissertations/TR_PA_98_28.ps.gz, accessed 1/12/2003.
[2] A. Geist et al., “PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing”, MIT Press, 1994.
[3] A. Geist, J. Kohl and P. Papadopoulos, “Visualization, Debugging, and Performance in PVM”, Proc. Visualization and Debugging Workshop, October 1994.
[4] W. Gropp et al., “Using MPI: Portable Parallel Programming with the Message Passing Interface”, 2nd Edition, Scientific and Engineering Computing Series, 1999.
[5] J. Papay, M.J. Zemerly and G.R. Nudd, “Pipelining the Givens Linear Solver on Distributed Memory Machine”, Supercomputer Journal, Vol. XII, no. 3, December 1996, pp. 37-43.
[6] J.W. Givens, “Computation of plane unitary rotations transforming a general matrix to a triangular form”, Journal of Soc. Ind. Appl. Math., Vol. 6, 1958, pp. 26-50.
[7] http://www.eli.sdsu.edu/courses/spring96/cs662/notes/speedup/speedup.html, accessed 1/12/2003.
[8] G.M. Amdahl, “Validity of single-processor approach to achieving large-scale computing capability”, Proceedings of AFIPS Conference, Reston, VA, 1967, pp. 483-485.