ON SPARSE MATRIX VECTOR PRODUCT PERFORMANCE EVALUATION FOR EFFICIENT DISTRIBUTION ON LARGE SCALE SYSTEMS

Nahid EMAD(1), Olfa HAMDI-LARBI(2), and Zaher MAHJOUB(3)
(2,3) Faculty of Sciences of Tunis, CS Department, Campus Universitaire - 2092 Manar II - Tunis, TUNISIA
(1,2) University of Versailles, PRiSM Laboratory, 78035 Versailles Cedex, FRANCE
(1) Nahid.emad@prism.uvsq.fr, (2) ola_ola79@yahoo.fr, (3) Zaher.mahjoub@fst.rnu.tn

1. INTRODUCTION

Sparse computations correspond to algorithms processing sparse matrices, i.e. matrices involving very few nonzero elements [CLS98]. Several scientific applications use kernels performing computations on large sparse matrices, e.g. in molecular dynamics, fluid dynamics, image processing, telecommunications, etc. [Pot97]. The kernels used in these fields mainly belong to the domain of sparse linear algebra [Pot97]. As a matter of fact, the sparse matrix-vector product (SMVP), where the matrix is sparse and the vector is dense, constitutes a widely used kernel in most iterative methods for solving these problems. It has to be underlined that processing sparse matrices raises the important question of choosing an adequate storage format for them.

An efficient distribution of the SMVP over the nodes of large distributed systems, such as computing grids, first requires an evaluation of the performance of the kernel on each node separately. The work presented here addresses this important pre-distribution phase. More precisely, it deals with determining the sparse matrix storage format providing the best SMVP performance on such systems. Since a large scale distributed system is a set of heterogeneous and delocalized processing resources, we are interested here in studying SMVP optimization for resources having different characteristics. Several SMVP versions are studied, each corresponding to a different storage format for the sparse matrix. Indeed, we consider four SMVP versions related to four storage formats, namely SMVP-DNS (DeNSe format: 2-D array), SMVP-CSR (Compressed Storage Row format), SMVP-COO (COOrdinate format) and SMVP-BND (BaNDed format). For each format, three sequential target machines, having different processor characteristics, were used to carry out a series of experiments: an Intel dual-core PC, one processor of an AMD quad-core bi-processor, and a SUN dual-core workstation. This paper constitutes a generalized and deeply extended version of a previous work [EHM05] where we focused on SMVP optimization for the CSR format only.

In section 2, we give a brief presentation of sparse matrices. The four versions of the SMVP are discussed in section 3. Section 4 is devoted to the study of the optimization of the various SMVP versions. Experimental results are then presented in section 5.

2. SPARSE MATRICES

Let A be a large real square N×N matrix and NZ the number of its nonzero elements. If NZ is very small (resp. large), that is to say NZ = O(N) (resp. NZ = O(N²)), A is considered sparse (resp. dense). The density of a matrix corresponds to the proportion of its nonzero elements, i.e. NZ/N². Sparse matrices correspond to arrays for which the large number of zero elements obliges the user, for reasons of memory size and computing performance, to use a storage format for the nonzero elements only. These formats are called storage (or compressed) formats.
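To make the memory argument concrete, the following minimal C sketch (our own illustration, not code from the paper; the sizes are arbitrary and the layout of one value plus one column index per nonzero and N+1 row pointers anticipates the CSR format presented in section 3) compares the footprint of a dense 2-D array with that of a compressed representation keeping only the nonzero elements.

#include <stdio.h>

int main(void) {
    /* Illustrative sizes only: a sparse matrix with NZ = O(N) nonzeros. */
    long N  = 100000;
    long NZ = 5 * N;

    /* Dense storage: N*N reals. */
    double dense_bytes = (double)N * (double)N * sizeof(double);

    /* Compressed (CSR-like) storage: NZ reals + NZ column indices
       + (N+1) row pointers (layout assumed, see section 3). */
    double compressed_bytes = (double)NZ * (sizeof(double) + sizeof(int))
                            + (double)(N + 1) * sizeof(int);

    printf("density       = %.1e\n", (double)NZ / ((double)N * N));
    printf("dense storage = %.1f GB\n", dense_bytes / 1e9);
    printf("compressed    = %.4f GB\n", compressed_bytes / 1e9);
    return 0;
}

With these (arbitrary) values, the dense array would require about 80 GB whereas the compressed representation fits in a few megabytes, which is why formats such as those listed below are used in practice.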
As examples of storage formats, we may cite the CSR (Compressed Storage Row), CSC (Compressed Storage Column), MSR (Modified Storage Row), BND (BaNDed), DIA (DIAgonal) and JAD (JAgged Diagonal) formats, etc. [Saa94]. In this paper, we are interested in four storage formats, i.e. DNS, CSR, COO and BND. Indeed, the first three are general formats which may be used for any sparse matrix structure and are the most frequently encountered in the literature. We chose, in addition, the BND format, a specific one generally used for banded matrices. Indeed, several real applications handle such a matrix structure. Moreover, any randomly structured sparse matrix may be transformed into a banded matrix via the Cuthill-McKee algorithm [DER92]. Let us add that our work may easily be generalized to deal with two other existing storage formats.

3. SMVP ALGORITHM VERSIONS FOR DIFFERENT STORAGE FORMATS

The SMVP algorithm structure depends on the storage format used for the sparse matrix. We present below the four corresponding versions, denoted SMVP-DNS, SMVP-CSR, SMVP-COO and SMVP-BND (figures 1-4).

do i = 1, N                / computing b = Ax /
  do j = 1, N
    b[i] = b[i] + A[i][j]*x[j]
  enddo
enddo

Figure 1. SMVP-DNS

do i = 1, NZ
  b[IA[i]] = b[IA[i]] + A[i]*x[JA[i]]
enddo
/ A: nonzero elements of A, IA (resp. JA): row (resp. column) indices of the elements in A /

Figure 2. SMVP-COO

do i = 1, N
  do j = IA[i], IA[i+1]-1
    b[i] = b[i] + A[j]*x[JA[j]]
  enddo
enddo
/ A: nonzero elements of A, JA: column indices of the elements in A, IA: pointers to the beginning of each row in both A and JA /

Figure 3. SMVP-CSR

do i = 1, N
  do j = 1, ML+MU+1
    b[i] = b[i] + ABD[i][j]*x[min(N, max(0, i+j-ML))]
  enddo
enddo
/ ABD: (N, ML+MU+1) array with the elements of row i of A stored in row i of ABD, ML (resp. MU): lower (resp. upper) bandwidth /

Figure 4. SMVP-BND

4. SMVP OPTIMIZATION

In some frameworks, optimizing the SMVP requires the creation of a new, adapted storage format [PiH99]. Other optimizations may be specific to some matrix structures [PiH99][KGK08]. Several works propose to reorder the matrix (via row/column permutations) by using either the reverse Cuthill-McKee strategy or the Greedy method for block restructuring [DER92]. Other works depend on the target architecture [ImY01], [Vud02]. Contrary to previous works known in the literature, we apply to the SMVP high level optimization techniques that are independent of the matrix structure, the storage format, the data structure used for this format, and the target architecture. This is done in order to fairly optimize the various SMVP versions corresponding to the storage formats used. The final aim is to afterwards achieve a more precise and fair comparison between these versions.

For the SMVP optimization, we applied techniques we classified into two categories: (i) those which can be manually applied to the code, and (ii) those which can be selected as compilation options. Among the manual optimization techniques, we can mention scalar replacement, removal of invariant operations, loop unrolling, etc. [GBS94]. Concerning automatic optimizations, we used the UNIX C compiler, which offers several options allowing an automatic optimization of the source code, e.g. option -On, where 'n' is the level of the desired optimization (i.e. 0 for no optimization, 1 for local optimizations, etc.). We also used the compiler optimization options -funroll-loops (unrolling of loops whose iteration count is known) and -funroll-all-loops (unrolling of all loops) [Wol97].
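As an illustration of how these manual techniques combine, the following C sketch (our own hypothetical example, not code from the paper; it uses 0-based indexing, whereas figures 1-4 are written with 1-based indices, and the function name is ours) applies scalar replacement, hoisting of the invariant loop bound, and manual unrolling by a factor of 4 to the SMVP-CSR version of figure 3.

/* Hypothetical illustration of the manual optimizations applied to SMVP-CSR
   (0-based indexing): scalar replacement of b[i], hoisting of the invariant
   bound IA[i+1], and manual unrolling of the inner loop by a factor of 4. */
void smvp_csr_unrolled(int N, const double *A, const int *JA, const int *IA,
                       const double *x, double *b)
{
    for (int i = 0; i < N; i++) {
        double s   = b[i];            /* scalar replacement of b[i]           */
        int    end = IA[i + 1];       /* invariant loop bound hoisted         */
        int    j   = IA[i];
        for (; j + 3 < end; j += 4)   /* unrolled body: 4 nonzeros per pass   */
            s += A[j]     * x[JA[j]]
               + A[j + 1] * x[JA[j + 1]]
               + A[j + 2] * x[JA[j + 2]]
               + A[j + 3] * x[JA[j + 3]];
        for (; j < end; j++)          /* clean-up loop for remaining nonzeros */
            s += A[j] * x[JA[j]];
        b[i] = s;
    }
}

The unrolling factor (here 4) is the parameter varied in the experiments of section 5; as observed below, very large factors enlarge the loop body and may degrade performance.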
5. EXPERIMENTAL RESULTS

We implemented the optimized versions of the algorithms presented in figures 1-4 on three different processors: the first one (machine 1) is an Intel dual-core PC (1.8 GHz per core), the second one (machine 2) is a processor of an AMD quad-core bi-processor (1.9 GHz per core) and the last one (machine 3) is a dual-core SUN Sparc64 workstation (1.3 GHz per core). We measured the CPU time variations of the different SMVP versions in terms of the unrolling factor applied to the inner do loop. Machine compiler optimization options were also applied. We used sample matrices of different sizes and densities, either randomly generated or downloaded from the Matrix Market site [MaM07].

6. MAIN RESULTS AND CONCLUSION

In this paper, we presented sparse matrix-vector product (SMVP) algorithms for four storage formats, to which different optimization techniques were applied. Through a series of experiments carried out on three target machines, we noticed that, except for the particular cases of the CSR and COO formats on machine 3, the compiler optimizations -On always improve the SMVP performances. Scalar replacement generally improves the performances, except in the case of SMVP-DNS and SMVP-COO on machine 2 where, if options -On are not used, performances decrease. Options -funroll-loops and -funroll-all-loops have no impact on the SMVP performances. Manual unrolling improves the SMVP performances for the different versions on machine 3, and also on machines 1 and 2 when no compiler option is used; on these two latter machines, the improvements vanish when any option -On is used. Manual unrolling with a high factor (larger than 64) may induce a performance decrease, especially on machine 1.

To summarize, it is worthwhile to use scalar replacement on the different machines, and this holds for all the SMVP versions. On machines 1 and 2, it is better to also use one of the compiler options -On for all the SMVP versions; in this case, manual unrolling is needless. Finally, in the particular cases of SMVP-CSR and SMVP-COO on machine 3, manual unrolling may improve the performances whereas the options -On degrade them. We can finally say that the results obtained so far show that the SMVP performance optimization criteria vary from one machine to another and thus have to be fitted to the nodes involved in large scale distributed systems. The effective SMVP distribution phase constitutes the subject of another paper.

ABRIDGED LIST OF REFERENCES

[CLS98] B. L. Chamberlain, E. C. Lewis & L. Snyder, A region-based approach for sparse parallel computing, UW CSE Technical Report UW-CSE-98-11-01, 1998.
[DER92] I. S. Duff, A. M. Erisman & J. K. Reid, Direct methods for sparse matrices, Oxford Science Publications, 1992.
[EHM05] N. Emad, O. Hamdi & Z. Mahjoub, On sparse matrix-vector product optimization, ACS/IEEE International Conference on Computer Systems and Applications (AICCSA '05), Cairo, Egypt, 2005.
[GBS94] S. L. Graham, D. F. Bacon & O. J. Sharp, Compiler transformations for high-performance computing, Technical Report, University of California, 1994.
[ImY01] E. Im & K. A. Yelick, Optimizing sparse matrix-vector multiplication for register reuse, International Conference on Computational Science, San Francisco, California, May 2001.
[KGK08] K. Kourtis, G. Goumas & N. Koziris, Optimizing sparse matrix-vector multiplication using index and value compression, Conference on Computing Frontiers, Ischia, Italy, 2008.
[MaM07] Matrix Market, http://math.nist.gov/MatrixMarket/, 2007.
[PiH99] A. Pinar & M. T.
Heath, Improving performance of sparse matrix-vector multiplication, ACM/IEEE Supercomputing 1999 Conference (SC '99), Portland, 1999.
[Pot97] W. M. Pottenger, Theory, techniques and experiments in solving recurrences in computer programs, Ph.D. Thesis, University of Illinois, 1997.
[Saa94] Y. Saad, SPARSKIT: a basic tool kit for sparse matrix computations, http://www.cs.umn.edu/Research/arpa/SPARSKIT/paper.ps, 1994.
[Vud02] R. Vuduc, J. Demmel, K. A. Yelick, S. Kamil, R. Nishtala & B. C. Lee, Performance optimizations and bounds for sparse matrix-vector multiply, ACM/IEEE Supercomputing 2002 Conference (SC '02), Baltimore, 2002.
[Wol97] S. Wolf, UNIX, vi and X-Windows Guide, Joint Institute for Computational Science, 1997.