
ON SPARSE MATRIX VECTOR PRODUCT
PERFORMANCE EVALUATION FOR EFFICIENT
DISTRIBUTION ON LARGE SCALE SYSTEMS
Nahid EMAD(1), Olfa HAMDI-LARBI(2), and Zaher MAHJOUB(3)
(2,3)Faculty of Sciences of Tunis, CS Department
Campus Universitaire – 2092 Manar II - Tunis TUNISIA
(1,2)University of Versailles, PRiSM Laboratory
78035 Versailles Cedex FRANCE
(1)Nahid.emad@prism.uvsq.fr, (2)ola_ola79@yahoo.fr, (3)Zaher.mahjoub@fst.rnu.tn
1. INTRODUCTION
Sparse computations correspond to algorithms processing sparse matrices, i.e. matrices involving very few nonzero elements [CLS98]. Several scientific applications use kernels performing computations on large sparse matrices, e.g. in molecular dynamics, fluid dynamics, image processing, telecommunications, etc. [Pot97]. The kernels used in these fields mainly belong to the domain of sparse linear algebra [Pot97]. As a matter of fact, the sparse matrix-vector product (SMVP), where the matrix is sparse and the vector is dense, constitutes a heavily used kernel in most iterative methods for solving such problems. It has to be underlined that processing sparse matrices raises the important question of choosing an adequate storage format for them. An efficient distribution of the SMVP over the nodes of a large distributed system such as a computing grid first requires evaluating the performance of the kernel on each node separately. The work presented here addresses this important pre-distribution phase. More precisely, it deals with determining the sparse matrix storage format that yields the best SMVP performance on such systems. A large scale distributed system being a set of heterogeneous and delocalized processing resources, we are interested here in studying SMVP optimization on resources having different characteristics. Several SMVP versions are studied, each corresponding to a different storage format for the sparse matrix. We consider four SMVP versions related to four storage formats, namely SMVP-DNS (DeNSe format: 2-D array), SMVP-CSR (Compressed Storage Row format), SMVP-COO (COOrdinate format) and SMVP-BND (BaNDed format). For each format, a series of experiments was carried out on three sequential target machines having different processor characteristics: an Intel dual-core PC, one processor of an AMD quad-core bi-processor machine, and a SUN dual-core workstation.
This paper constitutes a generalized and deeply extended version of a previous work [EHM05] that focused on SMVP optimization for the CSR format only. In section 2, we give a brief presentation of sparse matrices. The four versions of the SMVP are discussed in section 3. Section 4 is devoted to the optimization of the various SMVP versions. Experimental results are then presented in section 5.
2. SPARSE MATRICES
Let A be a large real square N×N matrix and NZ the number of its nonzero elements. If NZ is very small (resp. large), that is to say NZ = O(N) (resp. O(N²)), A is considered sparse (resp. dense). The density of a matrix corresponds to the proportion of its nonzero elements, i.e. NZ/N². Sparse matrices correspond to arrays for which the large number of zero elements obliges the user, for reasons of memory size and computing performance, to use a storage format that keeps only the nonzero elements. These formats are called storage (or compressed) formats. As examples of storage formats, we may cite the CSR (Compressed Storage Row), CSC (Compressed Storage Column), MSR (Modified Storage Row), BND (BaNDed), DIA (DIAgonal) and JAD (JAgged Diagonal) formats [Saa94]. In this paper, we are interested in four storage formats, i.e. DNS, CSR, COO and BND. The first three are general formats which may be used for any sparse matrix structure and are the most frequently encountered in the literature. We chose, in addition, the BND format, a specific one generally used for banded matrices. Indeed, several real applications handle such a matrix structure. Moreover, any randomly structured sparse matrix may be transformed into a banded matrix via the Cuthill-McKee algorithm [DER92]. Let us add that our work may easily be generalized to deal with other existing storage formats.
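To make the formats concrete, the following C sketch (the 4×4 example matrix and the array names are ours, given for illustration only) shows the same small matrix stored in both CSR and COO form, with 0-based indices as is usual in C:

#include <stdio.h>

int main(void) {
    /* A = | 5 0 0 1 |
           | 0 2 0 0 |
           | 0 0 3 0 |
           | 4 0 0 6 |   N = 4, NZ = 6 nonzero elements */

    /* CSR: values, column indices, row pointers */
    double A_csr[]  = {5.0, 1.0, 2.0, 3.0, 4.0, 6.0};
    int    JA_csr[] = {0, 3, 1, 2, 0, 3};
    int    IA_csr[] = {0, 2, 3, 4, 6};   /* row i occupies [IA[i], IA[i+1]) */

    /* COO: values plus explicit (row, column) index pairs */
    double A_coo[]  = {5.0, 1.0, 2.0, 3.0, 4.0, 6.0};
    int    IA_coo[] = {0, 0, 1, 2, 3, 3};
    int    JA_coo[] = {0, 3, 1, 2, 0, 3};

    /* density = NZ/N² */
    printf("density = %d/%d = %.3f\n", 6, 16, 6.0 / 16.0);
    return 0;
}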
3. SMVP ALGORITHM VERSIONS FOR DIFFERENT STORAGE FORMATS
The SMVP algorithm structure depends on the storage format used for the sparse matrix. We present below the four corresponding versions, denoted SMVP-DNS (figure 1), SMVP-COO (figure 2), SMVP-CSR (figure 3) and SMVP-BND (figure 4).
do i = 1, N                        /* computing b = Ax */
  do j = 1, N
    b[i] = b[i] + A[i][j]*x[j]
  enddo
enddo
Figure 1. SMVP-DNS
do i = 1, NZ
  b[IA[i]] = b[IA[i]] + A[i]*x[JA[i]]
  /* A: nonzero elements of A, IA (resp. JA): row
     (resp. column) indices of the elements in A */
enddo
Figure 2. SMVP-COO

do i = 1, N
  do j = IA[i], IA[i+1] - 1
    b[i] = b[i] + A[j]*x[JA[j]]
    /* A: nonzero elements of A, JA: column indices of the
       elements in A, IA: pointers to the beginning of
       each row in both A and JA */
  enddo
enddo
Figure 3. SMVP-CSR
do i = 1, N
  do j = 1, ML+MU+1
    b[i] = b[i] + ABD[i][j]*x[min(N, max(1, i+j-ML-1))]
    /* ABD: N x (ML+MU+1) array with the elements of row i
       of A stored in row i of ABD, ML (resp. MU): lower
       (resp. upper) bandwidth; entries of ABD outside
       the band are assumed to be zero */
  enddo
enddo
Figure 4. SMVP-BND
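As a complement to the pseudocode above, the following is a possible C translation of the SMVP-CSR kernel of figure 3 (0-based indices; the 4×4 test data reuse the illustrative example of section 2 and are not taken from our experiments). The other three versions translate in the same way.

#include <stdio.h>

/* b = A*x, with A given in CSR form (N rows, arrays as in figure 3) */
void smvp_csr(int N, const double *A, const int *JA,
              const int *IA, const double *x, double *b) {
    for (int i = 0; i < N; i++) {
        b[i] = 0.0;
        for (int j = IA[i]; j < IA[i + 1]; j++)
            b[i] = b[i] + A[j] * x[JA[j]];
    }
}

int main(void) {
    double A[]  = {5.0, 1.0, 2.0, 3.0, 4.0, 6.0};
    int    JA[] = {0, 3, 1, 2, 0, 3};
    int    IA[] = {0, 2, 3, 4, 6};
    double x[]  = {1.0, 1.0, 1.0, 1.0};
    double b[4];

    smvp_csr(4, A, JA, IA, x, b);
    for (int i = 0; i < 4; i++)
        printf("b[%d] = %g\n", i, b[i]);   /* expected: 6 2 3 10 */
    return 0;
}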
4. SMVP OPTIMIZATION
In some frameworks, optimizing the SMVP requires the creation of a new, adapted storage format [PiH99]. Other optimizations may be specific to some matrix structures [PiH99][KGK08]. Several works propose to reorder the matrix (via row/column permutations) using either the reverse Cuthill-McKee strategy or the Greedy method for block restructuring [DER92]. Other works depend on the target architecture [ImY01], [Vud02].
In contrast to previous works known in the literature, we apply to the SMVP specific high level optimization techniques that are independent of the matrix structure, the storage format, the data structure used for this format, and the target architecture. This is done in order to fairly optimize the various SMVP versions corresponding to the storage formats used. The final aim is to achieve a more precise and fair comparison between these versions.
For the SMVP optimization, we applied techniques we classified into two categories: (i) those which can be manually applied to the code, and (ii) those which can be selected as compilation options. Among the manual optimization techniques, we can mention scalar replacement, removal of loop-invariant operations, loop unrolling, etc. [GBS94]. Concerning automatic optimizations, we used a UNIX C compiler offering several options for automatic optimization of the source code, e.g. option -On, where 'n' is the desired optimization level (0 for no optimization, 1 for local optimizations, etc.). We also used the compiler optimization options -funroll-loops and -funroll-all-loops (the latter applying unrolling to all loops) [Wol97].
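To fix ideas, the following sketch shows the two manual techniques applied together to the inner loop of SMVP-CSR: scalar replacement (accumulating into a local variable instead of b[i]) and manual unrolling. The unrolling factor 4 is purely illustrative, since section 5 varies this factor.

/* SMVP-CSR with scalar replacement and inner loop unrolled by 4 */
void smvp_csr_opt(int N, const double *A, const int *JA,
                  const int *IA, const double *x, double *b) {
    for (int i = 0; i < N; i++) {
        const int lo = IA[i], hi = IA[i + 1];
        double s = 0.0;                /* scalar replacement of b[i] */
        int j = lo;
        for (; j + 3 < hi; j += 4)     /* unrolled body */
            s += A[j]     * x[JA[j]]
               + A[j + 1] * x[JA[j + 1]]
               + A[j + 2] * x[JA[j + 2]]
               + A[j + 3] * x[JA[j + 3]];
        for (; j < hi; j++)            /* remainder loop */
            s += A[j] * x[JA[j]];
        b[i] = s;
    }
}

With a GCC-style compiler, the automatic counterparts are requested on the command line, e.g. gcc -O2 -funroll-loops smvp.c.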
5. EXPERIMENTAL RESULTS
We implemented the optimized versions of the algorithms presented in figures 1-4 on three different processors. The first one (machine 1) is an Intel dual-core PC (1.8 GHz per core), the second (machine 2) is one processor of an AMD quad-core bi-processor machine (1.9 GHz per core), and the last one (machine 3) is a dual-core SUN Sparc64 workstation (1.3 GHz per core).
We measured the CPU time of the different SMVP versions as a function of the unrolling factor applied to the inner do loop. Machine compiler optimization options were also applied. We used sample matrices of different sizes and densities, either randomly generated or downloaded from the Matrix Market site [MaM07].
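As an indication of how such measurements can be carried out (the harness below is a minimal sketch of ours, not the exact code used for the experiments), CPU time may be averaged over repeated kernel executions via the standard C clock() function:

#include <time.h>

#define REPS 100

/* kernel from the section 3 sketch */
void smvp_csr(int N, const double *A, const int *JA,
              const int *IA, const double *x, double *b);

/* mean CPU time in seconds for one SMVP */
double cpu_time_smvp(int N, const double *A, const int *JA,
                     const int *IA, const double *x, double *b) {
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        smvp_csr(N, A, JA, IA, x, b);
    return (double)(clock() - t0) / CLOCKS_PER_SEC / REPS;
}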
6. MAIN RESULTS AND CONCLUSION
In this paper, we presented sparse matrix-vector product (SMVP) algorithms for four storage formats, to which different optimization techniques were applied. Through a series of experiments carried out on three target machines, we noticed that, except for the particular cases of the CSR and COO formats on machine 3, the compiler optimizations -On always improve SMVP performance. Scalar replacement generally improves performance, except in the case of SMVP-DNS and SMVP-COO on machine 2 where, if options -On are not used, performance decreases. Options -funroll-loops and -funroll-all-loops have no impact on SMVP performance. Manual unrolling improves SMVP performance for the different versions on machine 3, and also on machines 1 and 2 when no compiler option is used. On the latter two machines, the improvements vanish when any option -On is used. Manual unrolling with a high factor (larger than 64) may induce a performance decrease, especially on machine 1.
To summarize, it is worthwhile to use scalar replacement on the different machines, and this holds for all the SMVP versions. On machines 1 and 2, it is better to use, in addition, one of the compiler options -On for all the SMVP versions; in this case, manual unrolling is unnecessary. Finally, in the particular cases of SMVP-CSR and SMVP-COO on machine 3, manual unrolling may improve performance whereas options -On degrade it. We can finally say that the results obtained so far show that the SMVP performance optimization criteria vary from one machine to another and thus have to be fitted to the nodes involved in large scale distributed systems. The effective SMVP distribution phase constitutes the subject of another paper.
ABRIDGED LIST OF REFERENCES
[CLS98] B. L. Chamberlain, E. C. Lewis & L. Snyder, A Region-based Approach for Sparse Parallel Computing, UW CSE Technical Report UW-CSE-98-11-01, 1998.
[DER92] I. S. Duff, A. M. Erisman & J. K. Reid, Direct methods for sparse matrices, Oxford Science Publications, 1992.
[EHM05] N. Emad, O. Hamdi & Z. Mahjoub, On Sparse matrix-vector product optimization, ACS/IEEE International Conference on
Computer Systems and Applications (AICCSA ’05), Cairo, Egypt, 2005.
[GBS94] S. L. Graham, D. F. Bacon & O. J. Sharp, Compiler Transformations for High-Performance Computing, Technical Report, University of California, 1994.
[ImY01] E. Im & K. A. Yelick, Optimizing sparse matrix-vector multiplication for register reuse, International Conference on
Computational Science, San Francisco, California, May 2001.
[KGK08] K. Kourtis, G. Goumas & N. Koziris, Optimizing sparse matrix-vector multiplication using index and value compression,
Conference on Computing Frontiers, Ischia, Italy, 2008.
[MaM07] http://math.nist.gov/MatrixMarket/, 2007.
[PiH99] A. Pinar & M. T. Heath, Improving performance of sparse matrix-vector multiplication, ACM/IEEE Supercomputing 1999
Conference (SC '99), Portland, 1999.
[Pot97] W. M. Pottenger, Theory, techniques and experiments in solving recurrences in computer programs, Ph.D Thesis, University
of Illinois, 1997.
[Saa94] Y. Saad, SPARSKIT: a basic tool kit for sparse matrix computations,
http://www.cs.umn.edu/Research/arpa/SPARSKIT/paper.ps, 1994.
[Vud02] R. Vuduc, J. Demmel, K. A. Yelick, S. Kamil, R. Nishtala & B. C. Lee, Performance optimizations and bounds for sparse
matrix-vector multiply, SC International Conference for High Performance Computing, Storage and Analysis, Baltimore, 2002.
[Wol97] S. Wolf, UNIX, vi and x-windows Guide, Joint Institute for Computational Science, 1997.