Relational Query Processing Approach to Compiling Sparse Matrix Codes

Vladimir Kotlyar
Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli

Outline
• Problem statement
  – Sparse matrix computations
  – Importance of sparse matrix formats
  – Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experimental results
• Ongoing work and conclusions

Sparse Matrices and Their Applications
[Figure: an example sparse matrix; most entries are zero.]
• Number of non-zeroes per row/column << n
• Often less than 0.1% non-zero
• Applications:
  – Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...

Application: numerical simulations
• Fracture mechanics Grand Challenge project:
  – Cornell CS + Civil Eng. + other schools
  – Supported by NSF, NASA, Boeing
[Figure: fracture simulation plot (MATLAB).]
• A system of differential equations is solved over a continuous domain
• Discretized into an algebraic system in the variables x(i)
• A system of linear equations Ax = b is at the core
• Intuition: A is sparse because the physical interactions are local

Application: authoritative sources on the Web
• Hubs and authorities on the Web
• Graph G = (V,E) of the documents
• A(u,v) = 1 if (u,v) is an edge
• A is sparse!
• Eigenvectors of AᵀA identify hubs, authorities and their clusters ("communities") [Kleinberg, Raghavan '97]

Sparse matrix algorithms
• Solution of linear systems
  – Direct methods (Gaussian elimination): A = LU
    • Impractical for many large-scale problems
    • For certain problems: O(n) space, O(n) time
  – Iterative methods
    • Matrix-vector products: y = Ax
    • Triangular system solution: Lx = b
    • Incomplete factorizations: A ≈ LU
• Eigenvalue problems:
  – Mostly matrix-vector products + dense computations

Sparse matrix computations
• "DOANY" -- operations can be performed in any order
  – Vector ops (dot product, addition, scaling)
  – Matrix-vector products
  – Rarely used: C = A+B
  – Important: C ← A+B, A ← A + UVᵀ
• "DOACROSS" -- dependencies between operations
  – Triangular system solution: Lx = b
• More complex applications are built out of the above + dense kernels
• Preprocessing (e.g. storage allocation): "graph theory"
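To make the DOANY/DOACROSS distinction concrete, here is a minimal dense sketch in C of the two kernels named above (the names and flat array layout are illustrative, not from the talk): the matrix-vector product's updates can be executed in any order, while forward substitution for Lx = b must consume previously computed entries of x.

    /* Dense reference kernels; a sketch for illustration only.
       y = y + A*x  -- DOANY: the (i,j) updates commute.
       L*x = b      -- DOACROSS: x[i] depends on x[0..i-1]. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * x[j];
    }

    void lower_solve(int n, const double *L, const double *b, double *x) {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= L[i*n + j] * x[j];   /* uses previously computed x[j] */
            x[i] = s / L[i*n + i];
        }
    }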
Storing Sparse Matrices
• Compressed formats are essential
  – O(nnz) time/space, not O(n²)
  – Example: matrix-vector product
    • 10M rows/columns, 50 non-zeroes per row
    • 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory)
• A variety of formats are used in practice
  – Application/architecture dependent
  – Different memory usage
  – Different performance on RISC processors

Point formats
Example 4-by-4 matrix (letters denote non-zeroes):

    a 0 b 0
    c 0 d 0
    e f 0 g
    h i 0 j

• Coordinate format: parallel arrays of row indices, column indices and values, in arbitrary order:

    row    = (1 1 2 3 2 3 4 3 4 4)
    column = (3 1 3 4 1 2 4 1 2 1)
    value  = (b a d g c f j e i h)

• Compressed Column Storage (CCS): column pointers, row indices and values, column by column:

    colp   = (1 5 7 9 11)
    rowind = (1 2 3 4 3 4 1 2 3 4)
    vals   = (a c e h f i b d g j)

Block formats
[Figure: an example matrix partitioned into 2-by-2 blocks and its Block Sparse Column representation: block column pointers, block row indices, and the dense 2-by-2 blocks stored contiguously.]
• Block Sparse Column
• "Natural" for physical problems with several unknowns at each point in space
• Saves storage: 25% for 2-by-2 blocks
• Improves performance on modern RISC processors

Why multiple formats: performance
[Bar chart: Mflops for the sparse matrix-vector product in the CRS, Jagged Diagonal and BlockSolve formats on several Harwell-Boeing matrices (sherman1, memplus, bcsstm27, ...).]
• Sparse matrix-vector product
• Formats: CRS, Jagged diagonal, BlockSolve
• On IBM RS6000 (66.5 MHz Power2)
• Best format depends on the application (20-70% advantage)

Bottom line
• Sparse matrices are used in a variety of application areas
• They have to be stored in compressed data structures
• Many formats are used in practice
  – Different storage/performance characteristics
• Code development is tedious and error-prone
  – No random access
  – Different code for each format
  – Even worse in parallel (many ways to distribute the data)

Libraries
• Dense computations: Basic Linear Algebra Subroutines (BLAS)
  – Implemented by most computer vendors
  – Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.
• Other computations are built on top of BLAS
• Can we do the same for sparse matrices?

Sparse Matrix Libraries
• Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo, Remington @ NIST]
  – 13 formats ==> too many combinations of "A op B"
  – Some important ops are not supported
  – Not extensible
• Coarse-grain solver packages [BlockSolve, Aztec, ...]
  – Particular class of problems/algorithms (e.g. iterative solution)
  – OO approaches: hooks for basic ops (e.g. matrix-vector product)

Our goal: generate sparse codes automatically
• Permit user-defined sparse data structures
• Specialize the high-level algorithm for sparsity, given the formats:

    FOR I=1,N
      sum = sum + X(I)*Y(I)

  is specialized to

    FOR I=1,N such that X(I)≠0 and Y(I)≠0
      sum = sum + X(I)*Y(I)

  and then translated to executable code.

Input to the compiler
• FOR-loops are sequential
• DO-loops can be executed in any order ("DOANY")
• Convert dense DO-loops into sparse code; for A compressed by column:

    DO I=1,N; J=1,N
      Y(I) = Y(I) + A(I,J)*X(J)

  becomes

    for (j = 0; j < N; j++)
      for (ii = colp[j]; ii < colp[j+1]; ii++)
        Y[rowind[ii]] = Y[rowind[ii]] + vals[ii]*X[j];
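To make the generated loop above self-contained, here is a small sketch in C of a compressed-column data structure and the corresponding matrix-vector product. The field names follow the snippet above; the type and function names are assumptions for illustration, not the toolkit's actual interface.

    /* Compressed Column Storage: a sketch; field names follow the snippet above. */
    typedef struct {
        int     n;       /* number of columns                                     */
        int    *colp;    /* colp[j] .. colp[j+1]-1 index the entries of column j  */
        int    *rowind;  /* row index of each stored non-zero                     */
        double *vals;    /* value of each stored non-zero                         */
    } ccs_matrix;

    /* y = y + A*x, touching only the stored non-zeroes of A (O(nnz) work). */
    void ccs_matvec(const ccs_matrix *A, const double *x, double *y) {
        for (int j = 0; j < A->n; j++)
            for (int ii = A->colp[j]; ii < A->colp[j+1]; ii++)
                y[A->rowind[ii]] += A->vals[ii] * x[j];
    }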
An example: locality enhancement
• Matrix-vector product, array A stored in column-major order:

    FOR I=1,N
      FOR J=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      -- stride-N access to A

• Would like to execute the code as:

    FOR J=1,N
      FOR I=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      -- stride-1 access to A

• In general?

An abstraction: polyhedra
• Loop nests == polyhedra in integer spaces:

    FOR I=1,N
      FOR J=1,I
        .....

  corresponds to the polyhedron 1 ≤ i ≤ N, 1 ≤ j ≤ i
• Transformations are maps on the iteration space, e.g. loop interchange: (i,j) → (j,i)
• Used in production and research compilers (SGI, HP, IBM)

Caveat
• The polyhedral model is not applicable to sparse computations:

    FOR I=1,N
      FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
          Y(I) = Y(I) + A(I,J)*X(J)

• The iteration set 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0 is not a polyhedron
• What is the right formalism?

Extensions for sparse matrix code generation

    FOR I=1,N
      FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
          Y(I) = Y(I) + A(I,J)*X(J)

• A is sparse, compressed by column
• Interchange the loops, encapsulate the guard:

    FOR J=1,N
      FOR I=1,N such that A(I,J) ≠ 0
        ...

• "Control-centric" approach: transform the loops to match the best access to data [Bik, Wijshoff]

Limitations of the control-centric approach
• Requires a well-defined direction of access:
  – CCS --> (J,I) loop order
  – CRS --> (I,J) loop order
  – COORDINATE --> ????

Data-centric transformations
• Main idea: concentrate on the data

    DO I=.....; J=...
      ..... A(F(I,J)) .....

• Array access function: <row,column> = F(I,J)
• Example: coordinate storage format
[Figure: enumerating the <row, column, value> triples of a matrix stored in coordinate format.]

Data-centric sparse code generation
• If there is only a single sparse array:

    FOR <row,column,value> in A
      I=row; J=column
      Y(I) = Y(I) + value*X(J)

• For each data structure, provide an enumeration method
• What if there is more than one sparse array?
  – Need to produce efficient simultaneous enumeration

Efficient simultaneous enumeration

    DO I=1,N
      IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
        sum = sum + X(I)*Y(I)

• Options:
  – Enumerate X, search Y: "data-centric on X"
  – Enumerate Y, search X: "data-centric on Y"
  – Can speed up searching by scattering into a dense vector
  – If both are sorted: "2-finger" merge
• Best choice depends on how X and Y are stored
• What is the general picture?

An observation

    DO I=1,N
      IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
        sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in "relational databases"): X(i,x), Y(i,y)
• Have to enumerate the solutions of the relational query Join(X(i,x), Y(i,y))

Connection to relational queries
• Dot product <--> Join(X,Y):

    Dot product          Equi-join
    Enumerate/Search     Enumerate/Search
    Scatter              Hash join
    "2-finger"           (Sort-)Merge join

• The general case?

From loop nests to relational queries

    DO I, J, K, ...
      ..... A(F(I,J,K,...)) ..... B(G(I,J,K,...)) .....

• Arrays are relations (e.g. A(r,c,a))
  – Implicitly store zeros and non-zeros
• The integer space of loop variables is a relation, too: Iter(i,j,k,...)
• Access predicate S: relates loop variables and array elements
• Sparsity predicate P: the "interesting" combinations of zeros/non-zeros

    Select(P, Select(S ∧ Bounds, Product(Iter, A, B, ...)))
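As an illustration of the enumeration strategies listed above, here is a sketch in C of the sparse dot product implemented as a "2-finger" merge and as a scatter into a dense work array. The data layout (index/value pairs sorted by index) and all names are assumptions for illustration, not the toolkit's data structures.

    /* Sparse vectors as index/value pairs sorted by index; a sketch only. */
    typedef struct { int idx; double val; } entry;

    /* "2-finger" merge: both operands sorted by index (cf. sort-merge join). */
    double dot_merge(const entry *x, int nx, const entry *y, int ny) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < nx && j < ny) {
            if      (x[i].idx < y[j].idx) i++;
            else if (x[i].idx > y[j].idx) j++;
            else { sum += x[i].val * y[j].val; i++; j++; }
        }
        return sum;
    }

    /* Scatter: expand y into a dense, zero-initialized work array, then
       enumerate x and probe by direct indexing (cf. hash join). */
    double dot_scatter(const entry *x, int nx, const entry *y, int ny,
                       double *work /* length >= n, all zeroes */) {
        double sum = 0.0;
        for (int j = 0; j < ny; j++) work[y[j].idx] = y[j].val;    /* scatter */
        for (int i = 0; i < nx; i++) sum += x[i].val * work[x[i].idx];
        for (int j = 0; j < ny; j++) work[y[j].idx] = 0.0;         /* restore */
        return sum;
    }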
Why relational queries?

"[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other."
    -- E. F. Codd (CACM, 1970)

• Want to separate what is to be computed from how

Bernoulli Sparse Compilation Toolkit
• Compilation pipeline:

    Input program --> Front-end --> Query Optimizer --> Plan --> Instantiator --> Low-level C code

  with format modules (CRS, CCS, BRS, Coordinate, ...) plugged into the instantiator
• BSCT is about 40K lines of ML + 9K lines of C
• The query optimizer is at the core
• Extensible: new formats can be added

Query optimization: ordering joins

    select(a≠0 ∧ x≠0, join(A(i,j,a), X(j,x), Y(i,y)))

• A in CCS: Join(Join(A,X), Y)

    FOR J   Join(A,X)
      FOR I   Join(A(*,J), Y)

• A in CRS: Join(Join(A,Y), X)

    FOR I   Join(A,Y)
      FOR J   Join(A(I,*), X)

Query optimization: implementing joins

    FOR I   Join(A,Y)
      FOR J   Join(A(I,*), X)
        .....

  becomes, for example,

    FOR I   Merge(A,Y)
      H = scatter(X)
      FOR J   enumerate A(I,*), search H
        .....

• The output is called a plan

Instantiator: executable code generation

    H = scatter X
    FOR I   Merge(A,Y)
      FOR J   enumerate A(I,*), search H
        .....

  is instantiated as

    for (I = 0; I < N; I++)
      for (JJ = ROWP(I); JJ < ROWP(I+1); JJ++)
        .....

• Macro expansion
• Open system

Summary of the compilation techniques
• Data-centric methodology: walk the data, compute accordingly
• Implementation for sparse arrays
  – arrays = relations, loop nests = queries
• Compilation path
  – Main steps are independent of the data structure implementations
• Parallel code generation
  – Ownership, communication sets, ... = relations
• Difference from traditional relational database query optimization
  – Selectivity of predicates is not an issue; affine joins

Experiments
• Sequential
  – Kernels from the SPBLAS library
  – Iterative solution of linear systems
• Parallel
  – Iterative solution of linear systems
  – Comparison with the BlockSolve library from Argonne NL
  – Comparison with the proposed "High-Performance Fortran" standard

Setup
• IBM SP-2 at Cornell
• 120 MHz P2SC processor at each node
  – Can issue 2 multiply-add instructions per cycle
  – Peak performance 480 Mflops
  – Much lower on sparse problems: < 100 Mflops
• Benchmark matrices
  – From the Harwell-Boeing collection
  – Synthetic problems

Matrix-vector products
[Bar charts: Mflops for BSR and VBR matrix-vector products at block sizes 5-25, comparing the SPBLAS library (LIB), BSCT, and BSCT_OPT.]
• BSR = "Block Sparse Row"
• VBR = "Variable Block Sparse Row"
• BSCT_OPT = some "Dragon Book" optimizations applied by hand
  – Loop invariant removal

Solution of Triangular Systems
[Bar charts: Mflops for BSR and VBR triangular solves at block sizes 5-25, comparing the library, BSCT, and BSCT_OPT.]
• Bottom line:
  – Can compete with the SPBLAS library (need to implement loop invariant removal :-)

Iterative solution of sparse linear systems
• Essential for large-scale simulations
• Preconditioned Conjugate Gradients (PCG) algorithm
  – Basic kernels: y = Ax, Lx = b, plus dense vector ops
• Preprocessing step
  – Find M such that M⁻¹AM⁻ᵀ ≈ I
  – Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
  – Basic kernels: rank-one update A ← A − uvᵀ, sparse vector scaling
  – Cannot be implemented using the SPBLAS library
• Used the CCS format ("natural" for ICC)

Iterative solution
[Bar chart: Mflops of the ICC and PCG kernels at matrix sizes 2000-16000; PCG runs at roughly 40-48 Mflops, ICC at roughly 2-6 Mflops.]
• ICC: a lot of "sparse overhead"
• Ongoing investigation (at MathWorks):
  – Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!
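For reference, here is a compact sketch in C of how the PCG kernels above compose. The kernel interfaces (a matrix-vector product A and a preconditioner solve M) and all names are placeholders standing in for compiler-generated routines; this is an illustration under those assumptions, not the toolkit's interface.

    #include <stdlib.h>
    #include <string.h>

    /* Assumed kernel interfaces (placeholders for generated code):
         A(ctx, x, y) : y = A*x
         M(ctx, r, z) : z = M^{-1} r, e.g. two triangular solves with the
                        incomplete Cholesky factor C.                      */
    typedef void (*spmv_fn)(void *ctx, const double *x, double *y);
    typedef void (*prec_fn)(void *ctx, const double *r, double *z);

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Preconditioned Conjugate Gradients for A x = b; x holds the initial
       guess on entry and the approximate solution on exit.  A sketch only. */
    void pcg(int n, spmv_fn A, void *Actx, prec_fn M, void *Mctx,
             const double *b, double *x, int maxit, double tol) {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

        A(Actx, x, q);                                   /* q = A*x0        */
        for (int i = 0; i < n; i++) r[i] = b[i] - q[i];  /* r = b - A*x0    */
        M(Mctx, r, z);                                   /* z = M^{-1} r    */
        memcpy(p, z, n * sizeof *p);
        double rz = dot(n, r, z);

        for (int it = 0; it < maxit && dot(n, r, r) > tol * tol; it++) {
            A(Actx, p, q);                               /* q = A*p         */
            double alpha = rz / dot(n, p, q);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            M(Mctx, r, z);                               /* z = M^{-1} r    */
            double rz_new = dot(n, r, z);
            double beta = rz_new / rz;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
            rz = rz_new;
        }
        free(r); free(z); free(p); free(q);
    }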
Iterative solution (cont.)
• Preliminary comparison with IBM ESSL DSRIS
  – DSRIS implements PCG (among other things)
• On BCSSTK30; matrix values were set so as to vary the convergence
• BSCT ICC takes 1.28 secs
• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs
• PCG iterations are ~15% faster in ESSL

Parallel experiments
• Conjugate Gradient algorithm
  – vs the BlockSolve library (Argonne NL)
• "Inspector" phase
  – Pre-computes what communication needs to occur
  – Done once, might be expensive
• "Executor" phase
  – "Receive-compute-send-..."
• On Harwell-Boeing matrices
• On synthetic grid problems to understand scalability

Executor performance
[Charts: executor time (seconds) vs. number of processors (2-64) for BlockSolve and BSCT, on BCSSTK32 and on the synthetic grid problems.]
• Grid problems: the problem size per processor is constant
  – 135K rows, ~4.6M non-zeroes
• Within 2-4% of the library

Inspector overhead
[Charts: ratio of inspector time to one executor iteration vs. number of processors (2-64) for BlockSolve, BSCT and HPF-2, on BCSSTK32 and on the grid problems.]
• Ratio of the inspector to a single iteration of the executor
  – A problem-independent measure
• "HPF-2" -- the new data-parallel Fortran standard
  – Lowest-common denominator; its inspectors are not scalable

Experiments: summary
• Sequential
  – Competitive with the SPBLAS library
• Parallel
  – The inspector phase should exploit formats (cf. HPF-2)

Ongoing work
• Packaging
  – "Library-on-demand"; as a Matlab toolbox
  – Completely automatic tool; data structure selection
• Out-of-core computations
• Parallel code generation
  – Extend to handle more kernels
• Core of the compiler
  – Disjunctive queries, fill

Related work - languages and compilers
• Compilation of dense matrix (regular) codes
  – Polyhedral model [Lamport '78, ...]
  – Data-parallel languages (e.g. HPF, ZPL)
• Compilation of sparse matrix codes
  – [Bik, Wijshoff '92] -- sequential sparse compiler
  – [Saltz, Zima, Ujaldon, ...] -- irregular computations in HPF-2
    • Fixed data structures, not extensible; separate compilation paths for dense/sparse
  – Data-centric blocking [Kodukula, Ahmed, Pingali '97]
• Programming with ADTs
  – SETL -- automatic data structure selection for set operations
  – Transformational systems (e.g. Polya [Gries])
  – Software reuse through views [Novak]

Related work - databases
• Optimizing loops in DB programming languages [Lieuwen, DeWitt '90]
• Extensible database systems [Predator, ...]
[Diagram: analogy between an extensible database system (Predator: per-type modules for images, sequences and relations, each with its own optimizations, over common utilities) and BSCT (per-format modules for BSR, CRS, coordinate, ..., under the compiler/optimizer, over common utilities).]
Conclusions/Contributions
• Sparse matrix computations are widely used in practice
• Code development is tedious and error-prone
• Bernoulli Sparse Compilation Toolkit:
  – Arrays as relations, loops as queries
  – Compilation as query optimization
• Algebras in optimizing compilers:

    Data-flow analysis           --  lattice algebra
    Dense matrix computations    --  polyhedral algebra
    Sparse matrix computations   --  relational algebra

Future work
• Sparse compiler
  – Completely automatic system, data structure selection
  – Out-of-core computations
• Extensible compilation
  – Want to support multiple formats and optimizations across basic ops
• Application area: signal/image processing
  – Compositions of transforms, such as FFTs and the DCT
  – Multiple data representations
  – Important to optimize the flow of data through memory hierarchies
  – IBM ESSL: FFT at close to peak performance
  – Algebra: Kronecker products
• Programming languages and databases
  – e.g. Java and JDBC

Future interests
[Diagram: research interests at the intersection of computational science, programming systems/compilers, and databases; the sparse compiler and data mining lie in the overlaps.]