Relational Query Processing Approach to Compiling Sparse Matrix Codes

Vladimir Kotlyar
Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli

Outline
• Problem statement
  – Sparse matrix computations
  – Importance of sparse matrix formats
  – Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experimental results
• Ongoing work and conclusions

Sparse Matrices and Their Applications
[Figure: an example sparse matrix; most entries are zero.]
• Number of non-zeroes per row/column << n
• Often less than 0.1% non-zero
• Applications:
  – Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...

Application: numerical simulations
• Fracture mechanics Grand Challenge project:
  – Cornell CS + Civil Eng. + other schools
  – Supported by NSF, NASA, Boeing
[Figure: fracture simulation plot (MATLAB).]
• A system of differential equations is solved over a continuous domain
• Discretized into an algebraic system in the variables x(i)
• A system of linear equations Ax = b is at the core
• Intuition: A is sparse because the physical interactions are local

Application: authoritative sources on the Web
• Hubs and authorities on the Web
• Graph G = (V,E) of the documents
• A(u,v) = 1 if (u,v) is an edge
• A is sparse!
• Eigenvectors of AᵀA identify hubs, authorities and their clusters ("communities") [Kleinberg, Raghavan '97]

Sparse matrix algorithms
• Solution of linear systems
  – Direct methods (Gaussian elimination): A = LU
    • Impractical for many large-scale problems
    • For certain problems: O(n) space, O(n) time
  – Iterative methods
    • Matrix-vector products: y = Ax
    • Triangular system solution: Lx = b
    • Incomplete factorizations: A ≈ LU
• Eigenvalue problems:
  – Mostly matrix-vector products + dense computations

Sparse matrix computations
• "DOANY" -- operations can be performed in any order
  – Vector ops (dot product, addition, scaling)
  – Matrix-vector products
  – Rarely used: C = A+B
  – Important: C ← A+B, A ← A + UVᵀ
• "DOACROSS" -- dependencies between operations
  – Triangular system solution: Lx = b
• More complex applications are built out of the above + dense kernels
• Preprocessing (e.g. storage allocation): "graph theory"
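To make the DOANY/DOACROSS distinction concrete, here is a minimal dense sketch in C of the two kernels named above (the names and flat array layout are illustrative, not from the talk): the matrix-vector product's updates can be executed in any order, while forward substitution for Lx = b must consume previously computed entries of x.

    /* Dense reference kernels; a sketch for illustration only.
       y = y + A*x  -- DOANY: the (i,j) updates commute.
       L*x = b      -- DOACROSS: x[i] depends on x[0..i-1]. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * x[j];
    }

    void lower_solve(int n, const double *L, const double *b, double *x) {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < i; j++)
                s -= L[i*n + j] * x[j];   /* uses previously computed x[j] */
            x[i] = s / L[i*n + i];
        }
    }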
Storing Sparse Matrices
• Compressed formats are essential
  – O(nnz) time/space, not O(n²)
  – Example: matrix-vector product
    • 10M rows/columns, 50 non-zeroes per row
    • 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory)
• A variety of formats are used in practice
  – Application/architecture dependent
  – Different memory usage
  – Different performance on RISC processors

Point formats
Example 4-by-4 matrix (letters denote non-zeroes):

    a 0 b 0
    c 0 d 0
    e f 0 g
    h i 0 j

• Coordinate format: parallel arrays of row indices, column indices and values, in arbitrary order:

    row    = (1 1 2 3 2 3 4 3 4 4)
    column = (3 1 3 4 1 2 4 1 2 1)
    value  = (b a d g c f j e i h)

• Compressed Column Storage (CCS): column pointers, row indices and values, column by column:

    colp   = (1 5 7 9 11)
    rowind = (1 2 3 4 3 4 1 2 3 4)
    vals   = (a c e h f i b d g j)

Block formats
[Figure: an example matrix partitioned into 2-by-2 blocks and its Block Sparse Column representation: block column pointers, block row indices, and the dense 2-by-2 blocks stored contiguously.]
• Block Sparse Column
• "Natural" for physical problems with several unknowns at each point in space
• Saves storage: 25% for 2-by-2 blocks
• Improves performance on modern RISC processors

Why multiple formats: performance
[Bar chart: Mflops for the sparse matrix-vector product in the CRS, Jagged Diagonal and BlockSolve formats on several Harwell-Boeing matrices (sherman1, memplus, bcsstm27, ...).]
• Sparse matrix-vector product
• Formats: CRS, Jagged diagonal, BlockSolve
• On IBM RS6000 (66.5 MHz Power2)
• Best format depends on the application (20-70% advantage)

Bottom line
• Sparse matrices are used in a variety of application areas
• They have to be stored in compressed data structures
• Many formats are used in practice
  – Different storage/performance characteristics
• Code development is tedious and error-prone
  – No random access
  – Different code for each format
  – Even worse in parallel (many ways to distribute the data)

Libraries
• Dense computations: Basic Linear Algebra Subroutines (BLAS)
  – Implemented by most computer vendors
  – Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.
• Other computations are built on top of BLAS
• Can we do the same for sparse matrices?

Sparse Matrix Libraries
• Sparse Basic Linear Algebra Subroutine (SPBLAS) library [Pozo, Remington @ NIST]
  – 13 formats ==> too many combinations of "A op B"
  – Some important ops are not supported
  – Not extensible
• Coarse-grain solver packages [BlockSolve, Aztec, ...]
  – Particular class of problems/algorithms (e.g. iterative solution)
  – OO approaches: hooks for basic ops (e.g. matrix-vector product)

Our goal: generate sparse codes automatically
• Permit user-defined sparse data structures
• Specialize the high-level algorithm for sparsity, given the formats:

    FOR I=1,N
      sum = sum + X(I)*Y(I)

  is specialized to

    FOR I=1,N such that X(I)≠0 and Y(I)≠0
      sum = sum + X(I)*Y(I)

  and then translated to executable code.

Input to the compiler
• FOR-loops are sequential
• DO-loops can be executed in any order ("DOANY")
• Convert dense DO-loops into sparse code; for A compressed by column:

    DO I=1,N; J=1,N
      Y(I) = Y(I) + A(I,J)*X(J)

  becomes

    for (j = 0; j < N; j++)
      for (ii = colp[j]; ii < colp[j+1]; ii++)
        Y[rowind[ii]] = Y[rowind[ii]] + vals[ii]*X[j];
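To make the generated loop above self-contained, here is a small sketch in C of a compressed-column data structure and the corresponding matrix-vector product. The field names follow the snippet above; the type and function names are assumptions for illustration, not the toolkit's actual interface.

    /* Compressed Column Storage: a sketch; field names follow the snippet above. */
    typedef struct {
        int     n;       /* number of columns                                     */
        int    *colp;    /* colp[j] .. colp[j+1]-1 index the entries of column j  */
        int    *rowind;  /* row index of each stored non-zero                     */
        double *vals;    /* value of each stored non-zero                         */
    } ccs_matrix;

    /* y = y + A*x, touching only the stored non-zeroes of A (O(nnz) work). */
    void ccs_matvec(const ccs_matrix *A, const double *x, double *y) {
        for (int j = 0; j < A->n; j++)
            for (int ii = A->colp[j]; ii < A->colp[j+1]; ii++)
                y[A->rowind[ii]] += A->vals[ii] * x[j];
    }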
An example: locality enhancement
• Matrix-vector product, array A stored in column-major order:

    FOR I=1,N
      FOR J=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      -- stride-N access to A

• Would like to execute the code as:

    FOR J=1,N
      FOR I=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      -- stride-1 access to A

• In general?

An abstraction: polyhedra
• Loop nests == polyhedra in integer spaces:

    FOR I=1,N
      FOR J=1,I
        .....

  corresponds to the polyhedron 1 ≤ i ≤ N, 1 ≤ j ≤ i
• Transformations are maps on the iteration space, e.g. loop interchange: (i,j) → (j,i)
• Used in production and research compilers (SGI, HP, IBM)

Caveat
• The polyhedral model is not applicable to sparse computations:

    FOR I=1,N
      FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
          Y(I) = Y(I) + A(I,J)*X(J)

• The iteration set 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0 is not a polyhedron
• What is the right formalism?

Extensions for sparse matrix code generation

    FOR I=1,N
      FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
          Y(I) = Y(I) + A(I,J)*X(J)

• A is sparse, compressed by column
• Interchange the loops, encapsulate the guard:

    FOR J=1,N
      FOR I=1,N such that A(I,J) ≠ 0
        ...

• "Control-centric" approach: transform the loops to match the best access to data [Bik, Wijshoff]

Limitations of the control-centric approach
• Requires a well-defined direction of access:
  – CCS --> (J,I) loop order
  – CRS --> (I,J) loop order
  – COORDINATE --> ????

Data-centric transformations
• Main idea: concentrate on the data

    DO I=.....; J=...
      ..... A(F(I,J)) .....

• Array access function: <row,column> = F(I,J)
• Example: coordinate storage format
[Figure: enumerating the <row, column, value> triples of a matrix stored in coordinate format.]

Data-centric sparse code generation
• If there is only a single sparse array:

    FOR <row,column,value> in A
      I=row; J=column
      Y(I) = Y(I) + value*X(J)

• For each data structure, provide an enumeration method
• What if there is more than one sparse array?
  – Need to produce efficient simultaneous enumeration

Efficient simultaneous enumeration

    DO I=1,N
      IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
        sum = sum + X(I)*Y(I)

• Options:
  – Enumerate X, search Y: "data-centric on X"
  – Enumerate Y, search X: "data-centric on Y"
  – Can speed up searching by scattering into a dense vector
  – If both are sorted: "2-finger" merge
• Best choice depends on how X and Y are stored
• What is the general picture?

An observation

    DO I=1,N
      IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
        sum = sum + X(I)*Y(I)

• Can view arrays as relations (as in "relational databases"): X(i,x), Y(i,y)
• Have to enumerate the solutions of the relational query Join(X(i,x), Y(i,y))

Connection to relational queries
• Dot product <--> Join(X,Y):

    Dot product          Equi-join
    Enumerate/Search     Enumerate/Search
    Scatter              Hash join
    "2-finger"           (Sort-)Merge join

• The general case?

From loop nests to relational queries

    DO I, J, K, ...
      ..... A(F(I,J,K,...)) ..... B(G(I,J,K,...)) .....

• Arrays are relations (e.g. A(r,c,a))
  – Implicitly store zeros and non-zeros
• The integer space of loop variables is a relation, too: Iter(i,j,k,...)
• Access predicate S: relates loop variables and array elements
• Sparsity predicate P: the "interesting" combinations of zeros/non-zeros

    Select(P, Select(S ∧ Bounds, Product(Iter, A, B, ...)))
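As an illustration of the enumeration strategies listed above, here is a sketch in C of the sparse dot product implemented as a "2-finger" merge and as a scatter into a dense work array. The data layout (index/value pairs sorted by index) and all names are assumptions for illustration, not the toolkit's data structures.

    /* Sparse vectors as index/value pairs sorted by index; a sketch only. */
    typedef struct { int idx; double val; } entry;

    /* "2-finger" merge: both operands sorted by index (cf. sort-merge join). */
    double dot_merge(const entry *x, int nx, const entry *y, int ny) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < nx && j < ny) {
            if      (x[i].idx < y[j].idx) i++;
            else if (x[i].idx > y[j].idx) j++;
            else { sum += x[i].val * y[j].val; i++; j++; }
        }
        return sum;
    }

    /* Scatter: expand y into a dense, zero-initialized work array, then
       enumerate x and probe by direct indexing (cf. hash join). */
    double dot_scatter(const entry *x, int nx, const entry *y, int ny,
                       double *work /* length >= n, all zeroes */) {
        double sum = 0.0;
        for (int j = 0; j < ny; j++) work[y[j].idx] = y[j].val;    /* scatter */
        for (int i = 0; i < nx; i++) sum += x[i].val * work[x[i].idx];
        for (int j = 0; j < ny; j++) work[y[j].idx] = 0.0;         /* restore */
        return sum;
    }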
Why relational queries?

"[The relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other."
    -- E. F. Codd (CACM, 1970)

• Want to separate what is to be computed from how

Bernoulli Sparse Compilation Toolkit
• Compilation pipeline:

    Input program --> Front-end --> Query Optimizer --> Plan --> Instantiator --> Low-level C code

  with format modules (CRS, CCS, BRS, Coordinate, ...) plugged into the instantiator
• BSCT is about 40K lines of ML + 9K lines of C
• The query optimizer is at the core
• Extensible: new formats can be added

Query optimization: ordering joins

    select(a≠0 ∧ x≠0, join(A(i,j,a), X(j,x), Y(i,y)))

• A in CCS: Join(Join(A,X), Y)

    FOR J   Join(A,X)
      FOR I   Join(A(*,J), Y)

• A in CRS: Join(Join(A,Y), X)

    FOR I   Join(A,Y)
      FOR J   Join(A(I,*), X)

Query optimization: implementing joins

    FOR I   Join(A,Y)
      FOR J   Join(A(I,*), X)
        .....

  becomes, for example,

    FOR I   Merge(A,Y)
      H = scatter(X)
      FOR J   enumerate A(I,*), search H
        .....

• The output is called a plan

Instantiator: executable code generation

    H = scatter X
    FOR I   Merge(A,Y)
      FOR J   enumerate A(I,*), search H
        .....

  is instantiated as

    for (I = 0; I < N; I++)
      for (JJ = ROWP(I); JJ < ROWP(I+1); JJ++)
        .....

• Macro expansion
• Open system

Summary of the compilation techniques
• Data-centric methodology: walk the data, compute accordingly
• Implementation for sparse arrays
  – arrays = relations, loop nests = queries
• Compilation path
  – Main steps are independent of the data structure implementations
• Parallel code generation
  – Ownership, communication sets, ... = relations
• Difference from traditional relational database query optimization
  – Selectivity of predicates is not an issue; affine joins

Experiments
• Sequential
  – Kernels from the SPBLAS library
  – Iterative solution of linear systems
• Parallel
  – Iterative solution of linear systems
  – Comparison with the BlockSolve library from Argonne NL
  – Comparison with the proposed "High-Performance Fortran" standard

Setup
• IBM SP-2 at Cornell
• 120 MHz P2SC processor at each node
  – Can issue 2 multiply-add instructions per cycle
  – Peak performance 480 Mflops
  – Much lower on sparse problems: < 100 Mflops
• Benchmark matrices
  – From the Harwell-Boeing collection
  – Synthetic problems

Matrix-vector products
[Bar charts: Mflops for BSR and VBR matrix-vector products at block sizes 5-25, comparing the SPBLAS library (LIB), BSCT, and BSCT_OPT.]
• BSR = "Block Sparse Row"
• VBR = "Variable Block Sparse Row"
• BSCT_OPT = some "Dragon Book" optimizations applied by hand
  – Loop invariant removal

Solution of Triangular Systems
[Bar charts: Mflops for BSR and VBR triangular solves at block sizes 5-25, comparing the library, BSCT, and BSCT_OPT.]
• Bottom line:
  – Can compete with the SPBLAS library (need to implement loop invariant removal :-)

Iterative solution of sparse linear systems
• Essential for large-scale simulations
• Preconditioned Conjugate Gradients (PCG) algorithm
  – Basic kernels: y = Ax, Lx = b, plus dense vector ops
• Preprocessing step
  – Find M such that M⁻¹AM⁻ᵀ ≈ I
  – Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
  – Basic kernels: rank-one update A ← A − uvᵀ, sparse vector scaling
  – Cannot be implemented using the SPBLAS library
• Used the CCS format ("natural" for ICC)

Iterative solution
[Bar chart: Mflops of the ICC and PCG kernels at matrix sizes 2000-16000; PCG runs at roughly 40-48 Mflops, ICC at roughly 2-6 Mflops.]
• ICC: a lot of "sparse overhead"
• Ongoing investigation (at MathWorks):
  – Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!
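For reference, here is a compact sketch in C of how the PCG kernels above compose. The kernel interfaces (a matrix-vector product A and a preconditioner solve M) and all names are placeholders standing in for compiler-generated routines; this is an illustration under those assumptions, not the toolkit's interface.

    #include <stdlib.h>
    #include <string.h>

    /* Assumed kernel interfaces (placeholders for generated code):
         A(ctx, x, y) : y = A*x
         M(ctx, r, z) : z = M^{-1} r, e.g. two triangular solves with the
                        incomplete Cholesky factor C.                      */
    typedef void (*spmv_fn)(void *ctx, const double *x, double *y);
    typedef void (*prec_fn)(void *ctx, const double *r, double *z);

    static double dot(int n, const double *a, const double *b) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Preconditioned Conjugate Gradients for A x = b; x holds the initial
       guess on entry and the approximate solution on exit.  A sketch only. */
    void pcg(int n, spmv_fn A, void *Actx, prec_fn M, void *Mctx,
             const double *b, double *x, int maxit, double tol) {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

        A(Actx, x, q);                                   /* q = A*x0        */
        for (int i = 0; i < n; i++) r[i] = b[i] - q[i];  /* r = b - A*x0    */
        M(Mctx, r, z);                                   /* z = M^{-1} r    */
        memcpy(p, z, n * sizeof *p);
        double rz = dot(n, r, z);

        for (int it = 0; it < maxit && dot(n, r, r) > tol * tol; it++) {
            A(Actx, p, q);                               /* q = A*p         */
            double alpha = rz / dot(n, p, q);
            for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
            M(Mctx, r, z);                               /* z = M^{-1} r    */
            double rz_new = dot(n, r, z);
            double beta = rz_new / rz;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
            rz = rz_new;
        }
        free(r); free(z); free(p); free(q);
    }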
Iterative solution (cont.)
• Preliminary comparison with IBM ESSL DSRIS
  – DSRIS implements PCG (among other things)
• On BCSSTK30; matrix values were set so as to vary the convergence
• BSCT ICC takes 1.28 secs
• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs
• PCG iterations are ~15% faster in ESSL

Parallel experiments
• Conjugate Gradient algorithm
  – vs the BlockSolve library (Argonne NL)
• "Inspector" phase
  – Pre-computes what communication needs to occur
  – Done once, might be expensive
• "Executor" phase
  – "Receive-compute-send-..."
• On Harwell-Boeing matrices
• On synthetic grid problems to understand scalability

Executor performance
[Charts: executor time (seconds) vs. number of processors (2-64) for BlockSolve and BSCT, on BCSSTK32 and on the synthetic grid problems.]
• Grid problems: the problem size per processor is constant
  – 135K rows, ~4.6M non-zeroes
• Within 2-4% of the library

Inspector overhead
[Charts: ratio of inspector time to one executor iteration vs. number of processors (2-64) for BlockSolve, BSCT and HPF-2, on BCSSTK32 and on the grid problems.]
• Ratio of the inspector to a single iteration of the executor
  – A problem-independent measure
• "HPF-2" -- the new data-parallel Fortran standard
  – Lowest-common denominator; its inspectors are not scalable

Experiments: summary
• Sequential
  – Competitive with the SPBLAS library
• Parallel
  – The inspector phase should exploit formats (cf. HPF-2)

Ongoing work
• Packaging
  – "Library-on-demand"; as a Matlab toolbox
  – Completely automatic tool; data structure selection
• Out-of-core computations
• Parallel code generation
  – Extend to handle more kernels
• Core of the compiler
  – Disjunctive queries, fill

Related work - languages and compilers
• Compilation of dense matrix (regular) codes
  – Polyhedral model [Lamport '78, ...]
  – Data-parallel languages (e.g. HPF, ZPL)
• Compilation of sparse matrix codes
  – [Bik, Wijshoff '92] -- sequential sparse compiler
  – [Saltz, Zima, Ujaldon, ...] -- irregular computations in HPF-2
    • Fixed data structures, not extensible; separate compilation paths for dense/sparse
  – Data-centric blocking [Kodukula, Ahmed, Pingali '97]
• Programming with ADTs
  – SETL -- automatic data structure selection for set operations
  – Transformational systems (e.g. Polya [Gries])
  – Software reuse through views [Novak]

Related work - databases
• Optimizing loops in DB programming languages [Lieuwen, DeWitt '90]
• Extensible database systems [Predator, ...]
[Diagram: analogy between an extensible database system (Predator: per-type modules for images, sequences and relations, each with its own optimizations, over common utilities) and BSCT (per-format modules for BSR, CRS, coordinate, ..., under the compiler/optimizer, over common utilities).]
Conclusions/Contributions
• Sparse matrix computations are widely used in practice
• Code development is tedious and error-prone
• Bernoulli Sparse Compilation Toolkit:
  – Arrays as relations, loops as queries
  – Compilation as query optimization
• Algebras in optimizing compilers:

    Data-flow analysis           --  lattice algebra
    Dense matrix computations    --  polyhedral algebra
    Sparse matrix computations   --  relational algebra

Future work
• Sparse compiler
  – Completely automatic system, data structure selection
  – Out-of-core computations
• Extensible compilation
  – Want to support multiple formats and optimizations across basic ops
• Application area: signal/image processing
  – Compositions of transforms, such as FFTs and the DCT
  – Multiple data representations
  – Important to optimize the flow of data through memory hierarchies
  – IBM ESSL: FFT at close to peak performance
  – Algebra: Kronecker products
• Programming languages and databases
  – e.g. Java and JDBC

Future interests
[Diagram: research interests at the intersection of computational science, programming systems/compilers, and databases; the sparse compiler and data mining lie in the overlaps.]