A Relational Query Processing Approach to Compiling Sparse Matrix Codes
Vladimir Kotlyar
Computer Science Department, Cornell University
http://www.cs.cornell.edu/Info/Project/Bernoulli
Outline
• Problem statement
– Sparse matrix computations
– Importance of sparse matrix formats
– Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experimental results
• Ongoing work and conclusions
Sparse Matrices and Their Applications



0
0


0

0 0  0 0
0 0 0 0 

0 0  0 0
0 0 0 0 0

0 0  0 0
0  0  0
• Number of non-zeroes per row/column << n
• Often, less than 0.1% non-zero
• Applications:
– Numerical simulations, (non)linear optimization, graph theory, information retrieval, ...
Application: numerical simulations
• Fracture mechanics Grand Challenge project:
– Cornell CS + Civil Eng. + other schools;
[Figure: crack.eps, a MATLAB plot from the fracture mechanics simulation]
– Supported by NSF, NASA, Boeing
• A system of differential equations is
solved over a continuous domain
• Discretized into an algebraic system in variables x(i)
• System of linear equations Ax=b is at the core
• Intuition: A is sparse because the physical interactions are local
Application: Authoritative sources on the Web
• Hubs and authorities on the Web
[Diagram: hubs pointing to authorities in the Web graph]
• Graph G=(V,E) of the documents
• A(u,v) = 1 if (u,v) is an edge
• A is sparse!
• Eigenvectors of AᵀA identify hubs, authorities and their clusters (“communities”) [Kleinberg, Raghavan ‘97]
Sparse matrix algorithms
• Solution of linear systems
– Direct methods (Gaussian elimination): A = LU
• Impractical for many large-scale problems
• For certain problems: O(n) space, O(n) time
– Iterative methods
• Matrix-vector products: y = Ax
• Triangular system solution: Lx=b (see the C sketch after this list)
• Incomplete factorizations: A ≈ LU
• Eigenvalue problems:
– Mostly matrix-vector products + dense computations
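To make the DOACROSS dependence concrete, here is a hedged C sketch of forward substitution Lx=b with L lower triangular in compressed row storage; the identifiers (rowp, colind, vals) are illustrative, not the toolkit's, and the diagonal is assumed to be stored last in each row:

/* Forward substitution Lx = b, L in CRS, 0-based indices. */
void crs_lower_solve(int n, const int *rowp, const int *colind,
                     const double *vals, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int diag = rowp[i+1] - 1;           /* L(i,i) assumed stored last */
        for (int k = rowp[i]; k < diag; k++)
            s -= vals[k] * x[colind[k]];    /* needs x(j), j < i: DOACROSS */
        x[i] = s / vals[diag];
    }
}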
Sparse matrix computations
• “DOANY” -- operations in any order
– Vector ops (dot product, addition, scaling)
– Matrix-vector products
– Rarely used: C = A+B
– Important: C ← A+B, A ← A + UVᵀ
• “DOACROSS” -- dependencies between operations
– Triangular system solution: Lx = b
• More complex applications are built out of the above
+ dense kernels
• Preprocessing (e.g. storage allocation): “graph theory”
Outline
• Problem statement
– Sparse matrix computations
– Sparse Matrix Storage Formats
– Difficulties in the development of sparse matrix codes
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
Storing Sparse Matrices
• Compressed formats are essential
– O(nnz) time/space, not O(n²)
– Example: matrix-vector product
• 10M rows/columns, 50 non-zeroes per row
• 5 seconds vs 139 hours on a 200 Mflops computer (assuming huge memory): the sparse product touches only the ~5·10⁸ non-zeros, the dense one all 10¹⁴ entries
• A variety of formats are used in practice
– Application/architecture dependent
– Different memory usage
– Different performance on RISC processors
Point formats
Example matrix:

    [ a 0 b 0 ]
    [ c 0 d 0 ]
    [ e f 0 g ]
    [ h i 0 j ]

Coordinate format:

    row: 1 1 2 3 2 3 4 3 4 4
    col: 3 1 3 4 1 2 4 1 2 1
    val: b a d g c f j e i h

Compressed Column Storage:

    colp:   1 5 7 9 11
    rowind: 1 2 3 4 3 4 1 2 3 4
    vals:   a c e h f i b d g j
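For concreteness, the two point formats above might be declared in C as follows; this is a sketch with assumed field names, not the toolkit's actual data structures:

typedef struct {            /* coordinate format */
    int nnz;                /* number of stored non-zeros */
    int *row, *col;         /* row[k], col[k]: position of the k-th entry */
    double *val;            /* val[k]: its value */
} CooMatrix;

typedef struct {            /* compressed column storage */
    int n, nnz;
    int *colp;              /* column j occupies colp[j] .. colp[j+1]-1 */
    int *rowind;            /* row index of each stored entry */
    double *vals;           /* value of each stored entry */
} CcsMatrix;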
Block formats
Example matrix with 2-by-2 block structure:

    [ a b 0 0 c d ]
    [ e f 0 0 g 0 ]
    [ 0 0 h k 0 0 ]
    [ 0 0 p q 0 0 ]
    [ x y 0 0 0 0 ]
    [ z t 0 0 0 0 ]

Block Sparse Column storage (2-by-2 blocks, each block stored column-major):

    bcolp:   1 3 4 5
    browind: 1 3 2 1
    bvals:   a e b f | x z y t | h p k q | c g d 0
• “Natural” for physical problems with several unknowns at each
point in space
• Saves storage: 25% for 2-by-2 blocks (e.g. with 8-byte values and 4-byte indices, full blocks cut storage from 12 bytes per non-zero to (4·8+4)/4 = 9)
• Improves performance on modern RISC processors
Why multiple formats: performance
[Bar chart: Mflops (0-35) for sparse matrix-vector product in the CRS, JDIAG, and Bsolve formats across several Harwell-Boeing matrices (sherman1, memplus, bcsstm27, 685_bus, ...)]
• Sparse matrix-vector product
• Formats: CRS, Jagged diagonal, BlockSolve
• On IBM RS6000 (66.5 MHz Power2)
• Best format depends on the application (20-70% advantage)
Bottom line
• Sparse matrices are used in a variety of application areas
• Have to be stored in compressed data structures
• Many formats are used in practice
– Different storage/performance characteristics
• Code development is tedious and error-prone
– No random access
– Different code for each format
– Even worse in parallel (many ways to distribute the data)
Libraries
• Dense computations: Basic Linear Algebra Subroutines
– Implemented by most computer vendors
– Few formats, easy to parametrize: row/column-major, symmetric/unsymmetric, etc.
• Other computations are built on top of BLAS
• Can we do the same for sparse matrices?
Sparse Matrix Libraries
• Sparse Basic Linear Algebra Subroutine (SPBLAS) library
[Pozo,Remington @ NIST]
– 13 formats ==> too many combinations of “A op B”
– Some important ops are not supported
– Not extensible
• Coarse-grain solver packages [BlockSolve,Aztec,…]
– Particular class of problems/algorithms
(e.g. iterative solution)
– OO approaches: hooks for basic ops
(e.g. matrix-vector product)
Our goal: generate sparse codes automatically
• Permit user-defined sparse data structures
• Specialize high-level algorithm for sparsity, given the formats
FOR I=1,N
    sum = sum + X(I)*Y(I)

        ⇓

FOR I=1,N such that X(I) ≠ 0 and Y(I) ≠ 0
    sum = sum + X(I)*Y(I)

        ⇓

executable code
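As a concrete instance of the last step, if X is stored as sorted (index, value) pairs and Y is dense, the executable code might look like the hedged C sketch below (all identifiers are assumptions); zero entries of Y contribute nothing, so no explicit guard on Y is needed:

/* sum of X(I)*Y(I) over the non-zeros of X */
double sparse_dot_dense(int xn, const int *xi, const double *xv,
                        const double *Y)
{
    double sum = 0.0;
    for (int k = 0; k < xn; k++)
        sum += xv[k] * Y[xi[k]];   /* enumerate X, index Y directly */
    return sum;
}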
Input to the compiler
• FOR-loops are sequential
• DO-loops can be executed in any order (“DOANY”)
• Convert dense DO-loops into sparse code
DO I=1,N; J=1,N
    Y(I) = Y(I) + A(I,J)*X(J)

        ⇓  (A compressed by column)

for (j = 0; j < N; j++)
    for (ii = colp(j); ii < colp(j+1); ii++)
        Y(rowind(ii)) = Y(rowind(ii)) + vals(ii)*X(j);
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
An example: locality enhancement
• Matrix-vector product, array A stored in column-major order

FOR I=1,N
    FOR J=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      (stride-N access to A)

• Would like to execute the code as:

FOR J=1,N
    FOR I=1,N
        Y(I) = Y(I) + A(I,J)*X(J)      (stride-1 access to A)
• In general?
An abstraction: polyhedra
• Loop nests == polyhedra in integer spaces
1  i  N

1 j  i
FOR I=1,N
FOR J=1,I
…..
• Transformations remap the integer points (e.g. loop interchange maps (i,j) to (j,i); a worked example follows below)
• Used in production and research compilers (SGI, HP, IBM)
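A small worked example (not from the slides): interchanging the triangular nest above scans the same set { (i,j) : 1 ≤ j ≤ i ≤ N } with j outermost, and the new loop bounds fall out of the constraints:

#include <stdio.h>

int main(void)
{
    const int N = 4;
    for (int j = 1; j <= N; j++)        /* j is now outermost */
        for (int i = j; i <= N; i++)    /* bounds from j <= i <= N */
            printf("(%d,%d) ", i, j);   /* each point visited exactly once */
    return 0;
}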
Caveat
• The polyhedral model is not applicable to sparse computations
FOR I=1,N
    FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
            Y(I) = Y(I) + A(I,J)*X(J)

{ (i,j) : 1 ≤ i ≤ N, 1 ≤ j ≤ N, A(i,j) ≠ 0 }

• Not a polyhedron
• What is the right formalism?
Extensions for sparse matrix code generation
FOR I=1,N
    FOR J=1,N
        IF (A(I,J) ≠ 0) THEN
            Y(I)=Y(I)+A(I,J)*X(J)

• A is sparse, compressed by column
• Interchange the loops, encapsulate the guard

FOR J=1,N
    FOR I=1,N such that A(I,J) ≠ 0
        ...
• “Control-centric” approach: transform the loops to match the
best access to data [Bik,Wijshoff]
Limitations of the control-centric approach
• Requires well-defined direction of access
    CCS        → (J,I) loop order
    CRS        → (I,J) loop order
    COORDINATE → ????
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
Data-centric transformations
• Main idea: concentrate on the data
DO I=…..; J=...
…..A(F(I,J))…..
• Array access function: <row,column> = F(I,J)
• Example: coordinate storage format:
[Figure: coord.fig, the coordinate storage format as a table of <row, column, value> triples]
Data-centric sparse code generation
• If only a single sparse array:
FOR <row,column,value> in A
I=row; J=column
Y(I)=Y(I)+value*X(J)
• For each data structure, provide an enumeration method (a C sketch for coordinate storage follows below)
• What if more than one sparse array?
– Need to produce efficient simultaneous enumeration
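For the coordinate format, the enumeration loop above might compile into the following hedged C sketch (array names assumed), here specialized to y = y + Ax; since the operation is DOANY, the triples may be visited in any order:

void coo_matvec(int nnz, const int *row, const int *col,
                const double *val, const double *x, double *y)
{
    for (int k = 0; k < nnz; k++) {      /* FOR <row, column, value> in A */
        int i = row[k], j = col[k];
        y[i] += val[k] * x[j];
    }
}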
Efficient simultaneous enumeration
DO I=1,N
    IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
        sum = sum + X(I)*Y(I)

[Figure: dot2.fig, illustrating the four enumeration strategies]

• Options:
– Enumerate X, search Y: “data-centric on” X
– Enumerate Y, search X: “data-centric on” Y
– Can speed up searching by scattering into a dense vector
– If both sorted: “2-finger” merge (sketched below)
• Best choice depends on how X and Y are stored
• What is the general picture?
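For instance, the “2-finger” merge, applicable when both vectors keep their non-zeros sorted by index, might look like this hedged C sketch (identifiers assumed):

double sparse_dot_merge(int xn, const int *xi, const double *xv,
                        int yn, const int *yi, const double *yv)
{
    double sum = 0.0;
    int p = 0, q = 0;                 /* one "finger" per vector */
    while (p < xn && q < yn) {
        if (xi[p] == yi[q])           /* both non-zero at this index */
            sum += xv[p++] * yv[q++];
        else if (xi[p] < yi[q])
            p++;                      /* advance whichever finger lags */
        else
            q++;
    }
    return sum;
}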
An observation
DO I=1,N
IF (X(I) ≠ 0 and Y(I) ≠ 0) THEN
sum = sum + X(I)*Y(I)
[Figure: dot-relations.fig, the arrays X and Y drawn as relations over the common index i]
• Can view arrays as relations (as in “relational databases”): X(i,x) relates each index i to the value x = X(i), and likewise Y(i,y); e.g. the non-zero tuples of X = (0,3,0,5) are (2,3) and (4,5)
• Have to enumerate solutions to the relational query
Join(X(i,x), Y(i,y))
Connection to relational queries
• Dot product = Join(X,Y)

    Dot product          Equi-join
    Enumerate/Search     Enumerate/Search
    Scatter              Hash join
    “2-finger”           (Sort)Merge join
• General case?
From loop nests to relational queries
DO I, J, K, ...
…..A(F(I,J,K,...))…..B(G(I,J,K,...))…..
• Arrays are relations (e.g. A(r,c,a))
– Implicitly store zeros and non-zeros
• Integer space of loop variables is a relation, too: Iter(i,j,k,…)
• Access predicate S: relates loop variables and array elements
• Sparsity predicate P: “interesting” combination of zeros/non-zeros
Select(P, Select(S ∧ Bounds, Product(Iter, A, B, …)))
Why relational queries?
“[Relational model] provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation and organization of data on the other.”
    E. F. Codd (CACM, 1970)
• Want to separate what is to be computed from how
Bernoulli Sparse Compilation Toolkit
Input program → Front-end → Query → Optimizer → Plan → Instantiator → Low-level C code
    (formats plugged into the Instantiator: CRS, CCS, BRS, Coordinate, ….)
• BSCT is about 40K lines of ML + 9K lines of C
• Query optimizer at the core
• Extensible: new formats can be added
Query optimization: ordering joins
select(a ≠ 0 ∧ x ≠ 0, join(A(i,j,a), X(j,x), Y(i,y)))

A in CCS: Join(Join(A,X), Y)
    FOR J ∈ Join(A,X)
        FOR I ∈ Join(A(*,J), Y)

A in CRS: Join(Join(A,Y), X)
    FOR I ∈ Join(A,Y)
        FOR J ∈ Join(A(I,*), X)
Query optimization: implementing joins
FOR I ∈ Join(A,Y)
    FOR J ∈ Join(A(I,*), X)
        .....

        ⇓

H = scatter(X)
FOR I ∈ Merge(A,Y)
    FOR J ∈ enumerate A(I,*), search H
        .....
• Output is called a plan
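In hedged C form (identifiers assumed), the second plan's two pieces might look like this: scatter X into a zero-initialized dense array once, after which every “search” is a direct array access:

void scatter(int xn, const int *xi, const double *xv, double *H)
{
    for (int k = 0; k < xn; k++)
        H[xi[k]] = xv[k];                /* H is zero everywhere else */
}

/* enumerate row I of A (in CRS) and search H */
double row_times_H(int i, const int *rowp, const int *colind,
                   const double *vals, const double *H)
{
    double sum = 0.0;
    for (int k = rowp[i]; k < rowp[i+1]; k++)
        sum += vals[k] * H[colind[k]];   /* O(1) lookup, as in a hash join */
    return sum;
}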
Instantiator: executable code generation
H = scatter(X)
FOR I ∈ Merge(A,Y)
    FOR J ∈ enumerate A(I,*), search H
        .....

        ⇓

for (I = 0; I < N; I++)
    for (JJ = ROWP(I); JJ < ROWP(I+1); JJ++)
        .....

• Macro expansion
• Open system
Summary of the compilation techniques
• Data-centric methodology: walk the data, compute accordingly
• Implementation for sparse arrays
– arrays = relations, loop nests = queries
• Compilation path
– Main steps are independent of data structure
implementations
• Parallel code generation
– Ownership, communication sets,... = relations
• Difference from traditional relational database query optimization
– Selectivity of predicates not an issue; affine joins
Experiments
• Sequential
– Kernels from SPBLAS library
– Iterative solution of linear systems
• Parallel
– Iterative solution of linear systems
– Comparison with the BlockSolve library from Argonne NL
– Comparison with the proposed “High-Performance Fortran”
standard
Setup
• IBM SP-2 at Cornell
• 120 MHz P2SC processor at each node
– Can issue 2 multiply-add instructions per cycle
– Peak performance 480 Mflops
– Much lower on sparse problems: < 100 Mflops
• Benchmark matrices
– From Harwell-Boeing collection
– Synthetic problems
Matrix-vector products
[Bar charts: Mflops (0-100) for block sparse matrix-vector product, BSR/MV and VBR/MV, block sizes 5 to 25; bars for LIB, BSCT, BSCT_OPT]
• BSR = “Block Sparse Row”
• VBR = “Variable Block Sparse Row”
• BSCT_OPT = Some “Dragon Book” optimizations by hand
– Loop invariant removal
Solution of Triangular Systems
[Bar charts: Mflops (0-100) for sparse triangular solution, BSR/TS and VBR/TS, block sizes 5 to 25; bars for LIB, BSCT, BSCT_OPT]
• Bottom line:
– Can compete with the SPBLAS library
(need to implement loop invariant removal :-)
Iterative solution of sparse linear systems
• Essential for large-scale simulations
• Preconditioned Conjugate Gradients (PCG) algorithm
– Basic kernels: y=Ax, Lx=b + dense vector ops
• Preprocessing step
– Find M such that M⁻¹AM⁻¹ ≈ I
– Incomplete Cholesky factorization (ICC): A ≈ CCᵀ
– Basic kernels: A ← A − uvᵀ, sparse vector scaling
– Cannot be implemented using the SPBLAS library
• Used CCS format (“natural” for ICC)
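A hedged sketch of the rank-one kernel restricted to A's stored non-zeros (no fill), as an incomplete factorization needs it; u is assumed pre-scattered into a zero-initialized dense work array W, v is dense, and all names are illustrative:

/* A <- A - u*v^T, applied only to the stored entries of A (CCS). */
void ccs_rank1_update(int n, const int *colp, const int *rowind,
                      double *vals, const double *W, const double *v)
{
    for (int j = 0; j < n; j++) {
        double vj = v[j];
        if (vj == 0.0) continue;             /* column untouched by update */
        for (int k = colp[j]; k < colp[j+1]; k++)
            vals[k] -= W[rowind[k]] * vj;    /* update existing non-zeros only */
    }
}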
Iterative Solution
[Bar chart ICC/PCG: Mflops vs matrix size (2000, 4000, 8000, 16000); PCG sustains 40-48 Mflops, ICC only 1.9-5.86 Mflops]
• ICC: a lot of “sparse overhead”
• Ongoing investigation (at MathWorks):
– Our compiler-generated ICC is 50-100 times faster than the Matlab implementation!!
Iterative solution (cont.)
• Preliminary comparison with IBM ESSL DSRIS
– DSRIS implements PCG (among other things)
• On BCSSTK30; matrix values were set to vary the convergence
• BSCT ICC takes 1.28 secs
• ESSL DSRIS preprocessing (ILU+??) takes ~5.09 secs
• PCG iterations are ~15% faster in ESSL
Parallel experiments
• Conjugate Gradient algorithm
– vs BlockSolve library (Argonne NL)
• “Inspector” phase
– Pre-computes what communication needs to occur
– Done once, might be expensive
• “Executor” phase
– “Receive-compute-send-...”
• On Boeing-Harwell matrices
• On synthetic grid problems to understand scalability
Executor performance
[Charts: executor time (seconds) vs number of processors, Bsolve vs BSCT; grid problems on 2-16 processors, BCSSTK32 on 2-64 processors]
• Grid problems: problem size per processor is constant
– 135K rows, ~4.6M non-zeroes
• Within 2-4% of the library
Inspector overhead
[Charts: inspector overhead (ratio to one executor iteration), Bsolve vs BSCT vs HPF-2; grid problems on 2-16 processors, BCSSTK32 on 2-64 processors]
• Ratio of the inspector to single iteration of the executor
– A problem-independent measure
• “HPF-2” -- new data-parallel Fortran standard
– Lowest-common denominator, inspectors are not scalable
Experiments: summary
• Sequential
– Competitive with SPBLAS library
• Parallel
– Inspector phase should exploit formats
(cf. HPF-2)
Outline
• Problem statement
• State-of-the-art restructuring compiler technology
• Technical approach and experiments
• Ongoing work and conclusions
Ongoing work
• Packaging
– “Library-on-demand”; as a Matlab toolbox
– Completely automatic tool; data structure selection
• Out-of-core computations
• Parallel code generation
– Extend to handle more kernels
• Core of the compiler
– Disjunctive queries, fill
Related work - languages and compilers
• Compilation of dense matrix (regular) codes
– Polyhedral model [Lamport ‘78, ...]
– Data-parallel languages (e.g. HPF, ZPL)
• Compilation of sparse matrix codes
– [Bik, Wijshoff ‘92] -- sequential sparse compiler
– [Saltz, Zima, Ujaldon, …] -- irregular computations in HPF-2
– Fixed data structures, not extensible; separate compilation path for dense/sparse
– Data-centric blocking [Kodukula, Ahmed, Pingali ‘97]
• Programming with ADTs
– SETL -- automatic data structure selection for set operations
– Transformational systems (e.g. Polya [Gries])
– Software reuse through views [Novak]
Related work - databases
• Optimizing loops in DB programming languages [Lieuwen,DeWitt‘90]
• Extensible database systems [Predator, …]
[Diagram: Predator’s extensible architecture (relations, images, sequences, each with its own optimizer, plus utilities) alongside BSCT (CRS, CCS, BSR, COORD formats plugged into the compiler’s optimizer and utilities)]
Conclusions/Contributions
• Sparse matrix computations are widely used in practice
• Code development is tedious and error-prone
• Bernoulli Sparse Compilation Toolkit:
– Arrays as relations, loops as queries
– Compilation as query optimization
• Algebras in optimizing compilers
    Data-flow analysis           Lattice algebra
    Dense matrix computations    Polyhedral algebra
    Sparse matrix computations   Relational algebra
Future work
• Sparse compiler
– Completely automatic system, data structure selection
– Out-of-core computations
• Extensible compilation
– Want to support multiple formats, optimizations across basic ops
• Application area: signal/image processing
– Compositions of transforms, such as FFTs and the DCT
– Multiple data representations
– Important to optimize the flow of data through memory hierarchies
– IBM ESSL: FFT at close to peak performance
– Algebra: Kronecker products
• Programming languages and databases
– e.g. Java and JDBC
Future interests
[Diagram: future interests at the intersections of compilers, computational science, and databases: sparse compiler, data mining, database programming systems]