BeBOP (Berkeley Benchmarking and OPtimization)

When Cache Blocking of Sparse Matrix
Vector Multiply Works and Why
By Rajesh Nishtala, Richard W. Vuduc,
James W. Demmel, and Katherine A. Yelick
BeBOP Project, U.C. Berkeley
June, 2004
http://bebop.cs.berkeley.edu
rajeshn@eecs.berkeley.edu
General Approach
• Sparse kernels are prevalent in many applications
• Automatically tune kernels to get optimum performance
– Naïve performance is typically less than 10% of machine peak
– Performance dependent on matrix and platform
– Numerous parameters that can be adjusted
• Create analytic performance models
• Use the performance models as a basis for a system that will
automatically pick the correct parameters
• Verify performance models using built-in hardware counters
Introduction and Overview
• Sparse Matrix Vector Multiply (SpM x V): y ← y + A·x
– x, y are dense vectors
• We will call x the source vector and y the destination vector
– A is a sparse matrix (<1% of entries are nonzero)
• Pitfalls of Naïve Performance
– High memory bandwidth requirements
– Poor locality (indirect and irregular memory accesses)
– Poor instruction mix (low flops to memory operations ratio)
[Figure: SpM x V with sparse matrix A, source vector x, and destination vector y; a reference CSR kernel is sketched below]
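As a point of reference, a minimal CSR kernel for y ← y + A·x is sketched below; the array names (val, col_idx, row_ptr) follow the usual CSR convention and are not taken from the BeBOP source.

/* Minimal CSR SpMV sketch: y <- y + A*x.
 * Array names follow the usual CSR convention (illustrative, not the
 * BeBOP implementation). Note the indirect, irregular access to x. */
void spmv_csr(int nrows, const double *val, const int *col_idx,
              const int *row_ptr, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];  /* x[col_idx[k]] is the irregular access */
        y[i] = sum;
    }
}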
Cache Blocking
• Assume a Compressed Sparse Row (CSR) format
• Cache blocking breaks the CSR matrix into multiple smaller r x c CSR matrices (sketched below)
  – Improves temporal locality in accesses to the source vector
  – Adds an extra set of row pointers for each block
• Goal: Given a matrix and a machine combination, pick the correct cache block size
• Key Tradeoff
  – Does the benefit of the added temporal locality outweigh the costs associated with accessing the added overhead?
• An extension of the work done by Eun-Jin Im on Sparsity
[Figure: an r x c cache block within the sparse matrix]
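A sketch of what the blocked multiply could look like, assuming the matrix has already been split into a grid of r x c CSR submatrices; the struct layout and names are illustrative assumptions, not the BeBOP code.

/* Cache-blocked SpMV sketch. Each submatrix is a small CSR matrix with its
 * own row pointers (the overhead mentioned above); column indices are local
 * to the block, so each block reuses only a c-long slice of x. */
typedef struct {
    int nrows;       /* rows in this block (<= r)          */
    double *val;     /* nonzero values                     */
    int *col_idx;    /* column indices, local to the block */
    int *row_ptr;    /* nrows + 1 row pointers             */
} csr_block;

void spmv_cache_blocked(int nbrows, int nbcols, int r, int c,
                        csr_block **blocks, const double *x, double *y)
{
    for (int bi = 0; bi < nbrows; bi++) {         /* loop over block rows    */
        double *yb = y + (long) bi * r;
        for (int bj = 0; bj < nbcols; bj++) {     /* loop over block columns */
            const csr_block *b = &blocks[bi][bj];
            const double *xb = x + (long) bj * c; /* slice of x kept in cache */
            for (int i = 0; i < b->nrows; i++) {
                double sum = yb[i];
                for (int k = b->row_ptr[i]; k < b->row_ptr[i + 1]; k++)
                    sum += b->val[k] * xb[b->col_idx[k]];
                yb[i] = sum;
            }
        }
    }
}

The extra row_ptr arrays are the added overhead in the tradeoff above; the benefit is that each block touches only an r-long piece of y and a c-long piece of x, which can stay resident in cache.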
New Cache Blocking Optimizations
• Row Start/End
  – Band matrices may lead to many rows that contain only zeros
  – The new optimization avoids iterating over these rows by recording the starting and ending rows on which nonzero elements exist
  – Example: see the sketch after this list
• Exploiting Cache Block Structure
  – Unlike previous versions of the code, treat each of the submatrices as an individual sparse matrix
  – Allows easier incorporation of different kernels and multiple levels of cache blocking
  – Negligible performance cost because of the small number of cache blocks
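A hedged sketch of the Row Start/End idea for a single block, assuming each block simply records the first and last row containing nonzeros (field names are assumptions):

/* Row Start/End sketch: skip the all-zero rows at the top and bottom of a
 * block by iterating only over [row_start, row_end]. */
typedef struct {
    int row_start, row_end;  /* first and last row with nonzeros   */
    double *val;
    int *col_idx;            /* column indices, local to the block */
    int *row_ptr;
} csr_block_rse;

void spmv_block_rse(const csr_block_rse *b, const double *xb, double *yb)
{
    for (int i = b->row_start; i <= b->row_end; i++) {
        double sum = yb[i];
        for (int k = b->row_ptr[i]; k < b->row_ptr[i + 1]; k++)
            sum += b->val[k] * xb[b->col_idx[k]];
        yb[i] = sum;
    }
}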
Related Work
• Traditional dense methods to find tile sizes cannot be used
because of indirect and irregular memory accesses.
• Similar work in sparse case
– Temam and Jalby, Heras et al., and Fraguela et al. have developed probabilistic models that predict cache misses
– These models distinguish themselves by their ability to account for self- and cross-interference misses, but they assume a uniform distribution of nonzeros
– Our models account for multiple levels in the memory system
including the TLB and explicitly model the execution time
• Other areas
– Gropp et al. and Heber et al. use similar techniques to tune applications in other domains
Analytic Models of the Memory System
• Execution Time Model
– Cache misses are modeled and verified with hardware
counters
– Charge αi for hits at each cache level
• T = (L1 hits) α1 + (L2 hits) α2
+ (mem hits) αmem + (TLB misses) αTLB
– Cache Models
• Lower bound on cache misses assumes only compulsory misses
• Upper bound is the same as the lower bound, except that every access to the source and destination vectors misses
• Based on models by Vuduc et al. in Supercomputing 2002 (a small evaluator of T is sketched below)
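To make the charge model concrete, a one-line evaluator of T is sketched below; the alpha values are machine-specific access latencies that have to be measured, and none of the numbers are taken from the poster.

/* T = (L1 hits)*a1 + (L2 hits)*a2 + (mem hits)*a_mem + (TLB misses)*a_tlb.
 * The hit/miss counts come from the cache model (or hardware counters);
 * the alpha latencies are measured per machine. */
double model_exec_time(double l1_hits, double l2_hits, double mem_hits,
                       double tlb_misses,
                       double a1, double a2, double a_mem, double a_tlb)
{
    return l1_hits * a1 + l2_hits * a2 + mem_hits * a_mem + tlb_misses * a_tlb;
}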
Analytic Models of Memory System (cont.)
• Models are too simple to show the performance advantages of cache blocking
  – According to the models, cache blocking has no benefit, since blocking adds overhead to the data structure
  – Need to model one of the levels of the memory system more accurately to expose the advantages of locality
  – Empirical evidence suggests the largest performance gains from cache blocking come from minimizing TLB misses
• Empirical evidence suggests two categories of block sizes that work
[Table: for each row and column block size, the cell value gives the number of our 14 test matrices whose performance was within 90% of peak when that block size was chosen]
Analytic Model Verification
• 14 test matrices in our suite with varying density and structure
– Except for the dense matrix stored in sparse format, all the matrices are too sparse to be aided by register blocking
– Source vectors for all matrices are much larger than the largest level of cache
– Wide range of applications, including Linear Programming, Web connectivity, and Latent Semantic Indexing
• 3 Platforms
– Intel Itanium 2
– IBM Power 4
– Intel Pentium 3
• Where available, we used hardware counters through PAPI (a measurement sketch follows below)
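A hedged sketch of how miss counts might be gathered around one SpMV call with PAPI preset events; which presets exist varies by platform, and this is not the actual BeBOP measurement harness.

/* Count L1/L2 data cache misses and data TLB misses around a kernel call
 * using PAPI preset events (platform support for each preset varies). */
#include <papi.h>
#include <stdio.h>

void measure_spmv(void (*kernel)(void *), void *args)
{
    int events[3] = { PAPI_L1_DCM, PAPI_L2_DCM, PAPI_TLB_DM };
    long long counts[3];
    int evset = PAPI_NULL;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_events(evset, events, 3);

    PAPI_start(evset);
    kernel(args);                    /* the SpMV being measured */
    PAPI_stop(evset, counts);

    printf("L1 DCM = %lld, L2 DCM = %lld, TLB DM = %lld\n",
           counts[0], counts[1], counts[2]);
}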
Model Verification
• Analytic Model
  – Overpredicts performance by more than a factor of 2
  – Relative performance is well predicted, implying that it can be used as the basis for a heuristic
  – Cache blocking improves the performance of certain matrices and yields no improvement on others
• PAPI Model
  – Instead of lower and upper bound models for cache and TLB misses, true hardware counter values are used
  – Still overpredicts performance, implying that there is time we are not accounting for
L3 Data Cache Miss Reduction
• Matrices with the largest speedups experience the largest decreases in the number of cache misses
• More than one order of magnitude fewer misses in some cases
TLB Miss Reduction
• Matrices with the largest speedups experience the largest decreases in the number of TLB misses
• More than two orders of magnitude fewer misses in some cases
Speedup
• For the same matrix, speedup varies across platforms
• Platforms with larger processor-memory gaps benefit more from cache blocking
• Matrices 5 and 6 experience the best speedups on all the platforms
• Matrices 12-14, which are extremely sparse, show very little speedup
Sparse Band Matrices
• The Row Start/End optimization greatly reduces sensitivity to block size
• Best performance is insensitive to row block size
• Cache blocking is not useful for a banded matrix because there is less cache thrashing
• The RSE optimization forgives the choice of a bad block size
When and Why Cache Blocking Works
• Matrix Structure
  – Matrices that have a large column dimension are aided by cache blocking (a heuristic combining these criteria is sketched after this list)
    • The gains through temporal locality in accesses to the source vector x far outweigh the costs of the added data structure
  – If the matrix is too sparse, cache blocking does not help
    • There is not enough temporal locality to amortize the added overhead
  – Band matrices are not aided by cache blocking
    • The matrix already has an optimal access pattern
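A hedged heuristic sketch mirroring these criteria; it is not the BeBOP selection procedure, and every threshold below is an illustrative assumption.

/* Decide whether cache blocking is likely to pay off, following the three
 * structural criteria above. Thresholds are illustrative assumptions. */
#include <stdbool.h>

bool cache_blocking_likely_helps(long long ncols, long long nnz,
                                 long long bandwidth, long long cache_doubles)
{
    bool x_exceeds_cache = ncols > cache_doubles;        /* source vector x will not fit */
    bool too_sparse      = (double) nnz / ncols < 1.0;   /* too little reuse of x        */
    bool banded          = bandwidth < cache_doubles;    /* access pattern already good  */
    return x_exceeds_cache && !too_sparse && !banded;
}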
Platform Evaluation
• Effectiveness of cache blocking varies across platforms for same
matrix
  – As the average number of cycles to access main memory goes up, cache blocking helps more, since it reduces expensive accesses to main memory
• TLB misses account for a nontrivial part of the execution time
  – Larger page sizes would mitigate this problem
• In SpM x V there are two types of accesses
  – Streaming access to the matrix (only spatial locality is exploited)
    • Keeping the matrix in cache wastes space needed for the source vector
  – Random access to the source vector (both spatial and temporal locality exploited)
    • The more of the source vector that can stay in cache, the better
  – Future work: machines without caches (SX1) or machines with uncached loads (Cray X1)
Conclusions
• Cache blocking of SpM x V works when the benefits of the added temporal locality in the source vector x outweigh the costs of the added data structure overhead
• We use analytic models to select the cache block size
  – Models are good at predicting relative performance
  – The Row Start/End optimization forgives a bad block size choice
• Cache blocking appears to be most effective on matrices in which the column dimension is much greater than the row dimension
• Banded matrices are not aided by cache blocking since their structure already lends itself to the optimal access pattern
• Full Technical Report
  – Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply (UCB/CSD-04-1335, June 2004). Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick
• Project website: http://bebop.cs.berkeley.edu
• My website: http://www.cs.berkeley.edu/~rajeshn
References (1/3)
• Full Technical Report
– Performance Modeling and Analysis of Cache Blocking in Sparse
Matrix Vector Multiply
(UCB/CSD-04-1335, June, 2004.)
Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick
• Other References
– 1. S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.
– 2. B. B. Fraguela, R. Doallo, and E. L. Zapata. Memory hierarchy
performance prediction for sparse blocked algorithms. Parallel Processing
Letters, 9(3), 1999.
– 3. W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.
References (2/3)
– 4. G. Heber, A. J. Dolgert, M. Alt, K. A. Mazurkiewicz, and L. Stringer.
Fracture mechanics on the Intel Itanium architecture: A case study. In
Workshop on EPIC Architectures and Compiler Technology (ACM MICRO
34), Austin, TX, 2001.
– 5. D. B. Heras, V. B. Perez, J. C. C. Dominguez, and F. F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210, 1999.
– 6. E.-J. Im. Optimizing the performance of sparse matrix-vector
multiplication. PhD thesis, University of California, Berkeley, May 2000.
– 7. Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994.
www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.
– 8. R. H. Saavedra-Barrera. CPU Performance Evaluation and Execution Time
Prediction Using Narrow Spectrum Benchmarking. PhD thesis, University of
California, Berkeley, February 1992.
References (3/3)
– 9. A. Snavely, L. Carrington, and N. Wolter. Modeling application performance by convolving machine signatures with application profiles. 2001.
– 10. O. Temam and W. Jalby. Characterizing the behavior of sparse
algorithms on caches. In Proceedings of Supercomputing '92, 1992.
– 11. R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee.
Performance optimizations and bounds for sparse matrix-vector multiply. In
Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
– 12. R. W. Vuduc. Automatic performance tuning of sparse matrix kernels.
PhD thesis, University of California, Berkeley, 2003.
This work was supported in part by the National Science Foundation under ACI-9619020 and ACI-0090127, and by gifts from Intel Corporation, Hewlett Packard, and Sun Microsystems.
The information presented here does not necessarily reflect the position or the policy of the
Government and no official endorsement should be inferred.