When Cache Blocking of Sparse Matrix Vector Multiply Works and Why
By Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick
BeBOP Project, U.C. Berkeley
June, 2004
http://bebop.cs.berkeley.edu
rajeshn@eecs.berkeley.edu

General Approach
• Sparse kernels are prevalent in many applications
• Automatically tune kernels to get optimum performance
  – Naïve performance is typically less than 10% of machine peak
  – Performance depends on both the matrix and the platform
  – Numerous parameters can be adjusted
• Create analytic performance models
• Use the performance models as the basis for a system that automatically picks the correct parameters
• Verify performance models using built-in hardware counters

Introduction and Overview
• Sparse Matrix Vector Multiply (SpM x V): y ← y + A∙x
  – x, y are dense vectors
    • We will call x the source vector and y the destination vector
  – A is a sparse matrix (<1% of entries are nonzero)
• Naïve Performance Pitfalls
  – High memory bandwidth requirements
  – Poor locality (indirect and irregular memory accesses)
  – Poor instruction mix (low ratio of flops to memory operations)
[Figure: matrix A, source vector x, destination vector y]

Cache Blocking
• Assume a Compressed Sparse Row (CSR) format
• Cache blocking breaks the CSR matrix into multiple smaller CSR matrices
  – Improves temporal locality in accesses to the source vector
  – Adds an extra set of row pointers for each block
• Goal: Given a matrix and a machine combination, pick the correct cache block size
• Key Tradeoff
  – Does the benefit of the added temporal locality outweigh the cost of accessing the added overhead?
• An extension of the work done by Eun-Jin Im on Sparsity
[Figure: an r x c cache block]

New Cache Blocking Optimizations
• Row Start/End
  – Band matrices may lead to many rows with all zeros
  – New optimization avoids iterating over these rows by storing the starting row and ending row between which nonzero elements exist
  – Example: (figure omitted)
• Exploiting Cache Block Structure
  – Unlike previous versions of the code,
    treat each of the sub-matrices as an individual sparse matrix
  – Allows for easier incorporation of different kernels and multiple levels of cache blocking
  – Negligible performance cost because the number of cache blocks is small

Related Work
• Traditional dense methods to find tile sizes cannot be used because of indirect and irregular memory accesses.
• Similar work in the sparse case
  – Temam and Jalby, Heras et al., and Fraguela et al. have developed probabilistic models that predict cache misses.
  – These models distinguish themselves by their ability to account for self and cross interference misses, but they assume a uniform distribution of nonzeros
  – Our models account for multiple levels in the memory system, including the TLB, and explicitly model the execution time
• Other areas
  – Gropp et al. and Heber et al. use similar techniques to tune applications in other domains

Analytic Models of the Memory System
• Execution Time Model
  – Cache misses are modeled and verified with hardware counters
  – Charge α_i for hits at each cache level
    • T = (L1 hits) α_1 + (L2 hits) α_2 + (mem hits) α_mem + (TLB misses) α_TLB
• Cache Models
  – Lower bound on cache misses assumes only compulsory misses
  – Upper bound is the same as the lower bound except that every access to the source and destination vectors misses
  – Based on models by Vuduc et al. in Supercomputing 2002

Analytic Models of Memory System (cont.)
• Models are too simple to show the performance advantages of Cache Blocking
  – According to the models, Cache Blocking has no benefit since the blocking adds overhead to the data structure
  – Need to model one of the levels of the memory system more accurately to expose the advantages of locality
  – Empirical evidence suggests the largest performance gains from cache blocking come from minimizing TLB misses
• Empirical evidence suggests two categories of block sizes that work
[Table not reproduced. For each row and column block size, the value in the cell is the number of matrices out of our 14 test matrices whose performance was within 90% of peak if that block size was chosen.]

Analytic Model Verification
• 14 test matrices in our suite with varying density and structure
  – Except for the dense matrix in sparse format, all the matrices are too sparse to be aided by register blocking
  – Source vectors for all matrices are much larger than the largest level of cache
  – Wide range of applications, including Linear Programming, Web connectivity, and Latent Semantic Indexing.
• 3 Platforms
  – Intel Itanium 2
  – IBM Power 4
  – Intel Pentium 3
• Where available, we used the hardware counters through PAPI

Model Verification
• Analytic Model
  – Overpredicts performance by more than a factor of 2
  – Relative performance is well predicted, implying that the model can be used as the basis for a heuristic
  – Cache Blocking improves performance of certain matrices and doesn’t yield improvements on others
• PAPI Model
  – Instead of lower and upper bound models for cache and TLB misses, true hardware counter values are used
  – Still overpredicts performance, implying that there is time that we are not accounting for.
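
As a rough illustration of the execution time model from the earlier slide, the following Python sketch evaluates T from a set of miss counts. The function name `modeled_time`, its parameters, and the sample numbers are hypothetical and chosen only for this example. It also assumes that hits at a cache level equal the accesses reaching that level minus the misses at that level, which is one common way to derive hit counts from miss counters and is not spelled out on the slides.

```python
from typing import Sequence


def modeled_time(
    loads: float,
    cache_misses: Sequence[float],
    cache_alphas: Sequence[float],
    alpha_mem: float,
    tlb_misses: float,
    alpha_tlb: float,
) -> float:
    """Evaluate T = sum_i (Li hits) * alpha_i + (mem hits) * alpha_mem
    + (TLB misses) * alpha_TLB for an arbitrary number of cache levels.
    """
    if len(cache_misses) != len(cache_alphas):
        raise ValueError("need exactly one alpha per cache level")

    total = 0.0
    accesses = loads  # accesses that reach the current cache level
    for misses, alpha in zip(cache_misses, cache_alphas):
        # Assumption: hits at this level = accesses reaching it - its misses.
        total += (accesses - misses) * alpha
        accesses = misses  # only the misses continue to the next level
    total += accesses * alpha_mem  # misses in the last cache are served by memory
    total += tlb_misses * alpha_tlb
    return total


# Hypothetical counter values and latencies, not measurements from the slides.
unblocked = modeled_time(1000, [100, 10], [1.0, 10.0], 100.0, 5, 20.0)
blocked = modeled_time(1000, [100, 4], [1.0, 10.0], 100.0, 1, 20.0)
print(unblocked, blocked)
```

Passing lower or upper bound miss counts corresponds to the analytic model, while passing measured hardware counter values corresponds to the PAPI model; in both cases only the relative ordering of the resulting times would be used to compare block sizes.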

L3 Data Cache Miss Reduction
• Matrices with the largest speedups experience the largest decreases in the number of cache misses
• More than 1 order of magnitude fewer misses in some cases

TLB Miss Reduction
• Matrices with the largest speedups experience the largest decreases in the number of TLB misses
• More than 2 orders of magnitude fewer misses in some cases

Speedup
• For the same matrix, speedup varies across platforms.
• Platforms with larger processor-memory gaps benefit more from cache blocking
• Matrices 5 and 6 experience the best speedups on all the platforms
• Matrices 12-14, which are extremely sparse, show very little speedup

Sparse Band Matrices
• Row Start/End (RSE) optimization greatly reduces sensitivity to block size
• Best performance is insensitive to row block size
• Cache blocking is not useful for a banded matrix because there is less cache thrashing
• RSE optimization forgives the choice of a bad block size

When and Why Cache Blocking Works
• Matrix Structure
  – Matrices that have a large column dimension are aided by cache blocking
    • The gains through temporal locality in accesses to the source vector x far outweigh the costs of the added data structure
  – If the matrix is too sparse, cache blocking does not help
    • There is not enough temporal locality to amortize the added overhead
  – Band matrices are not aided by cache blocking
    • Matrix already has optimal access pattern

Platform Evaluation
• Effectiveness of cache blocking varies across platforms for the same matrix
  – As the average number of cycles to access main memory goes up, cache blocking helps more, since it reduces expensive accesses to main memory
• TLB misses account for a non-trivial part of the execution time
  – Larger page sizes would mitigate this problem
• In SpM x V there are two types of access
  – Streaming access to matrix (only spatial locality exploited)
    • Keeping matrix in cache wastes space for source vector
  – Random access to source vector (both spatial and temporal locality exploited)
    • The more of the source vector that can
      stay in cache the better
  – Future work for machines without caches (SX1) or machines with uncached loads (Cray X1)

Conclusions
• Cache Blocking of SpM x V works when the benefits of the added temporal locality in the source vector x outweigh the costs of the added data structure overhead
• We use analytic models to select cache block size
  – Models are good at predicting the relative performance
  – Row Start/End optimization forgives a bad block size choice
• Cache Blocking appears to be most effective on matrices in which the column dimension is much greater than the row dimension
• Banded matrices are not aided by cache blocking since the structure already lends itself to the optimal access pattern.

• Full Technical Report
  – Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply (UCB/CSD-04-1335, June, 2004.) Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick
• Project website: http://bebop.cs.berkeley.edu
• My website: http://www.cs.berkeley.edu/~rajeshn

References (1/3)
– 1. S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November 2000.
– 2. B. B. Fraguela, R. Doallo, and E. L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3), 1999.
– 3. W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

References (2/3)
– 4. G. Heber, A. J. Dolgert, M. Alt, K. A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the Intel Itanium architecture: A case study.
  In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX, 2001.
– 5. D. B. Heras, V. B. Perez, J. C. C. Dominguez, and F. F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210, 1999.
– 6. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May 2000.
– 7. Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations, 1994. www.cs.umn.edu/Research/arpa/SPARSKIT/sparskit.html.
– 8. R. H. Saavedra-Barrera. CPU Performance Evaluation and Execution Time Prediction Using Narrow Spectrum Benchmarking. PhD thesis, University of California, Berkeley, February 1992.

References (3/3)
– 9. A. Snavely, L. Carrington, and N. Wolter. Modeling application performance by convolving machine signatures with application profiles. 2001.
– 10. O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.
– 11. R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
– 12. R. W. Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis, University of California, Berkeley, 2003.

This work was supported in part by the National Science Foundation under ACI-9619020 and ACI-0090127, and by gifts from Intel Corporation, Hewlett Packard, and Sun Microsystems. The information presented here does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.