Automatic Performance Tuning and Sparse-Matrix-Vector-Multiplication (SpMV) James Demmel www.cs.berkeley.edu/~demmel/cs267_Spr11 Outline • • • • • Motivation for Automatic Performance Tuning Results for sparse matrix kernels OSKI = Optimized Sparse Kernel Interface Tuning Higher Level Algorithms Future Work, Class Projects • BeBOP: Berkeley Benchmarking and Optimization Group – Many results shown from current and former members – Meet weekly T 12:30-2, in 511 Soda Motivation for Automatic Performance Tuning • Writing high performance software is hard – Make programming easier while getting high speed • Ideal: program in your favorite high level language (Matlab, Python, PETSc…) and get a high fraction of peak performance • Reality: Best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler,… – Best choice can depend on knowing a lot of applied mathematics and computer science • How much of this can we teach? • How much of this can we automate? Examples of Automatic Performance Tuning (1) • Dense BLAS – – – – Sequential PHiPAC (UCB), then ATLAS (UTK) (used in Matlab) math-atlas.sourceforge.net/ Internal vendor tools • Fast Fourier Transform (FFT) & variations – Sequential and Parallel – FFTW (MIT) – www.fftw.org • Digital Signal Processing – SPIRAL: www.spiral.net (CMU) • Communication Collectives (UCB, UTK) • Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), … • More projects, conferences, government reports, … Examples of Automatic Performance Tuning (2) • What do dense BLAS, FFTs, signal processing, MPI reductions have in common? – Can do the tuning off-line: once per architecture, algorithm – Can take as much time as necessary (hours, a week…) – At run-time, algorithm choice may depend only on few parameters • Matrix dimension, size of FFT, etc. Tuning Register Tile Sizes (Dense Matrix Multiply) 333 MHz Sun Ultra 2i 2-D slice of 3-D space; implementations colorcoded by performance in Mflop/s 16 registers, but 2-by-3 tile size fastest Needle in a haystack Example: Select a Matmul Implementation Example: Support Vector Classification Machine Learning in Automatic Performance Tuning • References – Statistical Models for Empirical Search-Based Performance Tuning (International Journal of High Performance Computing Applications, 18 (1), pp. 65-94, February 2004) Richard Vuduc, J. Demmel, and Jeff A. Bilmes. – Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning (Computer Science PhD Thesis, University of California, Berkeley. UCB//EECS-2009-181 ) Archana Ganapathi Examples of Automatic Performance Tuning (3) • What do dense BLAS, FFTs, signal processing, MPI reductions have in common? – Can do the tuning off-line: once per architecture, algorithm – Can take as much time as necessary (hours, a week…) – At run-time, algorithm choice may depend only on few parameters • Matrix dimension, size of FFT, etc. • Can’t always do off-line tuning – Algorithm and implementation may strongly depend on data only known at run-time – Ex: Sparse matrix nonzero pattern determines both best data structure and implementation of Sparse-matrixvector-multiplication (SpMV) – Part of search for best algorithm just be done (very quickly!) 
at run-time Source: Accelerator Cavity Design Problem (Ko via Husbands) Linear Programming Matrix … A Sparse Matrix You Encounter Every Day SpMV with Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1]-1 do y[i] = y[i] + val[k]*x[ind[k]] Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem Example: The Difficulty of Tuning • n = 21200 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem • 8x8 dense substructure Taking advantage of block structure in SpMV • Bottleneck is time to get matrix from memory – Only 2 flops for each nonzero in matrix • Don’t store each nonzero with index, instead store each nonzero r-by-c block with index – Storage drops by up to 2x, if rc >> 1, all 32-bit quantities – Time to fetch matrix from memory decreases • Change both data structure and algorithm – Need to pick r and c – Need to change algorithm accordingly • In example, is r=c=8 best choice? – Minimizes storage, so looks like a good idea… Speedups on Itanium 2: The Need for Search Mflop/s Best: 4x2 Reference Mflop/s Register Profile: Itanium 2 1190 Mflop/s 190 Mflop/s Power3 - 13% 195 Mflop/s Power4 - 14% 703 Mflop/s SpMV Performance (Matrix #2): Generation 1 100 Mflop/s Itanium 1 - 7% 225 Mflop/s 103 Mflop/s 469 Mflop/s Itanium 2 - 31% 1.1 Gflop/s 276 Mflop/s Power3 - 17% 252 Mflop/s Power4 - 16% 820 Mflop/s Register Profiles: IBM and Intel IA-64 122 Mflop/s Itanium 1 - 8% 247 Mflop/s 107 Mflop/s 459 Mflop/s Itanium 2 - 33% 1.2 Gflop/s 190 Mflop/s Ultra 2i - 11% 72 Mflop/s Ultra 3 - 5% 90 Mflop/s Register Profiles: Sun and Intel x86 35 Mflop/s Pentium III - 21% 108 Mflop/s 42 Mflop/s 50 Mflop/s Pentium III-M - 15% 122 Mflop/s 58 Mflop/s Another example of tuning challenges • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M Zoom in to top corner • More complicated non-zero structure in general • N = 16614 • NNZ = 1.1M 3x3 blocks look natural, but… • More complicated non-zero structure in general • Example: 3x3 blocking – Logical grid of 3x3 cells • But would lead to lots of “fill-in” Extra Work Can Improve Efficiency! • More complicated non-zero structure in general • Example: 3x3 blocking – – – – Logical grid of 3x3 cells Fill-in explicit zeros Unroll 3x3 block multiplies “Fill ratio” = 1.5 • On Pentium III: 1.5x speedup! – Actual mflop rate 1.52 = 2.25 higher Automatic Register Block Size Selection • Selecting the r x c block size – Off-line benchmark • Precompute Mflops(r,c) using dense A for each r x c • Once per machine/architecture – Run-time “search” • Sample A to estimate Fill(r,c) for each r x c – Run-time heuristic model • Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c) Accurate and Efficient Adaptive Fill Estimation • Idea: Sample matrix – Fraction of matrix to sample: s [0,1] – Cost ~ O(s * nnz) – Control cost by controlling s • Search at run-time: the constant matters! • Control s automatically by computing statistical confidence intervals – Idea: Monitor variance • Cost of tuning – Lower bound: convert matrix in 5 to 40 unblocked SpMVs – Heuristic: 1 to 11 SpMVs Accuracy of the Tuning Heuristics (1/4) See p. 
375 of Vuduc’s thesis for the list of matrices. NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
Accuracy of the Tuning Heuristics (2/4) (chart; DGEMV shown for comparison)
Upper Bounds on Performance for Blocked SpMV • P = (flops) / (time) – Flops = 2 * nnz(A) • Lower bound on time: two main assumptions – 1. Count memory ops only (streaming) – 2. Count only compulsory and capacity misses: ignore conflicts • Account for line sizes • Account for matrix size and nnz • Charge minimum access “latency” α_i at cache level L_i and α_mem at memory – e.g., Saavedra-Barrera and PMaC MAPS benchmarks • Model: Time = Σ_i α_i·Hits_i + α_mem·Hits_mem = α_1·Loads + Σ_i (α_{i+1} − α_i)·Misses_i + (α_mem − α_L)·Misses_L, where L denotes the last cache level
Example: L2 Misses on Itanium 2 (misses measured using PAPI [Browne ’00])
Example: Bounds on Itanium 2
Summary of Other Sequential Performance Optimizations • Optimizations for SpMV – Register blocking (RB): up to 4x over CSR – Variable block splitting: 2.1x over CSR, 1.8x over RB – Diagonals: 2x over CSR – Reordering to create dense structure + splitting: 2x over CSR – Symmetry: 2.8x over CSR, 2.6x over RB – Cache blocking: 2.8x over CSR – Multiple vectors (SpMM): 7x over CSR – And combinations… • Sparse triangular solve – Hybrid sparse/dense data structure: 1.8x over CSR • Higher-level kernels – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB – A²·x: 2x over CSR, 1.5x over RB – [A·x, A²·x, A³·x, …, Aᵏ·x]
Example: Sparse Triangular Factor • Raefsky4 (structural problem) + SuperLU + colmmd • N=19779, nnz=12.6 M • Dense trailing triangle: dim=2268, 20% of total nonzeros (can be as high as 90+%!) • 1.8x over CSR
Cache Optimizations for A·AT·x • Cache-level: interleave multiplication by A and AT – Only fetch A from memory once: with A = [a_1 … a_n], A·AT·x = Σ_{i=1..n} a_i (a_i^T x), i.e., a dot product t = a_i^T x followed immediately by an “axpy” y += t·a_i for each column a_i (sketch below) • Register-level: take a_i^T to be an r×c block row, or a diagonal row
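To make the interleaving concrete, here is a minimal C sketch of y = A·AT·x that brings each column of A into cache once and reuses it immediately for the dot product and the “axpy”. It assumes A is stored by columns (CSC: colptr, rowind, val) purely for illustration; the tuned BeBOP/OSKI kernels are more elaborate (register blocking, multiple vectors).

/* y = A * A^T * x with A (m x n) stored in compressed sparse column
   (CSC) form.  Each column a_j is brought into cache once and reused
   immediately: dot product t = a_j^T x, then axpy y += t * a_j.
   Illustrative sketch only; colptr has length n+1. */
void aat_x_csc(int m, int n, const int *colptr, const int *rowind,
               const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) y[i] = 0.0;
    for (int j = 0; j < n; j++) {
        double t = 0.0;
        for (int k = colptr[j]; k < colptr[j+1]; k++)   /* t = a_j^T x */
            t += val[k] * x[rowind[k]];
        for (int k = colptr[j]; k < colptr[j+1]; k++)   /* y += t * a_j */
            y[rowind[k]] += t * val[k];
    }
}

In the register-level variant mentioned on the slide, a_i^T becomes an r×c block row rather than a single column, so the scalar t turns into a short dense vector of partial products.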
Example: Combining Optimizations (1/2) • Register blocking, symmetry, multiple (k) vectors – Three low-level tuning parameters: r, c, v (r×c register blocks; v of the k vectors processed at a time; diagram: Y += A·X)
Example: Combining Optimizations (2/2) • Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB] – Symmetric, blocked, 1 vector • Up to 2.6x over nonsymmetric, blocked, 1 vector – Symmetric, blocked, k vectors • Up to 2.1x over nonsymmetric, blocked, k vectors • Up to 7.3x over nonsymmetric, nonblocked, 1 vector – Symmetric storage: up to 64.7% savings
Potential Impact on Applications: Omega3P • Application: accelerator cavity design [Ko] • Relevant optimization techniques – Symmetric storage – Register blocking – Reordering, to create more dense blocks • Reverse Cuthill-McKee (RCM) ordering to reduce bandwidth – Do breadth-first search, number nodes in reverse order visited • Traveling Salesman Problem-based ordering to create blocks – Nodes = columns of A – Weights(u, v) = no. of nonzeros u and v have in common – Tour = ordering of columns – Choose maximum-weight tour – See [Pinar & Heath ’97] • 2.1x speedup on Power 4 (caveat: SpMV not the bottleneck)
Source: Accelerator Cavity Design Problem (Ko via Husbands) Post-RCM Reordering 100x100 Submatrix Along Diagonal “Microscopic” Effect of RCM Reordering (before: green + red; after: green + blue) “Microscopic” Effect of Combined RCM+TSP Reordering (before: green + red; after: green + blue) (Omega3P)
Optimized Sparse Kernel Interface – OSKI • Provides sparse kernels automatically tuned for user’s matrix & machine – BLAS-style functionality: SpMV, Ax & ATy, TrSV – Hides complexity of run-time tuning – Includes new, faster locality-aware kernels: ATAx, Akx • Faster than standard implementations – Up to 4x faster matvec, 1.8x trisolve, 4x ATA*x • For “advanced” users & solver library writers – Available as stand-alone library (OSKI 1.0.1h, 6/07) – Available as PETSc extension (OSKI-PETSc .1d, 3/06) – bebop.cs.berkeley.edu/oski • Under development: pOSKI for multicore
How OSKI Tunes (Overview) • Install-time (offline): 1. Build for target architecture; 2. Benchmark → generated code variants + benchmark data • Run-time (library, called from the application): matrix + workload from program monitoring and history → 1. Evaluate heuristic models; 2. Select data structure & code → to user: matrix handle for kernel calls • Extensibility: advanced users may write & dynamically add “code variants” and “heuristic models” to the system
How OSKI Tunes (Overview) • At library build/install-time – Pre-generate and compile code variants into dynamic libraries – Collect benchmark data • Measures and records speed of possible sparse data structure and code variants on target architecture – Installation process uses standard, portable GNU AutoTools • At run-time – Library “tunes” using heuristic models • Models analyze user’s matrix & benchmark data to choose optimized data structure and code – Non-trivial tuning cost: up to ~40 mat-vecs • Library limits the time it spends tuning based on estimated workload – provided by user or inferred by library • User may reduce cost by saving tuning results for application on future runs with same or similar matrix
Optimizations in OSKI, so far • Fully automatic heuristics for – Sparse matrix-vector multiply • Register-level blocking • Register-level blocking + symmetry + multiple vectors • Cache-level blocking – Sparse triangular solve with register-level blocking and “switch-to-dense” optimization – Sparse ATA*x with register-level blocking • User may select other optimizations manually – Diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (Ak*x) – All available in dynamic libraries – Accessible via high-level embedded script language • “Plug-in” extensibility – Very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system • pOSKI under construction – Optimizations for multicore – more later
How to Call OSKI: Basic Usage • May gradually migrate existing apps – Step 1: “Wrap” existing data structures – Step 2: Make BLAS-like kernel calls (a sketch of a plain CSR my_matmult follows below) int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */ double* x = …, *y = …; /* Let x and y be two dense vectors */ /* Compute y = β·y + α·A·x, 500 times */ for( i = 0; i < 500; i++ ) my_matmult( ptr, ind, val, alpha, x, beta, y );
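As referenced above, a user’s pre-existing CSR kernel such as my_matmult might look like the following sketch. It is not part of OSKI; the num_rows argument is added here so the routine is self-contained, and alpha/beta are the usual BLAS-style scalars.

/* y = beta*y + alpha*A*x for A in CSR format (ptr, ind, val),
   mirroring the my_matmult(ptr, ind, val, alpha, x, beta, y) call
   in the slide; num_rows is passed explicitly here. */
void my_matmult(const int *ptr, const int *ind, const double *val,
                double alpha, const double *x, double beta, double *y,
                int num_rows)
{
    for (int i = 0; i < num_rows; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = beta * y[i] + alpha * t;
    }
}

The point of the migration path on the slides is that replacing this loop with oski_MatMult on a tuned matrix handle lets the library substitute a register-blocked or otherwise specialized data structure without further changes to the application.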
How to Call OSKI: Basic Usage • May gradually migrate existing apps – Step 1: “Wrap” existing data structures – Step 2: Make BLAS-like kernel calls int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */ double* x = …, *y = …; /* Let x and y be two dense vectors */ /* Step 1: Create OSKI wrappers around this data */ oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols, SHARE_INPUTMAT, …); oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE); oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE); /* Compute y = β·y + α·A·x, 500 times */ for( i = 0; i < 500; i++ ) my_matmult( ptr, ind, val, alpha, x, beta, y );
How to Call OSKI: Basic Usage • May gradually migrate existing apps – Step 1: “Wrap” existing data structures – Step 2: Make BLAS-like kernel calls int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */ double* x = …, *y = …; /* Let x and y be two dense vectors */ /* Step 1: Create OSKI wrappers around this data */ oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols, SHARE_INPUTMAT, …); oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE); oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE); /* Compute y = β·y + α·A·x, 500 times */ for( i = 0; i < 500; i++ ) oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); /* Step 2 */
How to Call OSKI: Tune with Explicit Hints • User calls “tune” routine – May provide explicit tuning hints (OPTIONAL) oski_matrix_t A_tunable = oski_CreateMatCSR( … ); /* … */ /* Tell OSKI we will call SpMV 500 times (workload hint) */ oski_SetHintMatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view, 500); /* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */ oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8); oski_TuneMat(A_tunable); /* Ask OSKI to tune */ for( i = 0; i < 500; i++ ) oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view);
How the User Calls OSKI: Implicit Tuning • Ask library to infer workload – Library profiles all kernel calls – May periodically re-tune oski_matrix_t A_tunable = oski_CreateMatCSR( … ); /* … */ for( i = 0; i < 500; i++ ) { oski_MatMult(A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); oski_TuneMat(A_tunable); /* Ask OSKI to tune */ }
Quick-and-dirty Parallelism: OSKI-PETSc • Extend PETSc’s distributed memory SpMV (MATMPIAIJ) • PETSc: each process (p0, p1, p2, p3 in the figure) stores diag (all-local) and off-diag submatrices • OSKI-PETSc: add OSKI wrappers; each submatrix tuned independently
OSKI-PETSc Proof-of-Concept Results • Matrix 1: Accelerator cavity design (R. Lee @ SLAC) – N ~ 1 M, ~40 M non-zeros – 2x2 dense block substructure – Symmetric • Matrix 2: Linear programming (Italian Railways) – Short-and-fat: 4k x 1M, ~11M non-zeros – Highly unstructured – Big speedup from cache-blocking: no native PETSc format • Evaluation machine: Xeon cluster – Peak: 4.8 Gflop/s per node Accelerator Cavity Matrix OSKI-PETSc Performance: Accel.
Cavity Linear Programming Matrix … OSKI-PETSc Performance: LP Matrix Tuning SpMV for Multicore Architectural Review Cost of memory access in Multicore • When physically partitioned, cache or memory access is non uniform (latency and bandwidth to memory/cache addresses varies) – NUMA = Non-Uniform Memory Access – NUCA = Non-Uniform Cache Access • UCA & UMA architecture: Core Core Core Core Cache DRAM Source: Sam Williams Cost of memory access in Multicore • When physically partitioned, cache or memory access is non uniform (latency and bandwidth to memory/cache addresses varies) – NUMA = Non-Uniform Memory Access – NUCA = Non-Uniform Cache Access • UCA & UMA architecture: Core Core Core Core Cache DRAM Source: Sam Williams Cost of Memory Access in Multicore • When physically partitioned, cache or memory access is non uniform (latency and bandwidth to memory/cache addresses varies) – NUMA = Non-Uniform Memory Access – NUCA = Non-Uniform Cache Access • NUCA & UMA architecture: Core Core Core Core Cache Cache Cache Cache Memory Controller Hub DRAM Source: Sam Williams Cost of Memory Access in Multicore • When physically partitioned, cache or memory access is non uniform (latency and bandwidth to memory/cache addresses varies) – NUMA = Non-Uniform Memory Access – NUCA = Non-Uniform Cache Access • NUCA & NUMA architecture: Core Core Core Core Cache Cache Cache Cache DRAM DRAM Source: Sam Williams Cost of Memory Access in Multicore • Proper cache locality to maximize performance Core Core Core Core Cache Cache Cache Cache DRAM DRAM Source: Sam Williams Cost of Memory Access in Multicore • Proper DRAM locality to maximize performance Core Core Core Core Cache Cache Cache Cache DRAM DRAM Source: Sam Williams Multicore SMPs of Interest Multicore SMPs Used Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Source: Sam Williams Multicore SMPs Used (Conventional cache-based memory hierarchy) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Source: Sam Williams Multicore SMPs Used (Local store-based memory hierarchy) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Source: Sam Williams Multicore SMPs Used (CMT = Chip-MultiThreading) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Source: Sam Williams Multicore SMPs Used (threads) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 8 threads 8 threads Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 128 threads 16* threads Source: Sam Williams *SPEs only Multicore SMPs Used (peak double precision flops) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 75 GFlop/s 74 Gflop/s Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 19 GFlop/s 29* GFlop/s Source: Sam Williams *SPEs only Multicore SMPs Used (Total DRAM bandwidth) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 21 GB/s (read) 10 GB/s (write) 21 GB/s Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 42 GB/s (read) 21 GB/s (write) 51 GB/s Source: Sam Williams *SPEs only Multicore SMPs Used (Non-Uniform Memory Access - NUMA) Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Source: Sam Williams *SPEs only Results from “Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)” Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, 
"Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007. Test matrices • • • Suite of 14 matrices All bigger than the caches of our SMPs We’ll also include a median performance number 2K x 2K Dense matrix stored in sparse format Dense Well Structured (sorted by nonzeros/row) Protein FEM / Spheres FEM / Cantilever FEM / Accelerator Circuit webbase Poorly Structured hodgepodge Extreme Aspect Ratio (linear programming) LP Source: Sam Williams Wind Tunnel FEM / Harbor QCD FEM / Ship Economics Epidemiology SpMV Parallelization • How do we parallelize a matrix-vector multiplication ? Source: Sam Williams SpMV Parallelization How do we parallelize a matrix-vector multiplication ? By rows blocks No inter-thread data dependencies, but random access to x thread 3 thread 2 thread 1 thread 0 • • • Source: Sam Williams SpMV Performance (simple parallelization) • • • • Out-of-the box SpMV performance on a suite of 14 matrices Simplest solution = parallelization by rows (solves data dependency challenge) Scalability isn’t great Is this performance good? Naïve Pthreads Naïve Source: Sam Williams NUMA (Data Locality for Matrices) • On NUMA architectures, all large arrays should be partitioned either – explicitly (multiple malloc()’s + affinity) – implicitly (parallelize initialization and rely on first touch) • You cannot partition on granularities less than the page size – 512 elements on x86 – 2M elements on Niagara • • For SpMV, partition the matrix and perform multiple malloc()’s Pin submatrices so they are co-located with the cores tasked to process them Source: Sam Williams NUMA (Data Locality for Matrices) Source: Sam Williams Prefetch for SpMV • SW prefetch injects more MLP into the memory subsystem. • Supplement HW prefetchers • Can try to prefetch the – – – – values indices source vector or any combination thereof • In general, should only insert one prefetch per cache line (works best on unrolled code) Source: Sam Williams for(all rows){ y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0; for(all tiles in this row){ PREFETCH(V+i+PFDistance); y0+=V[i ]*X[C[i]] y1+=V[i+1]*X[C[i]] y2+=V[i+2]*X[C[i]] y3+=V[i+3]*X[C[i]] } y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3; } SpMV Performance (NUMA and Software Prefetching) NUMA-aware allocation is essential on memorybound NUMA SMPs. Explicit software prefetching can boost bandwidth and change cache replacement policies Cell PPEs are likely latency-limited. used exhaustive search for best prefetch distance Source: Sam Williams Matrix Compression • Goal: minimize memory traffic • Register blocking – Choose block size to minimize memory traffic – Only power-of-2 block sizes – Simplifies search, achieves most of the possible speedup • Shorter indices – 32-bit, or 16-bit if possible • Different sparse matrix formats – BCSR – Block compressed sparse row • Like CSR but with register blocks – BCOO – Block coordinate • Stores row and column index of each register block • Better on very sparse sub-blocks (see cache blocking later) SpMV Performance (Matrix Compression) After maximizing memory bandwidth, the only hope is to minimize memory traffic. exploit: register blocking other formats smaller indices Use a traffic minimization heuristic rather than search Benefit is clearly matrix-dependent. Register blocking enables efficient software prefetching (one per cache line) Source: Sam Williams Cache blocking for SpMV The columns spanned by each cache block are selected to use same space in cache, i.e. 
access same number of x(i) • TLB blocking is a similar concept but instead of on 8 byte granularities, it uses 4KB granularities thread 1 • thread 2 Store entire submatrices contiguously thread 3 • thread 0 (Data Locality for Vectors) Source: Sam Williams Cache blocking for SpMV The columns spanned by each cache block are selected to use same space in cache, i.e. access same number of x(i) • TLB blocking is a similar concept but instead of on 8 byte granularities, it uses 4KB granularities thread 1 • thread 2 Store entire submatrices contiguously thread 3 • thread 0 (Data Locality for Vectors) Source: Sam Williams Auto-tuned SpMV Performance (cache and TLB blocking) • • Fully auto-tuned SpMV performance across the suite of matrices Why do some optimizations work better on some architectures? • matrices with naturally small working sets • architectures with giant caches +Cache/LS/TLB Blocking +Matrix Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve Source: Sam Williams Auto-tuned SpMV Performance (architecture specific optimizations) • • • Fully auto-tuned SpMV performance across the suite of matrices Included SPE/local store optimized version Why do some optimizations work better on some architectures? +Cache/LS/TLB Blocking +Matrix Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve Source: Sam Williams Auto-tuned SpMV Performance (max speedup) • 2.7x 2.9x 4.0x 35x • • Fully auto-tuned SpMV performance across the suite of matrices Included SPE/local store optimized version Why do some optimizations work better on some architectures? +Cache/LS/TLB Blocking +Matrix Compression +SW Prefetching +NUMA/Affinity Naïve Pthreads Naïve Source: Sam Williams Tuning Higher Level Algorithms than SpMV • We almost always do many SpMVs, not just one – “Krylov Subspace Methods” (KSMs) for Ax=b, Ax = λx • Conjugate Gradients, GMRES, Lanczos, … – Do a sequence of k SpMVs to get vectors [x1 , … , xk] – Find best solution x as linear combination of [x1 , … , xk] • Main cost is k SpMVs • Since communication usually dominates, can we do better? 
• Goal: make communication cost independent of k – Parallel case: O(log P) messages, not O(k log P) – optimal • Same bandwidth as before – Sequential case: O(1) messages and bandwidth, not O(k) – optimal • Achievable when the matrix is partitionable with low surface-to-volume ratio
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx] • Replace k iterations of y = Ax with [Ax, A²x, …, Aᵏx] • Example: A tridiagonal, n=32, k=3 (figure: levels x, A·x, A²·x, A³·x over columns 1 … 32) • Works for any “well-partitioned” A • Sequential algorithm: sweep over the matrix in Steps 1–4, computing all k levels for one block of columns before moving on, so each block of A is read from slow memory only once • Parallel algorithm: Procs 1–4 each own a contiguous block of columns and compute all k levels for their block • Entries in overlapping regions (triangles) are computed redundantly
Locally Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors (figure: levels x, Ax, …, A⁸x for Proc 1 and Proc 2) – these can be computed without communication, giving k=8-fold reuse of A
Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors – one message gets all the data needed to compute the remotely dependent entries, not k=8 messages
Fewer Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors – reduce the redundant work by half
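As a concrete (and much simplified) C/MPI sketch of the parallel approach just described for the tridiagonal example: each process owns a contiguous block of x, exchanges k ghost values with each neighbor once, and then applies the stencil k times locally, so the overlap regions are computed redundantly instead of being communicated at every step. The stencil tridiag(-1,2,-1), zero values beyond the global boundary, and all names below are illustrative assumptions, not the BeBOP implementation.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Compute the owned parts of A^j * x for j = 1..k, A = tridiag(-1,2,-1),
   with ONE halo exchange of k values per neighbor instead of one
   exchange per SpMV.  Each rank owns n_loc (>= k) consecutive entries
   of x; levels[j-1] must have room for n_loc doubles. */
void matrix_powers_tridiag(const double *x_loc, int n_loc, int k,
                           double **levels, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    int ext = n_loc + 2 * k;                      /* owned block + k ghosts per side */
    double *cur = calloc(ext, sizeof(double));    /* ghosts start out as zero        */
    double *nxt = calloc(ext, sizeof(double));
    memcpy(cur + k, x_loc, n_loc * sizeof(double));

    /* One communication step: my first/last k owned values become the
       neighbors' ghosts; boundary ranks keep zero ghosts (MPI_PROC_NULL). */
    MPI_Sendrecv(cur + k,         k, MPI_DOUBLE, left,  0,
                 cur + k + n_loc, k, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(cur + n_loc,     k, MPI_DOUBLE, right, 1,
                 cur,             k, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);

    /* k local stencil applications; the region of valid entries shrinks by
       one per level, and the work done in the ghost overlap is exactly the
       redundant computation shown as triangles on the slides. */
    for (int j = 1; j <= k; j++) {
        for (int i = j; i < ext - j; i++)
            nxt[i] = -cur[i-1] + 2.0 * cur[i] - cur[i+1];
        memcpy(levels[j-1], nxt + k, n_loc * sizeof(double));  /* owned part of A^j x */
        double *tmp = cur; cur = nxt; nxt = tmp;
    }
    free(cur);
    free(nxt);
}

A conventional implementation would perform k halo exchanges (one per level); here there is a single exchange, at the price of roughly k²/2 redundant stencil applications per processor boundary.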
Remotely Dependent Entries for [x, Ax, A²x, A³x], 2D Laplacian
Remotely Dependent Entries for [x, Ax, A²x, A³x], A irregular, multiple processors
Performance Results • Measured multicore (Clovertown) speedups up to 6.4x • Measured/modeled sequential out-of-core speedup up to 3x • Modeled parallel Petascale speedup up to 6.9x • Modeled parallel Grid speedup up to 22x • Sequential speedup due to bandwidth; works for many problem sizes • Parallel speedup due to latency; works for smaller problems on many processors
Speedups on Intel Clovertown (8 core)
Avoiding Communication in Iterative Linear Algebra • k steps of a typical iterative solver for sparse Ax=b or Ax=λx – Does k SpMVs with starting vector – Finds “best” solution among all linear combinations of these k+1 vectors – Many such “Krylov Subspace Methods” • Conjugate Gradients, GMRES, Lanczos, Arnoldi, … • Goal: minimize communication in Krylov Subspace Methods – Assume matrix “well-partitioned,” with modest surface-to-volume ratio – Parallel implementation • Conventional: O(k log p) messages, because of k calls to SpMV • New: O(log p) messages – optimal – Serial implementation • Conventional: O(k) moves of data from slow to fast memory • New: O(1) moves of data – optimal • Lots of speedup possible (modeled and measured) – Price: some redundant computation • Much prior work – See bebop.cs.berkeley.edu – CG: [van Rosendale, 83], [Chronopoulos and Gear, 89] – GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]
Minimizing Communication of GMRES to solve Ax=b • GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax−b||₂ • Cost of k steps of standard GMRES vs. new GMRES • Standard GMRES: for i=1 to k, w = A·v(i-1); MGS(w, v(0),…,v(i-1)); update v(i), H; endfor; solve LSQ problem with H – Sequential: #words_moved = O(k·nnz) from SpMV + O(k²·n) from MGS – Parallel: #messages = O(k) from SpMV + O(k²·log p) from MGS • Communication-avoiding GMRES: W = [ v, Av, A²v, …, Aᵏv ]; [Q,R] = TSQR(W) … “Tall Skinny QR”; build H from R, solve LSQ problem – Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR – Parallel: #messages = O(1) from computing W + O(log p) from TSQR • Oops – W is built like the power method, so precision is lost! (See the basis sketch below.)
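Written out in LaTeX notation, the problem and the fix discussed on the following slides look like this (the shifts θ₁, …, θₖ are parameters, often taken from Ritz values of A, and T_j denotes a scaled-and-shifted Chebyshev polynomial; this is a summary of the later slides, not a new result):

% Monomial basis: columns behave like power-method iterates, so W
% quickly becomes ill-conditioned and precision is lost.
W_{\text{monomial}} = [\, v,\; Av,\; A^2 v,\; \dots,\; A^k v \,]

% Newton basis: shifted products stay much better conditioned.
W_{\text{Newton}} = [\, v,\; (A-\theta_1 I)v,\; (A-\theta_2 I)(A-\theta_1 I)v,\; \dots \,]

% Chebyshev basis.
W_{\text{Chebyshev}} = [\, v,\; T_1(A)v,\; T_2(A)v,\; \dots \,]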
“Monomial” basis [Ax,…,Akx] fails to converge A different polynomial basis does converge Speed ups of GMRES on 8-core Intel Clovertown Requires co-tuning kernels [MHDY09] Extensions • Other Krylov methods – Arnoldi, CG, Lanczos, … • Preconditioning – Solve MAx=Mb where preconditioning matrix M chosen to make system “easier” • M approximates A-1 somehow, but cheaply, to accelerate convergence – Cheap as long as contributions from “distant” parts of the system can be compressed • Sparsity • Low rank – No implementations yet (class projects!) Design Space for [x,Ax,…,Akx] (1/3) • Mathematical Operation – How many vectors to keep • All: Krylov Subspace Methods • Keep last vector Akx only (Jacobi, Gauss Seidel) – Improving conditioning of basis • W = [x, p1(A)x, p2(A)x,…,pk(A)x] • pi(A) = degree i polynomial chosen to reduce cond(W) – Preconditioning (Ay=b MAy=Mb) • [x,Ax,MAx,AMAx,MAMAx,…,(MA)kx] Design Space for [x,Ax,…,Akx] (2/3) • Representation of sparse A – Zero pattern may be explicit or implicit – Nonzero entries may be explicit or implicit • Implicit save memory, communication Explicit pattern Implicit pattern Explicit nonzeros General sparse matrix Image segmentation Implicit nonzeros Laplacian(graph) Multigrid (AMR) “Stencil matrix” Ex: tridiag(-1,2,-1) • Representation of dense preconditioners M – Low rank off-diagonal blocks (semiseparable) Design Space for [x,Ax,…,Akx] (3/3) • Parallel implementation – From simple indexing, with redundant flops surface/volume ratio – To complicated indexing, with fewer redundant flops • Sequential implementation – Depends on whether vectors fit in fast memory • Reordering rows, columns of A – Important in parallel and sequential cases – Can be reduced to pair of Traveling Salesmen Problems • Plus all the optimizations for one SpMV! Summary • Communication-Avoiding Linear Algebra (CALA) • Lots of related work – Some going back to 1960’s – Reports discuss this comprehensively, not here • Our contributions – Several new algorithms, improvements on old ones • Preconditioning – Unifying parallel and sequential approaches to avoiding communication – Time for these algorithms has come, because of growing communication costs – Why avoid communication just for linear algebra motifs? Possible Class Projects • Come to BEBOP meetings (T 12:30 – 2, 511 Soda) • Experiment with SpMV on GPU – Which optimizations are most effective? • Try to speed up particular matrices of interest – Data mining • Experiment with new [x,Ax,…,Akx] kernel – GPU, multicore, distributed memory – On matrices of interest • “Bottom solver” in multigrid / AMR (Chombo) – Experiment with solvers using this kernel • New Krylov subspace methods, preconditioning • Experiment with new frameworks (SPF) • See proposals for details Extra Slides Optimizing Communication Complexity of Sparse Solvers • Need to modify high level algorithms to use new kernel • Example: GMRES for Ax=b where A= 2D Laplacian – x lives on n-by-n mesh – Partitioned on p½ -by- p½ processor grid – A has “5 point stencil” (Laplacian) • (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j)) – Ex: 18-by-18 mesh on 3-by-3 processor grid Minimizing Communication • What is the cost = (#flops, #words, #mess) of s steps of standard GMRES? 
GMRES, ver.1: for i=1 to s w = A * v(i-1) MGS(w, v(0),…,v(i-1)) update v(i), H endfor solve LSQ problem with H n/p½ n/p½ • Cost(A * v) = s * (9n2 /p, 4n / p½ , 4 ) • Cost(MGS = Modified Gram-Schmidt) = s2/2 * ( 4n2 /p , log p , log p ) • Total cost ~ Cost( A * v ) + Cost (MGS) • Can we reduce the latency? Minimizing Communication • Cost(GMRES, ver.1) = Cost(A*v) + Cost(MGS) = ( 9sn2 /p, 4sn / p½ , 4s ) + ( 2s2n2 /p , s2 log p / 2 , s2 log p / 2 ) • How much latency cost from A*v can you avoid? Almost all GMRES, ver. 2: W = [ v, Av, A2v, … , Asv ] [Q,R] = MGS(W) Build H from R, solve LSQ problem • Cost(W) = ( ~ same, ~ same , 8 ) • Latency cost independent of s – optimal • Cost (MGS) unchanged • Can we reduce the latency more? s=3 Minimizing Communication • Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS) = ( 9sn2 /p, 4sn / p½ , 8 ) + ( 2s2n2 /p , s2 log p / 2 , s2 log p / 2 ) • How much latency cost from MGS can you avoid? Almost all GMRES, ver. 3: W = [ v, Av, A2v, … , Asv ] [Q,R] = TSQR(W) … “Tall Skinny QR” (See Lecture 11) Build H from R, solve LSQ problem W= W1 W2 W3 W4 R1 R2 R3 R4 R12 R1234 R34 • Cost(TSQR) = ( ~ same, ~ same , log p ) • Latency cost independent of s - optimal Minimizing Communication • Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS) = ( 9sn2 /p, 4sn / p½ , 8 ) + ( 2s2n2 /p , s2 log p / 2 , s2 log p / 2 ) • How much latency cost from MGS can you avoid? Almost all GMRES, ver. 3: W = [ v, Av, A2v, … , Asv ] [Q,R] = TSQR(W) … “Tall Skinny QR” (See Lecture 11) Build H from R, solve LSQ problem W1 W = W2 W3 W4 R1 R2 R3 R4 R12 R1234 R34 • Cost(TSQR) = ( ~ same, ~ same , log p ) • Oops Minimizing Communication • Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS) = ( 9sn2 /p, 4sn / p½ , 8 ) + ( 2s2n2 /p , s2 log p / 2 , s2 log p / 2 ) • How much latency cost from MGS can you avoid? Almost all GMRES, ver. 3: W = [ v, Av, A2v, … , Asv ] [Q,R] = TSQR(W) … “Tall Skinny QR” (See Lecture 11) Build H from R, solve LSQ problem W1 W = W2 W3 W4 R1 R2 R3 R4 R12 R1234 R34 • Cost(TSQR) = ( ~ same, ~ same , log p ) • Oops – W from power method, precision lost! Minimizing Communication • Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR) = ( 9sn2 /p, 4sn / p½ , 8 ) + ( 2s2n2 /p , s2 log p / 2 , log p ) • Latency cost independent of s, just log p – optimal • Oops – W from power method, so precision lost – What to do? • Use a different polynomial basis • Not Monomial basis W = [v, Av, A2v, …], instead … • Newton Basis WN = [v, (A – θ1 I)v , (A – θ2 I)(A – θ1 I)v, …] or • Chebyshev Basis WC = [v, T1(v), T2(v), …] Tuning Higher Level Algorithms • So far we have tuned a single sparse matrix kernel • y = AT*A*x motivated by higher level algorithm (SVD) • What can we do by extending tuning to a higher level? 
• Consider Krylov subspace methods for Ax=b, Ax = lx • Conjugate Gradients (CG), GMRES, Lanczos, … • Inner loop does y=A*x, dot products, saxpys, scalar ops • Inner loop costs at least O(1) messages • k iterations cost at least O(k) messages • Our goal: show how to do k iterations with O(1) messages • Possible payoff – make Krylov subspace methods much faster on machines with slow networks • Memory bandwidth improvements too (not discussed) • Obstacles: numerical stability, preconditioning, … Krylov Subspace Methods for Solving Ax=b • Compute a basis for a subspace V by doing y = A*x k times • Find “best” solution in that Krylov subspace V • Given starting vector x1, V spanned by x2 = A*x1, x3 = A*x2 , … , xk = A*xk-1 • GMRES finds an orthogonal basis of V by Gram-Schmidt, so it actually does a different set of SpMVs than in last bullet Example: Standard GMRES r = b - A*x1, = length(r), v1 = r / … length(r) = sqrt(S ri2 ) for m= 1 to k do w = A*vm … at least O(1) messages for i = 1 to m do … Gram-Schmidt him = dotproduct(vi , w ) … at least O(1) messages, or log(p) w = w – h im * vi end for hm+1,m = length(w) … at least O(1) messages, or log(p) vm+1 = w / hm+1,m end for find y minimizing length( Hk * y – e1 ) … small, local problem new x = x1 + Vk * y … Vk = [v1 , v2 , … , vk ] O(k2), or O(k2 log p), messages altogether Example: Computing [Ax,A2x,A3x,…,Akx] for A tridiagonal Different basis for same Krylov subspace What can Proc 1 compute without communication? Proc 2 Proc 1 . . . (A8x)(1:30): (A2x)(1:30): (Ax)(1:30): x(1:30): Example: Computing [Ax,A2x,A3x,…,Akx] for A tridiagonal Computing missing entries with 1 communication, redundant work Proc 1 . . . (A8x)(1:30): (A2x)(1:30): (Ax)(1:30): x(1:30): Proc 2 Example: Computing [Ax,A2x,A3x,…,Akx] for A tridiagonal Saving half the redundant work Proc 1 . . . 
(A8x)(1:30): (A2x)(1:30): (Ax)(1:30): x(1:30): Proc 2 Example: Computing [Ax,A2x,A3x,…,Akx] for Laplacian A = 5pt Laplacian in 2D, Communicated point for k=3 shown Latency-Avoiding GMRES (1) r = b - A*x1, = length(r), v1 = r / … O(log p) messages Wk+1 = [v1 , A * v1 , A2 * v1 , … , Ak * v1 ] … O(1) messages [Q, R] = qr(Wk+1) … QR decomposition, O(log p) messages Hk = R(:, 2:k+1) * (R(1:k,1:k))-1 … small, local problem find y minimizing length( Hk * y – e1 ) … small, local problem new x = x1 + Qk * y … local problem O(log p) messages altogether Independent of k Latency-Avoiding GMRES (2) [Q, R] = qr(Wk+1) … QR decomposition, O(log p) messages Easy, but not so stable way to do it: X(myproc) = Wk+1T(myproc) * Wk+1 (myproc) … local computation Y = sum_reduction(X(myproc)) … O(log p) messages … Y = Wk+1T* Wk+1 R = (cholesky(Y))T … small, local computation Q(myproc) = Wk+1 (myproc) * R-1 … local computation Numerical example (1) Diagonal matrix with n=1000, Aii from 1 down to 10-5 Instability as k grows, after many iterations Numerical Example (2) Partial remedy: restarting periodically (every 120 iterations) Other remedies: high precision, different basis than [v , A * v , … , Ak * v ] Operation Counts for [Ax,A2x,A3x,…,Akx] on p procs Problem Per-proc cost Standard Optimized 1D mesh #messages 2k 2 2k 2k #flops 5kn/p 5kn/p + 5k2 memory (k+4)n/p (k+4)n/p + 8k #messages 26k 26 6kn2p-2/3 + 12knp-1/3 + O(k) 6kn2p-2/3 + 12k2np-1/3 + O(k3) #flops 53kn3/p 53kn3/p + O(k2n2p-2/3) memory (k+28)n3/p+ 6n2p-2/3 + O(np-1/3) (k+28)n3/p + 168kn2p-2/3 + O(k2np-1/3) (tridiagonal) #words sent 3D mesh 27 pt stencil #words sent Summary and Future Work • Dense – LAPACK – ScaLAPACK • Communication primitives • Sparse – Kernels, Stencils – Higher level algorithms • All of the above on new architectures – Vector, SMPs, multicore, Cell, … • High level support for tuning – Specification language – Integration into compilers Extra Slides A Sparse Matrix You Encounter Every Day Who am I? I am a Big Repository Of useful And useless Facts alike. Who am I? (Hint: Not your e-mail inbox.) What about the Google Matrix? • Google approach – Approx. once a month: rank all pages using connectivity structure • Find dominant eigenvector of a matrix – At query-time: return list of pages ordered by rank • Matrix: A = aG + (1-a)(1/n)uuT – Markov model: Surfer follows link with probability a, jumps to a random page with probability 1-a – G is n x n connectivity matrix [n billions] • gij is non-zero if page i links to page j • Normalized so each column sums to 1 • Very sparse: about 7—8 non-zeros per row (power law dist.) – u is a vector of all 1 values – Steady-state probability xi of landing on page i is solution to x = Ax • Approximate x by power method: x = Akx0 – In practice, k 25 Current Work • Public software release • Impact on library designs: Sparse BLAS, Trilinos, PETSc, … • Integration in large-scale applications – DOE: Accelerator design; plasma physics – Geophysical simulation based on Block Lanczos (ATA*X; LBL) • Systematic heuristics for data structure selection? • Evaluation of emerging architectures – Revisiting vector micros • Other sparse kernels – Matrix triple products, Ak*x • Parallelism • Sparse benchmarks (with UTK) [Gahvari & Hoemmen] • Automatic tuning of MPI collective ops [Nishtala, et al.] Summary of High-Level Themes • “Kernel-centric” optimization – Vs. 
basic block, trace, path optimization, for instance – Aggressive use of domain-specific knowledge • Performance bounds modeling – Evaluating software quality – Architectural characterizations and consequences • Empirical search – Hybrid on-line/run-time models • Statistical performance models – Exploit information from sampling, measuring Related Work • My bibliography: 337 entries so far • Sample area 1: Code generation – Generative & generic programming – Sparse compilers – Domain-specific generators • Sample area 2: Empirical search-based tuning – Kernel-centric • linear algebra, signal processing, sorting, MPI, … – Compiler-centric • profiling + FDO, iterative compilation, superoptimizers, selftuning compilers, continuous program optimization Future Directions (A Bag of Flaky Ideas) • Composable code generators and search spaces • New application domains – PageRank: multilevel block algorithms for topic-sensitive search? • New kernels: cryptokernels – rich mathematical structure germane to performance; lots of hardware • New tuning environments – Parallel, Grid, “whole systems” • Statistical models of application performance – Statistical learning of concise parametric models from traces for architectural evaluation – Compiler/automatic derivation of parametric models Possible Future Work • Different Architectures – New FP instruction sets (SSE2) – SMP / multicore platforms – Vector architectures • Different Kernels • Higher Level Algorithms – Parallel versions of kenels, with optimized communication – Block algorithms (eg Lanczos) – XBLAS (extra precision) • Produce Benchmarks – Augment HPCC Benchmark • Make it possible to combine optimizations easily for any kernel • Related tuning activities (LAPACK & ScaLAPACK) Review of Tuning by Illustration (Extra Slides) Splitting for Variable Blocks and Diagonals • Decompose A = A1 + A2 + … At – – – – Detect “canonical” structures (sampling) Split Tune each Ai Improve performance and save storage • New data structures – Unaligned block CSR • Relax alignment in rows & columns – Row-segmented diagonals Example: Variable Block Row (Matrix #12) 2.1x over CSR 1.8x over RB Example: Row-Segmented Diagonals 2x over CSR Mixed Diagonal and Block Structure Summary • Automated block size selection – Empirical modeling and search – Register blocking for SpMV, triangular solve, ATA*x • Not fully automated – Given a matrix, select splittings and transformations • Lots of combinatorial problems – TSP reordering to create dense blocks (Pinar ’97; Moon, et al. ’04) Extra Slides A Sparse Matrix You Encounter Every Day Who am I? I am a Big Repository Of useful And useless Facts alike. Who am I? (Hint: Not your e-mail inbox.) Problem Context • Sparse kernels abound – Models of buildings, cars, bridges, economies, … – Google PageRank algorithm • Historical trends – – – – Sparse matrix-vector multiply (SpMV): 10% of peak 2x faster with “hand-tuning” Tuning becoming more difficult over time Promise of automatic tuning: PHiPAC/ATLAS, FFTW, … • Challenges to high-performance – Not dense linear algebra! • Complex data structures: indirect, irregular memory access • Performance depends strongly on run-time inputs Key Questions, Ideas, Conclusions • How to tune basic sparse kernels automatically? – Empirical modeling and search • • • • Up to 4x speedups for SpMV 1.8x for triangular solve 4x for ATA*x; 2x for A2*x 7x for multiple vectors • What are the fundamental limits on performance? 
– Kernel-, machine-, and matrix-specific upper bounds • Achieve 75% or more for SpMV, limiting low-level tuning • Consequences for architecture? • General techniques for empirical search-based tuning? – Statistical models of performance Road Map • • • • • Sparse matrix-vector multiply (SpMV) in a nutshell Historical trends and the need for search Automatic tuning techniques Upper bounds on performance Statistical models of performance Compressed Sparse Row (CSR) Storage Matrix-vector multiply kernel: y(i) y(i) + A(i,j)*x(j) for each row i for k=ptr[i] to ptr[i+1] do y[i] = y[i] + val[k]*x[ind[k]] Road Map • • • • • Sparse matrix-vector multiply (SpMV) in a nutshell Historical trends and the need for search Automatic tuning techniques Upper bounds on performance Statistical models of performance Historical Trends in SpMV Performance • The Data – – – – Uniprocessor SpMV performance since 1987 “Untuned” and “Tuned” implementations Cache-based superscalar micros; some vectors LINPACK SpMV Historical Trends: Mflop/s Example: The Difficulty of Tuning • n = 21216 • nnz = 1.5 M • kernel: SpMV • Source: NASA structural analysis problem Still More Surprises • More complicated non-zero structure in general Still More Surprises • More complicated non-zero structure in general • Example: 3x3 blocking – Logical grid of 3x3 cells Historical Trends: Mixed News • Observations – – – – Good news: Moore’s law like behavior Bad news: “Untuned” is 10% peak or less, worsening Good news: “Tuned” roughly 2x better today, and improving Bad news: Tuning is complex – (Not really news: SpMV is not LINPACK) • Questions – Application: Automatic tuning? – Architect: What machines are good for SpMV? Road Map • Sparse matrix-vector multiply (SpMV) in a nutshell • Historical trends and the need for search • Automatic tuning techniques – SpMV [SC’02; IJHPCA ’04b] – Sparse triangular solve (SpTS) [ICS/POHLL ’02] – ATA*x [ICCS/WoPLA ’03] • Upper bounds on performance • Statistical models of performance SPARSITY: Framework for Tuning SpMV • SPARSITY: Automatic tuning for SpMV [Im & Yelick ’99] – General approach • Identify and generate implementation space • Search space using empirical models & experiments – Prototype library and heuristic for choosing register block size • Also: cache-level blocking, multiple vectors • What’s new? 
– New block size selection heuristic • Within 10% of optimal — replaces previous version – Expanded implementation space • Variable block splitting, diagonals, combinations – New kernels: sparse triangular solve, ATA*x, Ar*x Automatic Register Block Size Selection • Selecting the r x c block size – Off-line benchmark: characterize the machine • Precompute Mflops(r,c) using dense matrix for each r x c • Once per machine/architecture – Run-time “search”: characterize the matrix • Sample A to estimate Fill(r,c) for each r x c – Run-time heuristic model • Choose r, c to maximize Mflops(r,c) / Fill(r,c) • Run-time costs – Up to ~40 SpMVs (empirical worst case) DGEMV Accuracy of the Tuning Heuristics (1/4) NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”) DGEMV Accuracy of the Tuning Heuristics (2/4) DGEMV Accuracy of the Tuning Heuristics (3/4) DGEMV Accuracy of the Tuning Heuristics (4/4) Road Map • • • • Sparse matrix-vector multiply (SpMV) in a nutshell Historical trends and the need for search Automatic tuning techniques Upper bounds on performance – SC’02 • Statistical models of performance Motivation for Upper Bounds Model • Questions – Speedups are good, but what is the speed limit? • Independent of instruction scheduling, selection – What machines are “good” for SpMV? Upper Bounds on Performance: Blocked SpMV • P = (flops) / (time) – Flops = 2 * nnz(A) • Lower bound on time: Two main assumptions – 1. Count memory ops only (streaming) – 2. Count only compulsory, capacity misses: ignore conflicts • Account for line sizes • Account for matrix size and nnz • Charge min access “latency” ai at Li cache & amem – e.g., Saavedra-Barrera and PMaC MAPS benchmarks Time a i Hits i a mem Hits mem i 1 a1 Loads (a i 1 a i ) Misses i (a mem a ) Misses i 1 Example: Bounds on Itanium 2 Example: Bounds on Itanium 2 Example: Bounds on Itanium 2 Fraction of Upper Bound Across Platforms Achieved Performance and Machine Balance • Machine balance [Callahan ’88; McCalpin ’95] – Balance = Peak Flop Rate / Bandwidth (flops / double) • Ideal balance for mat-vec: 2 flops / double – For SpMV, even less Time a1 Loads (a i 1 a i ) Misses i (a mem a ) Misses i • SpMV ~ streaming – 1 / (avg load time to stream 1 array) ~ (bandwidth) – “Sustained” balance = peak flops / model bandwidth Where Does the Time Go? Time a i Hits i a mem Hits mem i 1 • Most time assigned to memory • Caches “disappear” when line sizes are equal – Strictly increasing line sizes Execution Time Breakdown: Matrix 40 Speedups with Increasing Line Size Summary: Performance Upper Bounds • What is the best we can do for SpMV? – Limits to low-level tuning of blocked implementations – Refinements? • What machines are good for SpMV? – Partial answer: balance characterization • Architectural consequences? 
– Example: Strictly increasing line sizes Road Map • • • • • • Sparse matrix-vector multiply (SpMV) in a nutshell Historical trends and the need for search Automatic tuning techniques Upper bounds on performance Tuning other sparse kernels Statistical models of performance – FDO ’00; IJHPCA ’04a Statistical Models for Automatic Tuning • Idea 1: Statistical criterion for stopping a search – A general search model • Generate implementation • Measure performance • Repeat – Stop when probability of being within e of optimal falls below threshold • Can estimate distribution on-line • Idea 2: Statistical performance models – Problem: Choose 1 among m implementations at run-time – Sample performance off-line, build statistical model Example: Select a Matmul Implementation Example: Support Vector Classification Road Map • • • • • • • Sparse matrix-vector multiply (SpMV) in a nutshell Historical trends and the need for search Automatic tuning techniques Upper bounds on performance Tuning other sparse kernels Statistical models of performance Summary and Future Work Summary of High-Level Themes • “Kernel-centric” optimization – Vs. basic block, trace, path optimization, for instance – Aggressive use of domain-specific knowledge • Performance bounds modeling – Evaluating software quality – Architectural characterizations and consequences • Empirical search – Hybrid on-line/run-time models • Statistical performance models – Exploit information from sampling, measuring Related Work • My bibliography: 337 entries so far • Sample area 1: Code generation – Generative & generic programming – Sparse compilers – Domain-specific generators • Sample area 2: Empirical search-based tuning – Kernel-centric • linear algebra, signal processing, sorting, MPI, … – Compiler-centric • profiling + FDO, iterative compilation, superoptimizers, selftuning compilers, continuous program optimization Future Directions (A Bag of Flaky Ideas) • Composable code generators and search spaces • New application domains – PageRank: multilevel block algorithms for topic-sensitive search? • New kernels: cryptokernels – rich mathematical structure germane to performance; lots of hardware • New tuning environments – Parallel, Grid, “whole systems” • Statistical models of application performance – Statistical learning of concise parametric models from traces for architectural evaluation – Compiler/automatic derivation of parametric models Acknowledgements • Super-advisors: Jim and Kathy • Undergraduate R.A.s: Attila, Ben, Jen, Jin, Michael, Rajesh, Shoaib, Sriram, Tuyet-Linh • See pages xvi—xvii of dissertation. TSP-based Reordering: Before (Pinar ’97; Moon, et al ‘04) TSP-based Reordering: After (Pinar ’97; Moon, et al ‘04) Up to 2x speedups over CSR Example: L2 Misses on Itanium 2 Misses measured using PAPI [Browne ’00] Example: Distribution of Blocked Non-Zeros Register Profile: Itanium 2 1190 Mflop/s 190 Mflop/s Ultra 2i - 11% 72 Mflop/s Ultra 3 - 5% 90 Mflop/s Register Profiles: Sun and Intel x86 35 Mflop/s Pentium III - 21% 108 Mflop/s 42 Mflop/s 50 Mflop/s Pentium III-M - 15% 122 Mflop/s 58 Mflop/s Power3 - 17% 252 Mflop/s Power4 - 16% 820 Mflop/s Register Profiles: IBM and Intel IA-64 122 Mflop/s Itanium 1 - 8% 247 Mflop/s 107 Mflop/s 459 Mflop/s Itanium 2 - 33% 1.2 Gflop/s 190 Mflop/s Accurate and Efficient Adaptive Fill Estimation • Idea: Sample matrix – Fraction of matrix to sample: s [0,1] – Cost ~ O(s * nnz) – Control cost by controlling s • Search at run-time: the constant matters! 
• Control s automatically by computing statistical confidence intervals – Idea: Monitor variance • Cost of tuning – Lower bound: convert matrix in 5 to 40 unblocked SpMVs – Heuristic: 1 to 11 SpMVs Sparse/Dense Partitioning for SpTS • Partition L into sparse (L1,L2) and dense LD: L1 L2 x1 b1 LD x2 b2 • Perform SpTS in three steps: L1 x1 b1 (2) bˆ2 b2 L2 x1 (3) L x bˆ (1) D 2 2 • Sparsity optimizations for (1)—(2); DTRSV for (3) • Tuning parameters: block size, size of dense triangle SpTS Performance: Power3 Summary of SpTS and AAT*x Results • SpTS — Similar to SpMV – 1.8x speedups; limited benefit from low-level tuning • AATx, ATAx – Cache interleaving only: up to 1.6x speedups – Reg + cache: up to 4x speedups • 1.8x speedup over register only – Similar heuristic; same accuracy (~ 10% optimal) – Further from upper bounds: 60—80% • Opportunity for better low-level tuning a la PHiPAC/ATLAS • Matrix triple products? Ak*x? – Preliminary work Register Blocking: Speedup Register Blocking: Performance Register Blocking: Fraction of Peak Example: Confidence Interval Estimation Costs of Tuning Splitting + UBCSR: Pentium III Splitting + UBCSR: Power4 Splitting+UBCSR Storage: Power4 Example: Variable Block Row (Matrix #13) Dense Tuning is Hard, Too • Even dense matrix multiply can be notoriously difficult to tune Dense matrix multiply: surprising performance as register tile size varies. Preliminary Results (Matrix Set 2): Itanium 2 Web/IR Dense FEM FEM (var) Bio Econ Stat LP Multiple Vector Performance What about the Google Matrix? • Google approach – Approx. once a month: rank all pages using connectivity structure • Find dominant eigenvector of a matrix – At query-time: return list of pages ordered by rank • Matrix: A = aG + (1-a)(1/n)uuT – Markov model: Surfer follows link with probability a, jumps to a random page with probability 1-a – G is n x n connectivity matrix [n 3 billion] • gij is non-zero if page i links to page j • Normalized so each column sums to 1 • Very sparse: about 7—8 non-zeros per row (power law dist.) – u is a vector of all 1 values – Steady-state probability xi of landing on page i is solution to x = Ax • Approximate x by power method: x = Akx0 – In practice, k 25 MAPS Benchmark Example: Power4 MAPS Benchmark Example: Itanium 2 Saavedra-Barrera Example: Ultra 2i Summary of Results: Pentium III Summary of Results: Pentium III (3/3) Execution Time Breakdown (PAPI): Matrix 40 Preliminary Results (Matrix Set 1): Itanium 2 Dense FEM FEM (var) Assorted LP Tuning Sparse Triangular Solve (SpTS) • Compute x=L-1*b where L sparse lower triangular, x & b dense • L from sparse LU has rich dense substructure – Dense trailing triangle can account for 20—90% of matrix non-zeros • SpTS optimizations – Split into sparse trapezoid and dense trailing triangle – Use tuned dense BLAS (DTRSV) on dense triangle – Use Sparsity register blocking on sparse part • Tuning parameters – Size of dense trailing triangle – Register block size Sparse Kernels and Optimizations • Kernels – – – – Sparse matrix-vector multiply (SpMV): y=A*x Sparse triangular solve (SpTS): x=T-1*b – – – – – – Register blocking Cache blocking Multiple dense vectors (x) A has special structure (e.g., symmetric, banded, …) Hybrid data structures (e.g., splitting, switch-to-dense, …) Matrix reordering y=AAT*x, y=ATA*x Powers (y=Ak*x), sparse triple-product (R*A*RT), … • Optimization techniques (implementation space) • How and when do we search? 
Sparse Kernels and Optimizations • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AA^T*x, y=A^T*A*x – Powers (y=A^k*x), sparse triple-product (R*A*R^T), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: Benchmark implementations – Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
Cache Blocked SpMV on LSI Matrix: Ultra 2i • A: 10k x 255k, 3.7M non-zeros • Baseline: 16 Mflop/s • Best block size & performance: 16k x 64k, 28 Mflop/s
Cache Blocking on LSI Matrix: Pentium 4 • A: 10k x 255k, 3.7M non-zeros • Baseline: 44 Mflop/s • Best block size & performance: 16k x 16k, 210 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium • A: 10k x 255k, 3.7M non-zeros • Baseline: 25 Mflop/s • Best block size & performance: 16k x 32k, 72 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium 2 • A: 10k x 255k, 3.7M non-zeros • Baseline: 170 Mflop/s • Best block size & performance: 16k x 65k, 275 Mflop/s
Inter-Iteration Sparse Tiling (1/3) [Graph: node columns x1–x5, t1–t5, y1–y5] • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A^2*x – t=A*x, y=A*t • Nodes: vector elements • Edges: matrix elements a_ij
Inter-Iteration Sparse Tiling (2/3) • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A^2*x – t=A*x, y=A*t • Nodes: vector elements • Edges: matrix elements a_ij • Orange = everything needed to compute y1 – Reuse a_11, a_12
Inter-Iteration Sparse Tiling (3/3) • [Strout, et al., ’01] • Let A be 6x6 tridiagonal • Consider y=A^2*x – t=A*x, y=A*t • Nodes: vector elements • Edges: matrix elements a_ij • Orange = everything needed to compute y1 – Reuse a_11, a_12 • Grey = y2, y3 – Reuse a_23, a_33, a_43
Inter-Iteration Sparse Tiling: Issues • Tile sizes (colored regions) grow with no. of iterations and increasing out-degree – G likely to have a few nodes with high out-degree (e.g., Yahoo) • Mathematical tricks to limit tile size? – Judicious dropping of edges [Ng ’01]
Summary and Questions • Need to understand matrix structure and machine – BeBOP: suite of techniques to deal with different sparse structures and architectures • Google matrix problem – Established techniques within an iteration – Ideas for inter-iteration optimizations – Mathematical structure of problem may help • Questions – Structure of G? – What are the computational bottlenecks? – Enabling future computations? • E.g., topic-sensitive PageRank multiple vector version [Haveliwala ’02] – See www.cs.berkeley.edu/~richie/bebop/intel/google for more info, including more complete Itanium 2 results.
Exploiting Matrix Structure • Symmetry (numerical or structural) – Reuse matrix entries – Can combine with register blocking, multiple vectors, … • Matrix splitting – Split the matrix, e.g., into r x c and 1 x 1 – No fill overhead • Large matrices with random structure – E.g., Latent Semantic Indexing (LSI) matrices – Technique: cache blocking • Store matrix as 2^i x 2^j sparse submatrices • Effective when x vector is large • Currently, search to find fastest size
Symmetric SpMV Performance: Pentium 4
SpMV with Split Matrices: Ultra 2i
Cache Blocking on Random Matrices: Itanium • Speedup on four banded random matrices.
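As a sketch of the cache-blocking technique measured in the LSI slides above, the code below stores A as a list of sparse submatrices, each with its own CSR arrays and row/column offsets, and processes them one at a time so the pieces of x and y a block touches stay cache resident. The cache_block struct and function name are invented for this illustration; block dimensions such as 16k x 64k would come from the search described on the slides.

/* Cache-blocked SpMV: y += A*x with A stored as a grid of sparse submatrices. */
typedef struct {
    int row0, col0;          /* offset of this submatrix within A       */
    int nrows;               /* rows in this submatrix                  */
    const int *ptr, *ind;    /* local CSR arrays of the submatrix       */
    const double *val;
} cache_block;

void spmv_cache_blocked(int nblocks, const cache_block *blk,
                        const double *x, double *y)
{
    for (int b = 0; b < nblocks; b++) {
        const cache_block *B = &blk[b];
        const double *xb = x + B->col0;   /* x entries reused within the block */
        double *yb = y + B->row0;
        for (int i = 0; i < B->nrows; i++)
            for (int k = B->ptr[i]; k < B->ptr[i + 1]; k++)
                yb[i] += B->val[k] * xb[B->ind[k]];
    }
}

The point of the layout is that column indices inside a block can address only a cache-sized window of x, which is why the technique pays off on large matrices with random structure such as the LSI matrix.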
Sparse Kernels and Optimizations • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AA^T*x, y=A^T*A*x – Powers (y=A^k*x), sparse triple-product (R*A*R^T), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: Benchmark implementations – Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
Register Blocked SpMV: Pentium III
Register Blocked SpMV: Ultra 2i
Register Blocked SpMV: Power3
Register Blocked SpMV: Itanium
Possible Optimization Techniques • Within an iteration, i.e., computing (G+uu^T)*x once – Cache block G*x • On linear programming matrices and matrices with random structure (e.g., LSI), 1.5—4x speedups • Best block size is matrix and machine dependent – Reordering and/or splitting of G to separate dense structure (rows, columns, blocks) • Between iterations, e.g., (G+uu^T)^2*x – (G+uu^T)^2*x = G^2*x + (Gu)(u^T*x) + u(u^T*G)*x + u(u^T*u)(u^T*x) • Compute Gu, u^T*G, u^T*u once for all iterations • G^2*x: Inter-iteration tiling to read G only once
Multiple Vector Performance: Itanium
Sparse Kernels and Optimizations • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AA^T*x, y=A^T*A*x – Powers (y=A^k*x), sparse triple-product (R*A*R^T), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: Benchmark implementations – Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
SpTS Performance: Itanium (See POHLL ’02 workshop paper, at ICS ’02.)
Sparse Kernels and Optimizations • Kernels – Sparse matrix-vector multiply (SpMV): y=A*x – Sparse triangular solve (SpTS): x=T^-1*b – y=AA^T*x, y=A^T*A*x – Powers (y=A^k*x), sparse triple-product (R*A*R^T), … • Optimization techniques (implementation space) – Register blocking – Cache blocking – Multiple dense vectors (x) – A has special structure (e.g., symmetric, banded, …) – Hybrid data structures (e.g., splitting, switch-to-dense, …) – Matrix reordering • How and when do we search? – Off-line: Benchmark implementations – Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
Optimizing AA^T*x • Kernel: y=AA^T*x, where A is sparse, x & y dense – Arises in linear programming, computation of SVD – Conventional implementation: compute z=A^T*x, then y=A*z • Elements of A can be reused: writing A = [a1 … an] by columns, y = A*A^T*x = Σ_{k=1..n} a_k*(a_k^T*x) • When the a_k represent blocks of columns, can apply register blocking.
Optimized AA^T*x Performance: Pentium III
Current Directions • Applying new optimizations – Other split data structures (variable block, diagonal, …) – Matrix reordering to create block structure – Structural symmetry • New kernels (triple product R*A*R^T, powers A^k, …) • Tuning parameter selection • Building an automatically tuned sparse matrix library – Extending the Sparse BLAS – Leverage existing sparse compilers as code generation infrastructure – More thoughts on this topic tomorrow
Related Work • Automatic performance tuning systems – PHiPAC [Bilmes, et al., ’97], ATLAS [Whaley & Dongarra ’98] – FFTW [Frigo & Johnson ’98], SPIRAL [Pueschel, et al., ’00], UHFFT [Mirkovic and Johnsson ’00] – MPI collective operations [Vadhiyar & Dongarra ’01] • Code generation – FLAME [Gunnels & van de Geijn, ’01] – Sparse compilers: [Bik ’99], Bernoulli [Pingali, et al., ’97] – Generic programming: Blitz++ [Veldhuizen ’98], MTL [Siek & Lumsdaine ’98], GMCL [Czarnecki, et al. ’98], … • Sparse performance modeling – [Temam & Jalby ’92], [White & Sadayappan ’97], [Navarro, et al., ’96], [Heras, et al., ’99], [Fraguela, et al., ’99], …
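To make the register blocking optimization behind the "Register Blocked SpMV" profiles above concrete, here is a minimal 2x2 BCSR kernel sketch: one column index is stored per 2x2 block and the block multiply is unrolled so the two partial sums stay in registers. A real generator (as in Sparsity/OSKI) emits one such routine per (r,c); the storage layout and names here are assumptions of this example, and the row count is assumed to be a multiple of 2 (with padding otherwise).

/* y += A*x with A in 2x2 block compressed sparse row (BCSR) format. */
void spmv_bcsr_2x2(int mb,                  /* number of 2-row block rows      */
                   const int *bptr,         /* block-row pointers, length mb+1 */
                   const int *bind,         /* block column indices (in blocks)*/
                   const double *bval,      /* 4 values per block, row-major   */
                   const double *x, double *y)
{
    for (int I = 0; I < mb; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];   /* register accumulators */
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b  = bval + 4 * k;
            const double *xp = x + 2 * bind[k];
            double x0 = xp[0], x1 = xp[1];
            y0 += b[0] * x0 + b[1] * x1;           /* fully unrolled 2x2 multiply */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}

Storing one index per block instead of one per nonzero is what cuts index storage and memory traffic; explicit zeros filled into partially full blocks are the "fill" that the run-time heuristic trades off against the faster inner loop.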
More Related Work • Compiler analysis, models – CROPS [Carter, Ferrante, et al.]; Serial sparse tiling [Strout ’01] – TUNE [Chatterjee, et al.] – Iterative compilation [O’Boyle, et al., ’98] – Broadway compiler [Guyer & Lin, ’99] – [Brewer ’95], ADAPT [Voss ’00] • Sparse BLAS interfaces – BLAST Forum (Chapter 3) – NIST Sparse BLAS [Remington & Pozo ’94]; SparseLib++ – SPARSKIT [Saad ’94] – Parallel Sparse BLAS [Filippone, et al. ’96]
Context: Creating High-Performance Libraries • Application performance dominated by a few computational kernels • Today: Kernels hand-tuned by vendor or user • Performance tuning challenges – Performance is a complicated function of kernel, architecture, compiler, and workload – Tedious and time-consuming • Successful automated approaches – Dense linear algebra: ATLAS/PHiPAC – Signal processing: FFTW/SPIRAL/UHFFT
Cache Blocked SpMV on LSI Matrix: Itanium
Sustainable Memory Bandwidth
Multiple Vector Performance: Pentium 4
Multiple Vector Performance: Itanium
Multiple Vector Performance: Pentium 4
Optimized AA^T*x Performance: Ultra 2i
Optimized AA^T*x Performance: Pentium 4
Tuning Pays Off—PHiPAC
Tuning Pays Off – ATLAS • Extends applicability of PHiPAC; incorporated in Matlab (with rest of LAPACK)
Register Tile Sizes (Dense Matrix Multiply) • 333 MHz Sun Ultra 2i • 2-D slice of 3-D space; implementations color-coded by performance in Mflop/s • 16 registers, but 2-by-3 tile size fastest
High Precision GEMV (XBLAS)
High Precision Algorithms (XBLAS) • Double-double (high precision word represented as a pair of doubles) – Many variations on these algorithms; we currently use Bailey’s • Exploiting extra-wide registers – Suppose s(1), …, s(n) have f-bit fractions and SUM has an F-bit fraction with F > f – Consider the following algorithm for S = Σ_{i=1..n} s(i): • Sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)| • SUM = 0; for i = 1 to n: SUM = SUM + s(i); end for; sum = SUM – Theorem (D., Hida): Suppose F < 2f (less than double precision) • If n ≤ 2^(F-f) + 1, then error ≤ 1.5 ulps • If n = 2^(F-f) + 2, then error ≤ 2^(2f-F) ulps (can be ≥ 1) • If n ≥ 2^(F-f) + 3, then error can be arbitrary (S ≠ 0 but sum = 0) – Examples • s(i) double (f=53), SUM double extended (F=64): accurate if n ≤ 2^11 + 1 = 2049 • Dot product of single precision x(i) and y(i): s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64): accurate if n ≤ 2^16 + 1 = 65537
More Extra Slides
Opteron (1.4GHz, 2.8GFlop peak) • Nonsymmetric peak = 504 MFlops • Symmetric peak = 612 MFlops • Beat ATLAS DGEMV’s 365 Mflops
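Returning to the extra-wide-register algorithm analyzed in the XBLAS slide above, the following sketch sorts the summands by decreasing magnitude, accumulates in a wider format (long double standing in for an F-bit extended register, F > f = 53), and rounds once at the end. It illustrates the algorithm in the theorem only; it is not the XBLAS code, and on platforms where long double is no wider than double the extra accuracy disappears.

/* Sorted accumulation in extended precision, as in the theorem above. */
#include <stdlib.h>
#include <math.h>

static int cmp_abs_desc(const void *pa, const void *pb)
{
    double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
    return (a < b) - (a > b);              /* larger magnitude first */
}

double sum_sorted_extended(double *s, int n) /* note: reorders s */
{
    qsort(s, n, sizeof(double), cmp_abs_desc);
    long double SUM = 0.0L;                /* F-bit accumulator, F > f */
    for (int i = 0; i < n; i++)
        SUM += s[i];                       /* error bounded for n <= 2^(F-f) + 1 */
    return (double)SUM;                    /* round back to an f-bit result */
}

With f = 53 and F = 64 this corresponds to the first example on the slide: the result is accurate as long as n stays at or below 2^11 + 1 = 2049.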
Awards • Best Paper, Intern. Conf. Parallel Processing, 2004 – “Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply” • Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 – Best Student Presentation too, to Richard Vuduc – “Automatic performance tuning and analysis of sparse triangular solve” • Finalist, Best Student Paper, Supercomputing 2002 – To Richard Vuduc – “Performance Optimization and Bounds for Sparse Matrix-vector Multiply” • Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000 – To Richard Vuduc – “Statistical Modeling of Feedback Data in an Automatic Tuning System”
Accuracy of the Tuning Heuristics (4/4) [DGEMV]
Can Match DGEMV Performance
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance • Machine balance [Callahan ’88; McCalpin ’95] – Balance = Peak Flop Rate / Bandwidth (flops / double) – Lower is better (i.e., can hope to get a higher fraction of peak flop rate) • Ideal balance for mat-vec: 2 flops / double – For SpMV, even less
Where Does the Time Go? • Model: Time = Σ_i α_i * Hits_i + α_mem * Hits_mem, where α_i is the access time at cache level i and α_mem the access time of main memory • Most time assigned to memory • Caches “disappear” when line sizes are equal – Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Execution Time Breakdown (PAPI): Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds • What is the best we can do for SpMV? – Limits to low-level tuning of blocked implementations – Refinements? • What machines are good for SpMV? – Partial answer: balance characterization • Architectural consequences? – Helps to have strictly increasing line sizes
Evaluating algorithms and machines for SpMV • Metrics – Speedups – Mflop/s (“fair” flops) – Fraction of peak • Questions – Speedups are good, but what is “the best?” • Independent of instruction scheduling, selection • Can SpMV be further improved or not? – What machines are “good” for SpMV?
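For reference, the execution-time model reconstructed in the "Where Does the Time Go?" slide above amounts to a one-line function. In this sketch the latencies alpha[i] would come from machine characterization (e.g., the Saavedra-Barrera or MAPS measurements cited earlier) and the hit counts from PAPI or a cache model; the names and the choice of cycles as units are assumptions of the example.

/* Time = sum_i alpha_i * Hits_i + alpha_mem * Hits_mem */
double time_model(int nlevels,
                  const double *alpha,    /* alpha[i]: access time of cache level i (cycles) */
                  const double *hits,     /* hits[i]:  hits at cache level i                 */
                  double alpha_mem, double hits_mem)
{
    double t = alpha_mem * hits_mem;
    for (int i = 0; i < nlevels; i++)
        t += alpha[i] * hits[i];
    return t;                             /* estimated cycles */
}

Evaluating the model with optimistic (lower-bound) hit counts gives the performance upper bounds used in the fraction-of-upper-bound comparisons above.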
Tuning Dense BLAS — PHiPAC
Tuning Dense BLAS – ATLAS • Extends applicability of PHiPAC; incorporated in Matlab (with rest of LAPACK)
Statistical Models for Automatic Tuning • Idea 1: Statistical criterion for stopping a search – A general search model • Generate implementation • Measure performance • Repeat – Stop when probability of being within ε of optimal falls below threshold • Can estimate distribution on-line • Idea 2: Statistical performance models – Problem: Choose 1 among m implementations at run-time – Sample performance off-line, build statistical model
SpMV Historical Trends: Fraction of Peak
Motivation for Automatic Performance Tuning of SpMV • Historical trends – Sparse matrix-vector multiply (SpMV): 10% of peak or less • Performance depends on machine, kernel, matrix – Matrix known at run-time – Best data structure + implementation can be surprising • Our approach: empirical performance modeling and algorithm search
Accuracy of the Tuning Heuristics (3/4) [DGEMV]
Accuracy of the Tuning Heuristics (3/4)
Sparse CALA Summary • Optimizing one SpMV is too limiting • Take k steps of a Krylov subspace method – GMRES, CG, Lanczos, Arnoldi – Assume matrix “well-partitioned,” with modest surface-to-volume ratio – Parallel implementation • Conventional: O(k log p) messages • CALA: O(log p) messages – optimal – Serial implementation • Conventional: O(k) moves of data from slow to fast memory • CALA: O(1) moves of data – optimal – Can incorporate some preconditioners • Need to be able to “compress” interactions between distant i, j • Hierarchical, semiseparable matrices … – Lots of speedup possible (measured and modeled) • Price: some redundant computation
Potential Impact on Applications: T3P • Application: accelerator design [Ko] • 80% of time spent in SpMV • Relevant optimization techniques – Symmetric storage – Register blocking • On single-processor Itanium 2 – 1.68x speedup • 532 Mflops, or 15% of 3.6 GFlop peak – 4.4x speedup with multiple (8) vectors • 1380 Mflops, or 38% of peak
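Finally, a sketch of the multiple-vector optimization referenced in the T3P slide above (the 4.4x speedup with 8 vectors): apply A to k right-hand sides at once so that each matrix entry, once loaded, is reused k times. The interleaved row-major layout of X and Y (leading dimension k) and the function name are choices made for this example; a tuned version would also unroll over the vectors and combine this with register blocking and symmetric storage.

/* Y += A*X for k dense vectors stored interleaved: X is n x k, Y is m x k. */
void spmv_multivec(int m, const int *ptr, const int *ind, const double *val,
                   int k, const double *X, double *Y)
{
    for (int i = 0; i < m; i++) {
        double *yi = Y + (long)i * k;
        for (int p = ptr[i]; p < ptr[i + 1]; p++) {
            double a = val[p];                 /* loaded once, used k times */
            const double *xj = X + (long)ind[p] * k;
            for (int v = 0; v < k; v++)
                yi[v] += a * xj[v];
        }
    }
}

Because SpMV is memory bound (only 2 flops per nonzero), amortizing each load of val[p] and ind[p] over k vectors is what turns the extra arithmetic into nearly free work.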