Autotuning sparse matrix kernels
Richard Vuduc
Center for Applied Scientific Computing (CASC), Lawrence Livermore National Laboratory
February 28, 2007

Predictions (2003)
• Need for “autotuning” will increase over time
  • Improve performance for a given app & machine using automated experiments
• Example: Sparse matrix-vector multiply (SpMV), 1987 to present
  • Untuned: 10% of peak or less, decreasing
  • Tuned: 2x speedup, increasing over time
• Tuning is getting harder (qualitative)
  • More complex machines & workloads
  • Parallelism

[Figure: Trends in uniprocessor SpMV performance (Mflop/s), pre-2004]
[Figure: Trends in uniprocessor SpMV performance (fraction of peak)]

Is tuning getting easier?

    // y <-- y + A*x
    for all A(i,j): y(i) += A(i,j) * x(j)

    // Compressed sparse row (CSR)
    for each row i:
        t = y[i]
        for k = ptr[i] to ptr[i+1]-1:
            t += A[k] * x[J[k]]
        y[i] = t

• Exploit 8x8 dense blocks
• As r x c ↑, Mflop/s ↑

Speedups on Itanium 2: The need for search
[Figure: SpMV Mflop/s by register block size on Itanium 2. Best: 4x2 (31.1% of peak); reference: 7.6% of peak.]

SpMV Performance: raefsky3
[Figure: SpMV performance on matrix raefsky3]

Better, worse, or about the same? Itanium 2, 900 MHz → 1.3 GHz
• Reference improves
• Best possible worsens slightly

Better, worse, or about the same? Pentium M → Core 2 Duo (1-core)
• Reference & best improve; relative speedup improves (~1.4 to 1.6)
• Note: Best fraction of peak decreased from 11% to 9.6%

Better, worse, or about the same? Power4 → Power5
• Reference worsens!
Power4 → Power5
• Relative importance of tuning increases

A framework for performance tuning
[Figure. Source: SciDAC Performance Engineering Research Institute (PERI)]

Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work

OSKI: Optimized Sparse Kernel Interface
• Autotuned kernels for user’s matrix & machine
  • BLAS-style interface: mat-vec (SpMV), tri. solve (TrSV), …
  • Hides complexity of run-time tuning
  • Includes fast locality-aware kernels: A^T*A*x, …
• Faster than standard implementations
  • Standard SpMV < 10% of peak, vs. up to 31% with OSKI
  • Up to 4x faster SpMV, 1.8x TrSV, 4x A^T*A*x, …
• For “advanced” users & solver library writers
  • PETSc extension available (OSKI-PETSc)
  • Kokkos (for Trilinos) by Heroux
  • Adopted by ClearShape, Inc. for shipping product (2x speedup)

Tunable matrix-specific optimization techniques
• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 3x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  • A^k*x: 2x over CSR, 1.5x over RB

Tuning for workloads
• Bi-conjugate gradients: equal mix of A*x and A^T*y
  • 3x1: A*x, A^T*y = 1053, 343 Mflop/s → 517 Mflop/s combined
  • 3x3: A*x, A^T*y = 806, 826 Mflop/s → 816 Mflop/s combined
• Higher-level operation: fused (A*x, A^T*y) kernel
  • 3x1: 757 Mflop/s
  • 3x3: 1400 Mflop/s
• Matrix powers (A^k*x) with data structure transformations
  • A^2*x: up to 2x faster
  • New latency-tolerant solvers?
    (Hoemmen’s thesis, ongoing at UCB)

How OSKI tunes (Overview)
• Library install-time (offline)
  1. Build for target architecture
  2. Benchmark
  • Produces: generated code variants, benchmark data
• Application run-time
  • Inputs: matrix, workload from program monitoring, history, benchmark data
  1. Evaluate models (heuristic models)
  2. Select data structure & code
  • To user: matrix handle for kernel calls
• Extensibility: Advanced users may write & dynamically add “code variants” and “heuristic models” to the system.

OSKI’s place in the tuning framework
[Figure]

Examples of OSKI’s early impact
• Early adopter: ClearShape, Inc.
  • Core product: lithography simulator
  • 2x speedup on full simulation after using OSKI
• Proof-of-concept: SLAC T3P accelerator cavity design simulator
  • SpMV dominates execution time
  • Symmetry, 2x2 block structure
  • 2x speedups

[Figure: OSKI-PETSc performance, accelerator cavity matrix]

Strengths and limitations of the library approach
• Strengths
  • Isolates optimization in the library for portable performance
  • Exploits domain-specific information aggressively
  • Handles run-time tuning naturally
• Limitations
  • “Generation Me”: What about my application and its abstractions?
  • Run-time tuning: run-time overheads
  • Limited context for optimization (without delayed evaluation)
  • Limited extensibility (fixed interfaces)

Tour of application-specific optimizations
• Five case studies
• Common characteristics
  • Complex code
  • Heavy use of abstraction
  • Use generated code (e.g., SWIG C++/Python bindings)
  • Benefit from extensive code and data restructuring
  • Multiple bottlenecks

[1] Loop transformations for SMG2000
• SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)
• Residual computation has an SpMV bottleneck
• The loop below looks simple but is non-trivial to extract

    for (si = 0; si < NS; ++si)
      for (k = 0; k < NZ; ++k)
        for (j = 0; j < NY; ++j)
          for (i = 0; i < NX; ++i)
            r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                                * x[i + j*JX + k*KX + Sx[si]];

[1] SMG2000 demo

[1] Before transformation

    for (si = 0; si < NS; si++) {            /* Loop1 */
      for (kk = 0; kk < NZ; kk++) {          /* Loop2 */
        for (jj = 0; jj < NY; jj++) {        /* Loop3 */
          for (ii = 0; ii < NX; ii++) {      /* Loop4 */
            r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]]
                                   * x[ii + jj*Jx + kk*Kx + Sx[si]];
          } /* Loop4 */
        } /* Loop3 */
      } /* Loop2 */
    } /* Loop1 */

[1] After transformation, including interchange, unrolling, and prefetching

    for (kk = 0; kk < NZ; kk++) {            /* Loop2 */
      for (jj = 0; jj < NY; jj++) {          /* Loop3 */
        for (si = 0; si < NS; si++) {        /* Loop1 */
          double* rp = r + kk*Kr + jj*Jr;
          const double* Ap = A + kk*KA + jj*JA + SA[si];
          const double* xp = x + kk*Kx + jj*Jx + Sx[si];
          for (ii = 0; ii <= NX-3; ii += 3) {  /* core Loop4 */
            _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
            _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
            rp[0] -= Ap[0] * xp[0];
            rp[1] -= Ap[1] * xp[1];
            rp[2] -= Ap[2] * xp[2];
            rp += 3; Ap += 3; xp += 3;
          } /* core Loop4 */
          for ( ; ii < NX; ii++) {             /* fringe Loop4 */
            rp[0] -= Ap[0] * xp[0];
            rp++; Ap++; xp++;
          } /* fringe Loop4 */
        } /* Loop1 */
      } /* Loop3 */
    } /* Loop2 */

[1] Loop transformations for SMG2000
• 2x speedup on kernel from specialization, loop interchange, unrolling, prefetching
  • But only 1.25x overall: multiple bottlenecks
• Lesson: Need complex sequences of transformations
  • Use profiling to guide
  • Inspect run-time data for specialization
  • Transformations are automatable
• Research topic: Automated specialization of hypre?

[2] Slicing and dicing Omega3P
• Accelerator design code from SLAC
• calcBasis() very expensive
• Scaling problems as |Eigensystem| grows
• In principle, loop interchange or precomputation via slicing possible
• Challenges in practice
  • “Loop nest” ~ 500+ LOC
  • 150+ LOC to calcBasis()
  • calcBasis() in 6-deep call chain, 4-deep loop nest, 2 conditionals
  • File I/O
  • Changes must be unobtrusive

    /* Post-processing phase */
    foreach mode in Eigensystem
      foreach elem in Mesh
        // { …
        b = calcBasis (elem)
        // }
        f = calcField (b, mode)
        writeDataToFiles (…);

[2] Omega3P: Impact and lessons
• 4-5x speedup for post-processing step; 1.5x overall
• Changes “checked-in”
• Lesson: Need clean source-level transformations
  • To automate, need robust program analysis and developer guidance
• Research: Annotation framework for developers [w/ Quinlan, Schordan, Yi: POHLL’06]

[3] Structure splitting
• Convert (array of structs) into (struct of arrays)
  • Improve spatial locality through increased stride-1 accesses
  • Make code hardware-prefetch and vector/SIMD unit “friendly”

Before (array of structs):

    struct Type {
      double p;
      double x, y, z;
      double E;
      int k;
    } X[N], Y[N];

    for (i = 0; i < N; i++)
      Y[i].E += Y[X[i].k].p;

After (one array per field):

    double Xp[N];
    double Xx[N], Xy[N], Xz[N];
    double XE[N];
    int Xk[N];
    // … same for Y …

    for (i = 0; i < N; i++)
      YE[i] += Yp[Xk[i]];
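
The structure-splitting idea above can be sketched in compilable C. This is a minimal illustration under stated assumptions, not the code used in the KULL benchmark: the struct is renamed `Type` with a typedef, and `update_aos`, `update_soa`, and `split_fields` are hypothetical helper names introduced here. The point is that the split loop touches only the three arrays it needs, while the array-of-structs loop pulls every field of each struct through the cache.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structs layout, mirroring `struct Type` on the slide. */
typedef struct {
    double p;
    double x, y, z;
    double E;
    int k;
} Type;

/* Original loop: every access loads a whole struct's cache footprint,
 * although only the fields p, E, and k are used. */
static void update_aos(Type *Y, const Type *X, size_t n)
{
    for (size_t i = 0; i < n; i++)
        Y[i].E += Y[X[i].k].p;
}

/* Split loop: YE and Xk are read with stride 1; only Yp is gathered. */
static void update_soa(double *YE, const double *Yp, const int *Xk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        YE[i] += Yp[Xk[i]];
}

/* Hypothetical helper (not from the slides): copy out just the fields the
 * kernel needs. This copy is the price of applying the split only locally. */
static void split_fields(const Type *X, const Type *Y, size_t n,
                         int *Xk, double *Yp, double *YE)
{
    for (size_t i = 0; i < n; i++) {
        Xk[i] = X[i].k;
        Yp[i] = Y[i].p;
        YE[i] = Y[i].E;
    }
}
```

Calling `split_fields` once and then `update_soa` gives the same values in `YE` as `update_aos` leaves in `Y[i].E`; whether the copy pays off depends on how many times the loop runs afterwards.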
[3] Structure splitting: Impact and challenges
• 2x speedup on a KULL benchmark (suggested by Brian Miller)
• Implementation challenges
  • Potentially affects entire code
  • Can apply only locally, at a cost
    • Extra storage
    • Overhead of copying
  • Tedious to do by hand
• Lesson: Extensive data restructuring may be necessary
• Research: When and how best to split?

[4] Finding a loop-fusion needle in a haystack
• Interprocedural loop fusion finder [w/ B. White: Cornell U.]
  • Known example had 2x speedup on benchmark (Miller)
• Built “abstraction-aware” analyzer using ROSE
  • First pass: Associate “loop signatures” with each function
  • Second pass: Propagate signatures through call chains

    for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
      for (Corner::iterator c = (*z).corners().begin (); …)
        for (int s = 0; s < c->sides().size(); s++)
          …

• Found 6 examples of 3- and 4-deep nested loops
• “Analysis-only” tool
  • Finds, though does not verify/transform
• Lesson: “Classical” optimizations relevant to abstraction use
• Research
  • Recognizing and optimizing abstractions [White’s thesis, ongoing]
  • Extending traditional optimizations to abstraction use

[5] Aggregating messages (ongoing)
• Idea: Merge sends (suggested by Miller)

Before:

    DataType A;
    // … operations on A …
    A.allToAll();
    // …
    DataType B;
    // … operations on B …
    B.allToAll();

After:

    DataType A;
    // … operations on A …
    // …
    DataType B;
    // … operations on B …
    bulkAllToAll(A, B);

• Implementing a fully automated translator to find and transform
• Research: When and how best to aggregate?
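
The message-aggregation idea above can be illustrated without a communication library. The sketch below is only an assumption-laden model: `fake_all_to_all` is a hypothetical stand-in that counts calls and copies its input to its output (a real code would invoke its collective there), and `exchange_separately` and `bulk_all_to_all` are names introduced here. It shows the pack, single exchange, unpack pattern that replaces two collectives with one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a collective such as an all-to-all exchange.
 * It only counts invocations and copies send -> recv. */
static int num_collectives = 0;

static void fake_all_to_all(const double *send, double *recv, size_t n)
{
    num_collectives++;
    memcpy(recv, send, n * sizeof *send);
}

/* Unaggregated: one collective per data object.
 * `scratch` must hold at least max(na, nb) doubles. */
static void exchange_separately(double *a, size_t na, double *b, size_t nb,
                                double *scratch)
{
    fake_all_to_all(a, scratch, na);
    memcpy(a, scratch, na * sizeof *a);
    fake_all_to_all(b, scratch, nb);
    memcpy(b, scratch, nb * sizeof *b);
}

/* Aggregated: pack A and B into one staging buffer, exchange once, unpack.
 * `send_buf` and `recv_buf` must each hold na + nb doubles. */
static void bulk_all_to_all(double *a, size_t na, double *b, size_t nb,
                            double *send_buf, double *recv_buf)
{
    memcpy(send_buf, a, na * sizeof *a);
    memcpy(send_buf + na, b, nb * sizeof *b);
    fake_all_to_all(send_buf, recv_buf, na + nb);
    memcpy(a, recv_buf, na * sizeof *a);
    memcpy(b, recv_buf + na, nb * sizeof *b);
}
```

The aggregated version trades two per-message latencies for extra packing copies, which is one way to read the research question of when aggregation is worthwhile.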
Summary of application-specific optimizations
• Like library-based approach, exploit knowledge for big gains
  • Guidance from developer
  • Use run-time information
• Would benefit from automated transformation tools
  • Real code is hard to process
  • Changes may become part of software re-engineering
  • Need robust analysis and transformation infrastructure
  • Range of tools possible: analysis and/or transformation
• No silver bullets or magic compilers

A framework for performance tuning
[Figure. Source: SciDAC Performance Engineering Research Institute (PERI)]

OSKI’s place in the tuning framework
[Figure]

An empirical tuning framework using ROSE
[Figure: components shown are gprof, HPCToolkit, Open SpeedShop, POET, and a search engine]

An end-to-end autotuning framework using ROSE
• Guiding philosophy
  • Leverage external stand-alone components
  • Provide open components and tools for community
• User or “system” profiles to collect data and/or analyses
• In ROSE
  • Mark up AST with data/analysis, to identify optimizable target(s)
  • Outline target into stand-alone dynamically loadable library routine
  • Make “benchmark” by inserting checkpoint library calls into app
  • Generate parameterized representation of target
• Independent search engine performs search

Interfaces to performance tools
• Mark up AST with data, analysis, to identify optimizable target(s)
  • gprof
  • HPCToolkit [Mellor-Crummey: Rice]
  • VizzAnalyzer / Vizz3D [Panas: LLNL]
  • In progress: Open SpeedShop [Schulz: LLNL]
• Needed: Analysis to identify targets

Outlining
• Outline target into dynamically loadable library routine
  • Extends initial implementations by Liao [U.
    Houston], Jula [TAMU]
  • Handles many details of C & C++
    • Wraps up variables, inserts declarations, generates call
    • Produces suitable interfaces for dynamic loading
    • Handles non-local control flow

    void OUT_38725__ (double* r, int JR, int KR, const double* A, …)
    {
      int si, j, k, i;
      for (si = 0; si < NS; si++)
        …
          r[i + j*JR + k*KR] -= A[i + …

Making a benchmark
• Make “benchmark” by inserting checkpoint library calls
  • Measure application behavior “in context”
  • Use ckpt (user-level) [Zander: U. Wisc.]
  • Insert timing code (cycle counter)
  • May insert arbitrary code to distinguish calling contexts
• Reasonably fast in practice
  • Checkpoint read/write bandwidth: 500 MB/s on my Pentium-M
  • For SMG2000: Problem consuming ~500 MB footprint takes ~30 s to run
• Needed
  • Best procedure to get accurate and fair comparisons?
  • Do restarts resume in comparable states?
  • More portable checkpoint library

Example of “benchmark” (pseudo)code

    static int num_calls = 0;  // no. of invocations of outlined code

    if (!num_calls) {
      ckpt ();                      // Checkpoint/resume
      OUT_38725__ = dlsym (…);      // Load an implementation
      startTimer ();
    }
    OUT_38725__ (…);                // outlined call-site
    if (++num_calls == CALL_LIMIT) {  // Measured CALL_LIMIT calls
      stopTimer ();
      outputTime ();
      exit (0);
    }

Generating parameterized representations
• Generate parameterized representation of target
• POET: Embedded scripting language for expressing parameterized code variations [see POHLL’07]
• Loop optimizer will generate POET for each target
• Hand-coded POET for SMG2000
  • Interchange
  • Machine-specific: Unrolling, prefetching
  • Source-specific: register & restrict keywords, C pointer idiom
• New parameterization for loop fusion [Zhao, Kennedy: Rice; Yi: UTSA]

SMG2000 kernel POET instantiation

    for (kk = 0; kk < NZ; kk++) {            /* L4 */
      for (jj = 0; jj < NY; jj++) {          /* L3 */
        for (si = 0; si < NS; si++) {        /* L1 */
          double* rp = r + kk*Kr + jj*Jr;
          const double* Ap = A + kk*KA + jj*JA + SA[si];
          const double* xp = x + kk*Kx + jj*Jx + Sx[si];
          for (ii = 0; ii <= NX-3; ii += 3) {  /* core L2 */
            _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
            _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
            rp[0] -= Ap[0] * xp[0];
            rp[1] -= Ap[1] * xp[1];
            rp[2] -= Ap[2] * xp[2];
            rp += 3; Ap += 3; xp += 3;
          } /* core L2 */
          for ( ; ii < NX; ii++) {             /* fringe L2 */
            rp[0] -= Ap[0] * xp[0];
            rp++; Ap++; xp++;
          } /* fringe L2 */
        } /* L1 */
      } /* L3 */
    } /* L4 */

Search
• We are search-engine agnostic
• Many possible hybrid modeling/search techniques

Summary of autotuning compiler approach
• End-to-end framework leverages existing work
• ROSE provides a heavy-duty (robust) source-level infrastructure
• Assemble stand-alone components
• Current and future work
  • Assembling a more complete end-to-end example
  • Interfaces between components?
  • Extending basic ROSE infrastructure, particularly program analysis

Current and future research directions
• Autotuning
  • End-to-end autotuning compiler framework
  • Tuning for novel architectures (e.g., multicore)
  • Tools for generating domain-specific libraries
• Performance modeling
  • Kernel- and machine-specific analytical and statistical models
  • Hybrid symbolic/empirical modeling
  • Implications for applications and architectures?
• Tools for debugging massively parallel applications
  • JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]
  • Static/dynamic analyses for debugging MPI

End

What is ROSE?
• Research: Develop techniques to optimize applications that rely heavily on high-level abstractions
  • Target scientific computing apps relevant to DOE/LLNL
  • Domain-specific analysis and optimization
  • Optimize use of object-oriented abstractions
  • Performance portability via empirical tuning
• Infrastructure: Tool for building source-to-source optimizers
  • Full compiler: basic program analysis, loop optimizer, OpenMP [UH]
  • Support for C, C++; Fortran90 in progress
  • Target “non-compiler audience”
  • Open-source
Bug hunting in MPI programs
• Motivation: MPI is a large, complex API
• Bug pattern detectors
  • Check basic API usage
  • Adapt existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC’05
• Tasks requiring deeper program analysis
  • Properly matched sends/receives, barriers, collectives
  • Buffer errors, e.g., overruns, read before non-blocking op completes
  • Temporal usage properties
  • See error survey by DeSouza, Kuhn, & de Supinski ’05
  • Extend existing analyses by Shires, et al., PDPTA’99; Strout, et al. ICPP’06

Compiler-based testing tools
• Instrumentation and dynamic analysis to measure coverage [IBM]
• Measurement-unit validation via Osprey [Jiang and Su, UC Davis]
• Numerical interval/bounds analysis [Sun]
• Interface to MOPS model-checker [Collingbourne, Imperial College]
• Interactive program visualization via VizzAnalyzer [Panas, LLNL]

[Figure: Trends in uniprocessor SpMV performance (absolute Mflop/s)]
[Figure: Trends in uniprocessor SpMV performance (fraction of peak)]

Motivation: The Difficulty of Tuning SpMV

    // y <-- y + A*x
    for all A(i,j): y(i) += A(i,j) * x(j)

    // Compressed sparse row (CSR)
    for each row i:
        t = y[i]
        for k = ptr[i] to ptr[i+1]-1:
            t += A[k] * x[J[k]]
        y[i] = t

• Exploit 8x8 dense blocks

Speedups on Itanium 2:
The Need for Search
[Figure: SpMV Mflop/s by register block size on Itanium 2. Best: 4x2 (31.1% of peak); reference: 7.6% of peak.]

SpMV Performance: raefsky3
[Figure: SpMV performance on matrix raefsky3]

Better, worse, or about the same? Pentium 4, 1.5 GHz → Xeon, 3.2 GHz
• Faster, but relative improvement increases (20% → ~50%)

Problem-Specific Performance Tuning

Problem-Specific Optimization Techniques
• Optimizations for SpMV
  • Register blocking (RB): up to 4x over CSR
  • Variable block splitting: 2.1x over CSR, 1.8x over RB
  • Diagonals: 2x over CSR
  • Reordering to create dense structure + splitting: 2x over CSR
  • Symmetry: 2.8x over CSR, 2.6x over RB
  • Cache blocking: 3x over CSR
  • Multiple vectors (SpMM): 7x over CSR
  • And combinations…
• Sparse triangular solve
  • Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  • A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  • A^k*x: 2x over CSR, 1.5x over RB

BCSR Captures Regularly Aligned Blocks
• n = 21216, nnz = 1.5 M
• Source: NASA structural analysis problem
• 8x8 dense substructure
• Reduces storage

Problem: Forced Alignment Implies UBCSR
• BCSR(2x2): stored / true nz = 1.24
• BCSR(3x3): stored / true nz = 1.46
  • Forces i mod 3 = j mod 3 = 0
• Unaligned BCSR (UBCSR) format: store row indices

The Speedup Gap: BCSR vs.
CSR
[Figure: The Speedup Gap. Speedup of BCSR over CSR, by machine; a 1.1 to 1.5x gap is marked.]

Approach: Splitting + Relaxed Block Alignment
• Goal: Close the gap between FEM classes
• Our approach: Capture actual structure more precisely
  • Split: A = A1 + A2 + … + As
  • Store each Ai in unaligned BCSR (UBCSR) format
  • Relax both row and column alignment
    • Buttari, et al. (2005) show improvements from relaxed column alignment
• 2.1x over no blocking, 1.8x over blocking
• When not faster than BCSR, may still reduce storage

Variable Block Row (VBR) Analysis
• Partition by grouping consecutive rows/columns having the same pattern

From VBR, Identify Multiple Natural Block Sizes
[Figure]

VBR with Fill
• Can also pad by matching rows/columns with nearly similar patterns
• Define VBR(θ) = VBR where consecutive rows are grouped when “similarity” ≥ θ, 0 ≤ θ ≤ 1
[Figure: VBR with fill, θ = 1 vs. θ = 0.7 (fill of 1%)]

A Complex Tuning Problem
• Many parameters need “tuning”
  • Fill threshold, 0.5 ≤ θ ≤ 1
  • Number of splittings, 2 ≤ s ≤ 4
  • Ordering of block sizes, r_i x c_i; r_s x c_s = 1x1
• See paper in HPCC 2005 for proof-of-concept experiments based on a semi-exhaustive search
  • Heuristic in progress (uses Buttari, et al. (2005) work)

Matrices (FEM 2)

    Matrix        Description         Dimension  # non-zeros  Dominant blocks
    10-ct20stif   Engine block        52k        2.7M         6x6 (39%), 3x3 (15%)
    12-raefsky4   Buckling            20k        1.3M         3x3 (96%)
    13-ex11       Fluid flow          16k        1.1M         1x1 (38%), 3x3 (23%)
    15-Vavasis3   2D PDE              41k        1.7M         2x1 (81%), 2x2 (19%)
    17-rim        Fluid flow          23k        1.0M         1x1 (75%), 3x1 (12%)
    A-bmw7st_1    Car chassis         141k       7.3M         6x6 (82%)
    B-cop20k_m    Accel. cavity       121k       4.8M         2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%)
    C-pwtk        Wind tunnel         218k       11.6M        6x6 (94%)
    D-rma10       Charleston Harbor   47k        2.4M         2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%)
    E-s3dkqm4     (not given)         90k        4.8M         6x6 (99%)

[Figure: Power 4 Performance]
[Figure: Storage Savings]

Traveling Salesman Problem-based Reordering
• Application: Stanford accelerator design problem (Omega3P)
• Reorder by approximately solving TSP [Pinar & Heath ’97]
  • Nodes = columns of A
  • Weights(u, v) = no.
    of nz u, v have in common
  • Tour = ordering of columns
  • Choose maximum weight tour
  • See [Pinar & Heath ’97]
• Also: symmetric storage, register blocking
• Manually selected optimizations
• Problem: High cost of computing approximate solution to TSP

[Figure: 100x100 Submatrix Along Diagonal]
[Figure: “Microscopic” Effect of Combined RCM+TSP Reordering. Before: green + red. After: green + blue.]

Inter-Iteration Sparse Tiling
• Idea: Strout, et al., ICCS 2001
• Let A be 5x5 tridiagonal
• Consider y = A^2*x: t = A*x, y = A*t
• Nodes: vector elements (x1..x5, t1..t5, y1..y5)
• Edges: matrix elements a_ij
• Orange = everything needed to compute y1
  • Reuse a11, a12
• Grey = y2, y3
  • Reuse a23, a33, a43
[Figure: dependence graph with columns x, t, y]

[Figure: Serial Sparse Tiling Performance (Itanium 2)]

OSKI Software Architecture and API

Empirical Model Evaluation
• Tuning loop
  • Compute a “tuning time budget” based on workload
  • While (time remains and no tuning chosen)
    • Try a heuristic
• Heuristic for blocked SpMV: Choose r x c to minimize
    predicted time(A,r,c) = estimated flops(A,r,c) / benchmark Mflop/s(r,c)
• Tuning for workloads
  • Weighted sums of empirical models
  • Dynamic programming for alternatives
    • Example: Combined y = A^T*A*x vs.
      separate (w = A*x, y = A^T*w)

Cost of Tuning
• Non-trivial run-time cost: up to ~40 mat-vecs
  • Dominated by conversion time (~80%)
• Design point: user calls “tune” routine explicitly
  • Exposes cost
  • Tuning time limited using estimated workload
    • Provided by user or inferred by library
• User may save tuning results
  • To apply on future runs with similar matrix
  • Stored in “human-readable” format

Interface supports legacy app migration

Before (legacy code):

    int* ptr = …, *ind = …; double* val = …;  /* Matrix A, in CSR format */
    double* x = …, *y = …;                    /* Vectors */

    /* Compute y = β·y + α·A·x, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, α, x, β, y );
    r = ddot (x, y);  /* Some dense BLAS op on vectors */

After (three added steps):

    int* ptr = …, *ind = …; double* val = …;  /* Matrix A, in CSR format */
    double* x = …, *y = …;                    /* Vectors */

    /* Step 1: Create OSKI wrappers */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
        num_cols, SHARE_INPUTMAT, …);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Step 2: Call tune (with optional hints) */
    oski_SetHintMatMult (A_tunable, …, 500);
    oski_TuneMat (A_tunable);

    /* Compute y = β·y + α·A·x, 500 times */
    for( i = 0; i < 500; i++ )
      oski_MatMult (A_tunable, OP_NORMAL, α, x_view, β, y_view);  /* Step 3 */
    r = ddot (x, y);

Quick-and-dirty Parallelism: OSKI-PETSc
• Extend PETSc’s distributed memory SpMV (MATMPIAIJ)
• PETSc
  • Each process stores diag (all-local) and off-diag submatrices
• OSKI-PETSc:
  • Add OSKI wrappers
  • Each submatrix tuned independently
[Figure: matrix rows distributed across processes p0, p1, p2, p3]

OSKI-PETSc Proof-of-Concept Results
• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  • N ~ 1 M, ~40 M non-zeros
  • 2x2 dense block substructure
  • Symmetric
• Matrix 2: Linear programming (Italian Railways)
  • Short-and-fat: 4k x 1M, ~11M non-zeros
  • Highly unstructured
  • Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
  • Peak: 4.8 Gflop/s per node

[Figure: Accelerator cavity matrix from SLAC’s T3P code]

Additional Features: OSKI-Lua
• Embedded scripting language for selecting customized, complex transformations
• Mechanism to save/restore transformations

    /* In “my_app.c” */
    fp = fopen("my_xform.txt", "rt");
    fgets(buffer, BUFSIZE, fp);
    oski_ApplyMatTransform(A_tunable, buffer);
    oski_MatMult(A_tunable, …);

    # In file, “my_xform.txt”
    # Compute A_fast = P*A*P^T using Pinar’s reordering algorithm
    A_fast, P = reorder_TSP(InputMat);
    # Split A_fast = A1 + A2, where A1 in 2x2 block format, A2 in CSR
    A1, A2 = A_fast.extract_blocks(2, 2);
    return transpose(P)*(A1+A2)*P;

Current Work and Future Directions

Current and Future Work on OSKI
• OSKI 1.0.1 at bebop.cs.berkeley.edu/oski
  • “Pre-alpha” version of OSKI-PETSc available; “Beta” for Kokkos (Trilinos)
• Future work
  • Evaluation on
    full solves/apps
    • Bay area lithography shop: 2x speedup in full solve
  • Code generators
  • Studying use of higher-level OSKI kernels
  • Port to additional architectures (e.g., vectors, SMPs)
  • Additional heuristics [Buttari, et al. (2005)]
• Many BeBOP projects ongoing
  • SpMV benchmark for HPC-Challenge [Gavari & Hoemmen]
  • Evaluation of Cell [Williams]
  • Higher-level kernels, solvers [Hoemmen, Nishtala]
  • Tuning collective communications [Nishtala]
  • Cache-oblivious stencils [Kamil]

ROSE: A Compiler-Based Approach to Tuning General Applications
• ROSE: Tool for building customized source-to-source tools (Quinlan, et al.)
  • Full support for C and C++; Fortran90 in development
  • Targets users with little or no compiler background
• Focus on performance optimization for scientific computing
  • Domain-specific analysis and optimizations
  • Object-oriented abstraction recognition
  • Rich loop-transformation support
  • Annotation language support
• Additional infrastructure support for s/w assurance, testing, and debugging
• Toward an end-to-end empirical tuning compiler
  • Combines profiling, checkpointing, analysis, parameterized code generation, search
  • Joint work with Qing Yi (University of Texas at San Antonio)
• Sponsored by U.S. Department of Energy

ROSE Architecture
[Figure: Application and library interface annotations → front-end (EDG-based) → AST → mid-end (AST tools: abstraction recognition, abstraction-aware analysis, abstraction elimination, extended traditional optimizations, source+AST transformations, operating on source and AST fragments) → AST → back-end → transformed application source]
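
As a closing reference for the CSR pseudocode and the register-blocking optimization discussed in the motivation slides, the two kernels can be sketched in C. This is a minimal sketch under stated assumptions, not OSKI's implementation: the function names, the 2x2 block size, and the storage conventions (row-major blocks, `bind[k]` holding the first column of block k) are choices made here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* y <- y + A*x with A in compressed sparse row (CSR) format:
 * ptr has n_rows+1 entries; ind and val hold one entry per non-zero. */
static void spmv_csr(size_t n_rows, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double t = y[i];
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

/* Same operation with A in 2x2 block CSR (BCSR).
 * Assumed layout: bptr indexes blocks per block row, bind[k] is the first
 * column of block k, and bval stores each block as 4 doubles, row-major.
 * Explicit zeros are stored wherever a block is not completely full. */
static void spmv_bcsr_2x2(size_t n_block_rows, const int *bptr,
                          const int *bind, const double *bval,
                          const double *x, double *y)
{
    for (size_t I = 0; I < n_block_rows; I++) {
        double y0 = y[2 * I], y1 = y[2 * I + 1];   /* kept in registers */
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b = bval + 4 * (size_t)k;
            const double *xp = x + bind[k];
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * I] = y0;
        y[2 * I + 1] = y1;
    }
}
```

The blocked kernel stores one column index per block instead of one per non-zero and reuses two entries of x and two of y per block; the stored zeros are the fill whose cost the block-size search has to weigh against those savings.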