
Autotuning sparse matrix kernels
Richard Vuduc
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
April 2, 2007
Riddle: Who am I?
I am a
Big Repository
Of useful
And useless
Facts alike.
View from space
Why “autotune” sparse kernels?
 Sparse kernels abound, but usually perform poorly
 Physical & economic models, PageRank, …
 Sparse matrix-vector multiply (SpMV) <= 10% peak, decreasing
 Performance challenges
 Indirect, irregular memory access
 Low computational intensity vs. dense linear algebra
 Depends on matrix (run-time) and machine
 Claim: Need automatic performance tuning
 2× speedup from tuning, will increase
 Manual tuning is difficult, getting harder
 Tune target app, input, machine using automated experiments
View from space
This talk: Key ideas and conclusions
 OSKI tunes sparse kernels automatically
 Off-line benchmarking + empirical run-time models
 Exploit structure aggressively
 SpMV: 4× speedups, 31% of peak
 Higher-level kernels: AᵀA·x, Aᵏ·x, …
 Application impact
 Autotuning compiler? (very early-stage work)
 Different talks: Other work in high-perf. computing (HPC)
 Analytical and statistical performance modeling
 Compiler-based domain-specific optimization
 Parallel debugging using static and dynamic analysis
View from space
Outline
 A case for sparse kernel autotuning
 Trends?
 Why tune?
 OSKI
 Autotuning compiler
 Summary and future directions
A case for sparse kernel autotuning
SpMV crash course:
Compressed Sparse Row (CSR) storage
 Matrix-vector multiply: y = A*x
 for all A(i, j): y(i) = y(i) + A(i, j) * x(j)
Dominant cost: the irregular, indirect accesses x[ind[…]]
Ideas: compress the index data; “regularize” the accesses
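For concreteness, a minimal CSR SpMV sketch (hypothetical array names: ptr holds row starts, ind the column indices, val the nonzero values) shows where the indirection comes from:

#include <stddef.h>

/* y += A*x for an m-row CSR matrix: ptr[i]..ptr[i+1]-1 index the nonzeros
   of row i; ind[k] is the column of val[k].  Note the indirect access
   x[ind[k]], which is the dominant cost referred to above. */
void csr_spmv(size_t m, const size_t *ptr, const size_t *ind,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double yi = y[i];
        for (size_t k = ptr[i]; k < ptr[i+1]; ++k)
            yi += val[k] * x[ind[k]];   /* irregular, indirect load of x */
        y[i] = yi;
    }
}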
A case for sparse kernel autotuning
Experiment: How hard is SpMV tuning?
 Exploit 8×8 blocks
 Compress data
 Unroll block multiplies
 Regularize accesses
 As r×c increases, speed should increase
A case for sparse kernel autotuning
Speedups on Itanium 2: The need for search
[Figure: SpMV performance (Mflop/s) for all register block sizes. Best: 4×2 blocks at 31.1% of peak; reference (1×1) at 7.6% of peak.]
A case for sparse kernel autotuning
SpMV Performance—raefsky3
A case for sparse kernel autotuning
SpMV Performance—raefsky3
A case for sparse kernel autotuning
Better, worse, or about the same?
A case for sparse kernel autotuning
Better, worse, or about the same?
Itanium 2, 900 MHz → 1.3 GHz
* Reference improves *
* Best possible worsens slightly *
A case for sparse kernel autotuning
Better, worse, or about the same?
Power4 → Power5
* Reference worsens! *
* Relative importance of tuning increases *
A case for sparse kernel autotuning
Better, worse, or about the same?
Pentium M → Core 2 Duo (1 core)
* Reference & best improve; relative speedup improves (~1.4× to 1.6×) *
* Best decreases from 11% to 9.6% of peak *
A case for sparse kernel autotuning
More complex structures in practice
 Example: 33 blocking
 Logical grid of 33 cells
A case for sparse kernel autotuning
Extra work can improve efficiency!
 Example: 3×3 blocking
 Logical grid of 3×3 cells
 Fill in explicit zeros
 Unroll 3×3 block multiplies
 “Fill ratio” = 1.5
 On Pentium III: 1.5× speedup
 i.e., 2/3 the time of unblocked CSR
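For illustration, a hypothetical 3×3 BCSR (block CSR) SpMV kernel; array names are illustrative, and each stored block includes any explicitly filled-in zeros:

/* y += A*x for 3x3 register-blocked (BCSR) storage: brow_ptr/bcol_ind index
   block rows/columns; bval stores each 3x3 block densely, row-major.
   Dimensions are assumed padded to multiples of 3.  The fully unrolled
   block multiply trades extra flops (on fill) for regular access. */
void bcsr3x3_spmv(int n_brows, const int *brow_ptr, const int *bcol_ind,
                  const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_brows; ++I) {
        double y0 = 0.0, y1 = 0.0, y2 = 0.0;          /* block row of y */
        for (int k = brow_ptr[I]; k < brow_ptr[I+1]; ++k) {
            const double *a  = bval + 9*k;            /* one 3x3 block  */
            const double *xp = x + 3*bcol_ind[k];     /* matching x     */
            y0 += a[0]*xp[0] + a[1]*xp[1] + a[2]*xp[2];
            y1 += a[3]*xp[0] + a[4]*xp[1] + a[5]*xp[2];
            y2 += a[6]*xp[0] + a[7]*xp[1] + a[8]*xp[2];
        }
        y[3*I+0] += y0; y[3*I+1] += y1; y[3*I+2] += y2;
    }
}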
A case for sparse kernel autotuning
Trends: My predictions from 2003
 Need for “autotuning” will increase over time
 So kindly approve my dissertation topic
 Example: SpMV, 1987 to present
 Untuned: 10% of peak or less, decreasing
 Tuned: 2× speedup, increasing over time
 Tuning is getting harder (qualitative)
 More complex machines & workloads
 Parallelism
A case for sparse kernel autotuning
Trends in uniprocessor SpMV performance
(Mflop/s), pre-2004
A case for sparse kernel autotuning
Trends in uniprocessor SpMV performance
(fraction of peak)
A case for sparse kernel autotuning
Summary: A case for sparse kernel autotuning
 “Trends” for SpMV
 10% of peak, decreasing
 2× speedup, increasing
 No guarantees with new-generation systems
 Manual tuning is “hard”
 Sparse differs from dense
 Not LINPACK!
 Data structure overheads
 Indirect, irregular memory access
 Run-time dependence (i.e., on the matrix)
A case for sparse kernel autotuning
Outline
 A case for sparse kernel autotuning
 OSKI
 Tuning is hard—how to automate?
 Autotuning compiler
 Summary and future directions
Basic approach to autotuning
 High-level idea: Given kernel, machine…
 Identify “space” of candidate implementations
 Generate implementations in the space
 Choose (search for) fastest by modeling and/or experiments
 Example: Dense matrix multiply
 Space = {Impl(R, C)}, where R x C is cache block size
 Measure time to run Impl(R, C) for all R, C
 Perform this expensive search once per machine
 Successes: ATLAS/PHiPAC, FFTW, SPIRAL, …
 SpMV: Depends on run-time input (matrix)!
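As a hypothetical sketch of that search step (in the spirit of ATLAS/PHiPAC): time every candidate Impl(R, C) once on the target machine and record the winner. Here time_impl stands in for a user-supplied routine that runs and times one candidate; it is not part of any real library:

/* Off-line search sketch: exhaustively time Impl(R, C) and keep the fastest.
   The winning R x C is recorded once per machine and reused at run time. */
void search_block_size(int max_r, int max_c,
                       double (*time_impl)(int R, int C),
                       int *best_r, int *best_c)
{
    double best_time = 1e300;
    *best_r = 1; *best_c = 1;
    for (int R = 1; R <= max_r; ++R)
        for (int C = 1; C <= max_c; ++C) {
            double t = time_impl(R, C);      /* empirical measurement */
            if (t < best_time) { best_time = t; *best_r = R; *best_c = C; }
        }
}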
The view from space
OSKI: Optimized Sparse Kernel Interface
 Autotuned kernels for user’s matrix & machine
 a la PHiPAC/ATLAS, FFTW, SPIRAL, …
 BLAS-style interface: mat-vec (SpMV), tri. solve (TrSV), …
 Hides complexity of run-time tuning
 Includes fast locality-aware kernels: AᵀA·x, Aᵏ·x, …
 Fast in practice
 Standard SpMV < 10% of peak, vs. up to 31% with OSKI
 Up to 4× faster SpMV, 1.8× triangular solve, 4× AᵀA·x, …
 For “advanced” users & solver library writers
 OSKI-PETSc; Trilinos (Heroux)
 Adopted by ClearShape, Inc. for shipping product (2× speedup)
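A sketch of what an OSKI call sequence looks like, following the BLAS-style interface in the OSKI documentation; function and constant names are taken from that documentation, but treat exact signatures and required setup (e.g., library initialization) as illustrative:

#include <oski/oski.h>

/* Wrap an existing CSR matrix, declare the expected workload, let OSKI
   tune, then call SpMV through the tuned handle. */
void tuned_spmv_example(int *ptr, int *ind, double *val,
                        int nrows, int ncols,
                        double *x, double *y, int num_calls)
{
    oski_matrix_t A_tunable =
        oski_CreateMatCSR(ptr, ind, val, nrows, ncols,
                          SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t x_view = oski_CreateVecView(x, ncols, STRIDE_UNIT);
    oski_vecview_t y_view = oski_CreateVecView(y, nrows, STRIDE_UNIT);

    /* Hint the expected workload so run-time tuning can pay for itself. */
    oski_SetHintMatMult(A_tunable, OP_NORMAL, 1.0, SYMBOLIC_VEC,
                        0.0, SYMBOLIC_VEC, num_calls);
    oski_TuneMat(A_tunable);

    /* y <- A*x through whatever data structure the tuner selected. */
    oski_MatMult(A_tunable, OP_NORMAL, 1.0, x_view, 0.0, y_view);

    oski_DestroyVecView(x_view);
    oski_DestroyVecView(y_view);
    oski_DestroyMat(A_tunable);
}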
OSKI
How OSKI tunes (overview)
 Library install-time (offline): (1) build for the target architecture; (2) benchmark generated code variants, producing benchmark data
 Application run-time: from the matrix, the workload observed by program monitoring, and history, (1) evaluate heuristic models against the benchmark data, then (2) select a data structure & code
 To user: a matrix handle for kernel calls
OSKI
Heuristic model example: Select block size
 Idea: Hybrid off-line / run-time model
 Characterize machine with off-line benchmark
 Precompute Mflops(r, c) using a dense matrix, for all r, c
 Once per machine
 Estimate matrix properties at run-time
 Sample A to estimate Fill(r, c)
 Run-time “search”
 Select r, c to maximize Mflops(r, c) / Fill(r, c)
 Run-time costs ~ 40 SpMVs
 80%+ of that = time to convert to the new r×c format
OSKI
Accuracy of the tuning heuristics (1/4)
[Figure: SpMV performance with heuristic-selected block size vs. best and reference; DGEMV shown for comparison]
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
OSKI
Accuracy of the tuning heuristics (2/4)
[Figure: SpMV performance with heuristic-selected block size vs. best and reference; DGEMV shown for comparison]
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
OSKI
Tunable optimization techniques
 Optimizations for SpMV
 Register blocking (RB): up to 4× over CSR
 Variable block splitting: 2.1× over CSR, 1.8× over RB
 Diagonals: 2× over CSR
 Reordering to create dense structure + splitting: 2× over CSR
 Symmetry: 2.8× over CSR, 2.6× over RB
 Cache blocking: 3× over CSR
 Multiple vectors (SpMM): 7× over CSR
 And combinations…
 Sparse triangular solve
 Hybrid sparse/dense data structure: 1.8× over CSR
 Higher-level kernels
 AAᵀ·x or AᵀA·x: 4× over CSR, 1.8× over RB
 A²·x: 2× over CSR, 1.5× over RB
OSKI
Structural splitting for complex patterns
 Idea: Split A = A1 + A2 + …, and tune Ai independently
 Sample to detect “canonical” structures
 Saves time and/or storage (avoid fill)
The view from space
Example: Variable Block Row (Matrix #12)
 2.1× over CSR, 1.8× over RB
The view from space
Example: Row-segmented diagonals
 2× over CSR
The view from space
Dense sub-triangles for triangular solve
 Solve Tx = b for x, T triangular
 Raefsky4 (structural problem) + SuperLU + colmmd
 N = 19779, nnz = 12.6 M
 Dense trailing triangle: dim = 2268, 20% of total nonzeros
 Can be as high as 90+%!
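A minimal sketch of the hybrid sparse/dense idea, assuming a lower-triangular L whose trailing (n−k)×(n−k) triangle is dense; the storage layout and names are assumptions for illustration, not OSKI's actual kernel:

/* Solve L*x = b.  Rows 0..k-1 and the "border" columns (< k) of rows
   k..n-1 stay in CSR (ptr/ind/val, diagonal stored last in each sparse
   row); the dense trailing triangle is stored row-major in D. */
void hybrid_lower_trsv(int n, int k,
                       const int *ptr, const int *ind, const double *val,
                       const double *D, const double *b, double *x)
{
    int m = n - k;
    /* 1. Sparse forward substitution for rows 0..k-1. */
    for (int i = 0; i < k; ++i) {
        double s = b[i];
        for (int p = ptr[i]; p < ptr[i+1] - 1; ++p)
            s -= val[p] * x[ind[p]];
        x[i] = s / val[ptr[i+1] - 1];             /* diagonal entry */
    }
    /* 2. Border: fold x[0..k-1] into the right-hand sides of rows k..n-1. */
    for (int i = k; i < n; ++i) {
        double s = b[i];
        for (int p = ptr[i]; p < ptr[i+1]; ++p)   /* only columns < k here */
            s -= val[p] * x[ind[p]];
        x[i] = s;
    }
    /* 3. Dense forward substitution on the trailing triangle: no
          indirection, so it can be unrolled/blocked like dense TRSV. */
    for (int i = 0; i < m; ++i) {
        double s = x[k + i];
        for (int j = 0; j < i; ++j)
            s -= D[i*m + j] * x[k + j];
        x[k + i] = s / D[i*m + i];
    }
}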
The view from space
Cache optimizations for AAᵀ·x
 Idea: Interleave multiplication by A and Aᵀ:
   AAᵀx = [a₁ … aₙ] [a₁ᵀ; …; aₙᵀ] x = Σᵢ aᵢ (aᵢᵀ x)
   where each aᵢᵀx is a dot product and each aᵢ(aᵢᵀx) is an “axpy”
 Combine with register optimizations: aᵢ = r×c block row
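A minimal CSR-based sketch of the interleaving for the closely related AᵀA·x case, where aᵢ is row i of A; AAᵀ·x works the same way on the columns of A (e.g., CSC storage). Names are illustrative:

/* y = (A^T A) x for an m-by-n CSR matrix A; y has length n and is assumed
   zero-initialized.  For each row a_i: one dot product t = a_i^T x, then
   the axpy y += t * a_i, so each row of A is read only once. */
void ata_x(int m, const int *ptr, const int *ind, const double *val,
           const double *x, double *y)
{
    for (int i = 0; i < m; ++i) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; ++k)   /* dot product: a_i^T x */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; ++k)   /* axpy: y += t * a_i   */
            y[ind[k]] += val[k] * t;
    }
}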
The view from space
OSKI tunes for workloads
 Bi-conjugate gradients: an equal mix of A·x and Aᵀ·y
 3×1 blocks: A·x, Aᵀ·y = 1053, 343 Mflop/s → 517 Mflop/s combined
 3×3 blocks: A·x, Aᵀ·y = 806, 826 Mflop/s → 816 Mflop/s combined
 Higher-level operation: fused (A·x, Aᵀ·y) kernel
 3×1 blocks: 757 Mflop/s
 3×3 blocks: 1400 Mflop/s
 Workload tuning
 Evaluate weighted sums of empirical models
 Dynamic programming to evaluate alternatives
OSKI
Matrix powers kernel
 Idea: Serial sparse tiling (Strout, et al.)
 Extend to arbitrary A; combine with other techniques (e.g., blocking)
 Compute dependences, layer new data structure on (B)CSR
 Matrix powers (Aᵏ·x) with data structure transformations
 A²·x: up to 2× faster
 New latency-tolerant solvers? (Hoemmen’s thesis at Berkeley)
The view from space
Example: A²·x
 Let A be 5×5 tridiagonal
 Consider y = A²·x, computed as t = A·x, y = A·t
 Nodes: vector elements (x, t, y); edges: matrix elements aij
[Figure: dependence graph linking x, t, and y through the nonzeros of A]
The view from space
Example: A²·x
 Orange = everything needed to compute y1
 Reuse a11, a12
[Figure: same dependence graph with the inputs of y1 highlighted]
The view from space
Example: A²·x
 Grey = y2, y3
 Reuse a23, a33, a43
[Figure: same dependence graph with the inputs of y2, y3 highlighted]
The view from space
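A hypothetical sketch of the fused computation for the tridiagonal example above: the intermediate t = A·x is produced one element ahead of y = A·t, so entries of A and t are reused while still in cache. Array names and the dl/d/du diagonal layout are assumptions for illustration:

#include <stddef.h>

/* y = A^2 * x for an n-by-n tridiagonal A.
   dl[i] = A(i,i-1), d[i] = A(i,i), du[i] = A(i,i+1); out-of-range entries
   are never touched. */
void tridiag_A2x(size_t n, const double *dl, const double *d, const double *du,
                 const double *x, double *t, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        /* t[i] = (A*x)[i] */
        t[i] = d[i] * x[i];
        if (i > 0)     t[i] += dl[i] * x[i - 1];
        if (i + 1 < n) t[i] += du[i] * x[i + 1];

        /* y[i-1] = (A*t)[i-1] becomes computable as soon as t[i] is known */
        if (i > 0) {
            y[i - 1] = d[i - 1] * t[i - 1] + du[i - 1] * t[i];
            if (i > 1) y[i - 1] += dl[i - 1] * t[i - 2];
        }
    }
    if (n > 0) {  /* last row needs only t[n-2], t[n-1] */
        y[n - 1] = d[n - 1] * t[n - 1];
        if (n > 1) y[n - 1] += dl[n - 1] * t[n - 2];
    }
}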
Examples of OSKI’s early impact
 Integrating into major linear solver libraries
 PETSc
 Trilinos – R&D100 (Heroux)
 Early adopter: ClearShape, Inc.
 Core product: lithography process simulator
 2× speedup on full simulation after using OSKI
 Proof-of-concept: SLAC T3P accelerator design app
 SpMV dominates execution time
 Symmetry, 2×2 block structure
 2× speedups
OSKI
OSKI-PETSc Performance: Accel. Cavity
OSKI
Outline
 A case for sparse kernel autotuning
 OSKI
 Autotuning compiler
 Sparse kernels aren’t my bottleneck…
 Summary and future directions
Strengths and limits of the library approach
 Strengths
 Isolates optimization in the library for portable performance
 Exploits domain-specific information aggressively
 Handles run-time tuning naturally
 Limitations
 “Generation Me”: What about my application?
 Run-time tuning: run-time overheads
 Limited context for optimization (w/o delayed evaluation)
 Limited extensibility (fixed interfaces)
An autotuning compiler
A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)
An autotuning compiler
OSKI’s place in the tuning framework
An autotuning compiler
An empirical tuning framework using ROSE
[Figure: empirical tuning framework built on ROSE, fed by profiling tools (gprof, HPCtoolkit, Open SpeedShop), generating parameterized code via POET, and driven by a search engine]
An autotuning compiler
What is ROSE?
 Research: Optimize use of high-level abstractions
 Lead: Dan Quinlan at LLNL
 Target: DOE apps
 Study application-specific analysis and optimization
 Extend traditional compiler optimizations to abstraction use
 Performance portability via empirical tuning
 Infrastructure: Tool for building source-to-source tools
 Full compiler: basic analysis, loop optimizer, OpenMP
 Support for C, C++; Fortran 90 in progress
 Targets a “non-compiler audience”
 Open-source
An autotuning compiler
A compiler-based autotuning framework
 Guiding philosophy
 Leverage external stand-alone components
 Provide open components and tools for community
 User or “system” profiles to collect data and/or analyses
 In ROSE
 Profile-based analysis to find targets
 Extract (“outline”) targets
 Make a “benchmark” using checkpointing
 Generate parameterized target (not yet automated)
 Independent search engine performs the search
An autotuning compiler
Case study: Loop optimizations for SMG2000
 SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)
 Residual computation has an SpMV bottleneck
 Loop below looks simple but is non-trivial to extract

for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -= A[i + j*JA + k*KA + SA[si]]
                            * x[i + j*JX + k*KX + Sx[si]];
Before transformation
for (si = 0; si < NS; si++) { /* Loop1 */
  for (kk = 0; kk < NZ; kk++) { /* Loop2 */
    for (jj = 0; jj < NY; jj++) { /* Loop3 */
      for (ii = 0; ii < NX; ii++) { /* Loop4 */
        r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]]
                               * x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop4 */
    } /* Loop3 */
  } /* Loop2 */
} /* Loop1 */
An autotuning compiler
After transformation, including interchange,
unrolling, and prefetching
for (kk = 0; kk < NZ; kk++) { /* Loop2 */
  for (jj = 0; jj < NY; jj++) { /* Loop3 */
    for (si = 0; si < NS; si++) { /* Loop1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) { /* core Loop4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop4 */
      for ( ; ii < NX; ii++) { /* fringe Loop4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop4 */
    } /* Loop1 */
  } /* Loop3 */
} /* Loop2 */
An autotuning compiler
Loop optimizations for SMG2000
 2× speedup on the kernel from specialization, loop interchange, unrolling, and prefetching
 But only 1.25× overall—multiple bottlenecks
 Lesson: Need complex sequences of transformations
 Use profiling to guide
 Inspect run-time data for specialization
 Transformations are automatable
An autotuning compiler
Generating parameterized representations
(w/ Q. Yi at UTSA)
 Generate parameterized representation of target
 POET: Embedded scripting language for expressing
parameterized code variations [see IPDPS/POHLL’07]
 ROSE’s loop optimizer will generate POET for each target
 Hand-coded POET for SMG2000
 Interchange
 Machine-specific: Unrolling, prefetching
 Source-specific: register & restrict keywords, C pointer idiom
 Related: New parameterization for loop fusion [w/ Zhao,
Kennedy : Rice, Yi : UTSA]
An autotuning compiler
Outline
 A case for sparse kernel autotuning
 OSKI
 Autotuning compiler
 Summary and future directions
General theme: Aggressively exploit structure
 Application- and architecture-specific optimization
 E.g., sparse matrix patterns
 Robust performance in spite of architecture-specific peculiarities
 Augment static models with benchmarking and search
 Short-term OSKI extensions
 Integrate into large-scale apps
 Accelerator design, plasma physics (DOE)
 Geophysical simulation based on Block Lanczos (AᵀA·X; LBL)
 Evaluate emerging architectures (e.g., vector micros, CELL)
 Other kernels: Matrix triple products
 Parallelism (OSKI-PETSc)
View from space (revisited)
Current and future research directions
 Autotuning: Robust portable performance?
 End-to-end autotuning compiler framework
 New domains: PageRank, crypto kernels
 Tuning for novel architectures (e.g., multicore)
 Tools for generating domain-specific libraries
 Performance modeling: Limits and insight?
 Kernel- and machine-specific analytical and statistical models
 Hybrid symbolic/empirical modeling
 Implications for applications and architectures?
 Debugging massively parallel applications
 JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]
 Static/dynamic analyses for debugging MPI
View from space (revisited)
End
Download