OSKI: A library of automatically tuned sparse matrix kernels

Autotuning sparse matrix kernels
Richard Vuduc
Center for Applied Scientific Computing (CASC)
Lawrence Livermore National Laboratory
February 28, 2007
Predictions (2003)
• Need for “autotuning” will increase over time
  - Improve performance for a given app & machine using automated experiments
• Example: Sparse matrix-vector multiply (SpMV), 1987 to present
  - Untuned: 10% of peak or less, decreasing
  - Tuned: 2x speedup, increasing over time
• Tuning is getting harder (qualitative)
  - More complex machines & workloads
  - Parallelism
[Figure: Trends in uniprocessor SpMV performance (Mflop/s), pre-2004]
[Figure: Trends in uniprocessor SpMV performance (fraction of peak)]
Is tuning getting easier?
// y <-- y + A*x
for all A(i,j):
    y(i) += A(i,j) * x(j)
// Compressed sparse row (CSR)
for each row i:
    t = y[i]
    for k = ptr[i] to ptr[i+1]-1:
        t += A[k] * x[J[k]]
    y[i] = t
• Exploit 8x8 dense blocks
• As r x c ↑, Mflop/s ↑
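For readers who want the CSR loop in compilable form, here is a minimal C sketch. It is not OSKI's implementation; the function name `spmv_csr`, the argument order, and the 0-based indexing are assumptions made for illustration.

```c
#include <assert.h>

/* y <- y + A*x, with A stored in compressed sparse row (CSR) format.
 * ptr has num_rows+1 entries; ind[k] and val[k] hold the column index
 * and value of the k-th stored nonzero. Hypothetical helper, 0-based. */
static void spmv_csr(int num_rows, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < num_rows; i++) {
        assert(ptr[i] <= ptr[i + 1]);   /* row pointers must be non-decreasing */
        double t = y[i];                /* accumulate into the existing y(i) */
        for (int k = ptr[i]; k < ptr[i + 1]; k++)
            t += val[k] * x[ind[k]];    /* indirect access to x via ind[] */
        y[i] = t;
    }
}
```

The indirect access `x[ind[k]]` and the one-index-per-nonzero storage are what the blocked formats later in the talk try to amortize.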
Speedups on Itanium 2: The need for search
[Figure: SpMV Mflop/s by block size. Best: 4x2 (31.1% of peak); Reference (7.6% of peak)]
[Figure: SpMV performance, raefsky3]
Better, worse, or about the same?
Itanium 2, 900 MHz → 1.3 GHz
* Reference improves *
* Best possible worsens slightly *
Pentium M → Core 2 Duo (1-core)
* Reference & best improve; relative speedup improves (~1.4 to 1.6) *
* Note: Best fraction of peak decreased from 11% → 9.6% *
Power4 → Power5
* Reference worsens! *
* Relative importance of tuning increases *
A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)
Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work
OSKI: Optimized Sparse Kernel Interface
• Autotuned kernels for user’s matrix & machine
  - BLAS-style interface: mat-vec (SpMV), tri. solve (TrSV), …
  - Hides complexity of run-time tuning
  - Includes fast locality-aware kernels: A^T*A*x, …
• Faster than standard implementations
  - Standard SpMV < 10% peak, vs. up to 31% with OSKI
  - Up to 4x faster SpMV, 1.8x TrSV, 4x A^T*A*x, …
• For “advanced” users & solver library writers
  - PETSc extension available (OSKI-PETSc)
  - Kokkos (for Trilinos) by Heroux
  - Adopted by ClearShape, Inc. for shipping product (2x speedup)
Tunable matrix-specific optimization techniques
• Optimizations for SpMV
  - Register blocking (RB): up to 4x over CSR
  - Variable block splitting: 2.1x over CSR, 1.8x over RB
  - Diagonals: 2x over CSR
  - Reordering to create dense structure + splitting: 2x over CSR
  - Symmetry: 2.8x over CSR, 2.6x over RB
  - Cache blocking: 3x over CSR
  - Multiple vectors (SpMM): 7x over CSR
  - And combinations…
• Sparse triangular solve
  - Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  - A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  - A^k*x: 2x over CSR, 1.5x over RB
Tuning for workloads
• Bi-conjugate gradients: equal mix of A*x and A^T*y
  - 3x1: A*x, A^T*y = 1053, 343 Mflop/s → 517 Mflop/s
  - 3x3: A*x, A^T*y = 806, 826 Mflop/s → 816 Mflop/s
• Higher-level operation: (A*x, A^T*y) kernel
  - 3x1: 757 Mflop/s
  - 3x3: 1400 Mflop/s
• Matrix powers (A^k*x) with data structure transformations
  - A^2*x: up to 2x faster
  - New latency-tolerant solvers? (Hoemmen’s thesis, ongoing at UCB)
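The combined bi-conjugate gradient figures above appear consistent with a harmonic mean. If the workload does equal flops in each of the two kernels, total time is the sum of the two times, so the effective rate is 2 / (1/r1 + 1/r2). A small sketch, with a hypothetical function name, that reproduces the arithmetic:

```c
#include <assert.h>

/* Effective Mflop/s of a workload that performs the same number of flops
 * in two kernels running at r1 and r2 Mflop/s (harmonic mean).
 * Hypothetical helper for illustration only. */
static double combined_rate(double r1, double r2)
{
    assert(r1 > 0.0 && r2 > 0.0);
    return 2.0 / (1.0 / r1 + 1.0 / r2);
}
```

With the slide's inputs, `combined_rate(1053, 343)` evaluates to about 517.4 and `combined_rate(806, 826)` to about 815.9. The slower kernel dominates the mix, which is why the balanced 3x3 blocking wins for this workload even though 3x1 is faster for A*x alone.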
How OSKI tunes (Overview)
Library Install-Time (offline)
  1. Build for target arch. (generated code variants)
  2. Benchmark (produces benchmark data)
Application Run-Time
  Inputs: matrix, workload from program monitoring, history, benchmark data, heuristic models
  1. Evaluate models
  2. Select data struct. & code
  To user: matrix handle for kernel calls
Extensibility: Advanced users may write & dynamically add “Code variants” and “Heuristic models” to system.
OSKI’s place in the tuning framework
Examples of OSKI’s early impact
• Early adopter: ClearShape, Inc.
  - Core product: lithography simulator
  - 2x speedup on full simulation after using OSKI
• Proof-of-concept: SLAC T3P accelerator cavity design simulator
  - SpMV dominates execution time
  - Symmetry, 2x2 block structure
  - 2x speedups
OSKI-PETSc Performance: Accel. Cavity
Strengths and limitations of the library approach
• Strengths
  - Isolates optimization in the library for portable performance
  - Exploits domain-specific information aggressively
  - Handles run-time tuning naturally
• Limitations
  - “Generation Me”: What about my application and its abstractions?
  - Run-time tuning: run-time overheads
  - Limited context for optimization (without delayed evaluation)
  - Limited extensibility (fixed interfaces)
Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work
Tour of application-specific optimizations
• Five case studies
• Common characteristics
  - Complex code
    - Heavy use of abstraction
    - Use generated code (e.g., SWIG C++/Python bindings)
  - Benefit from extensive code and data restructuring
  - Multiple bottlenecks
[1] Loop transformations for SMG2000
• SMG2000 implements semi-coarsening multigrid on structured grids (ASC Purple benchmark)
  - Residual computation has an SpMV bottleneck
  - Loop below looks simple but is non-trivial to extract
for (si = 0; si < NS; ++si)
  for (k = 0; k < NZ; ++k)
    for (j = 0; j < NY; ++j)
      for (i = 0; i < NX; ++i)
        r[i + j*JR + k*KR] -=
          A[i + j*JA + k*KA + SA[si]]
          * x[i + j*JX + k*KX + Sx[si]];
[1] SMG2000 demo
[1] Before transformation
for (si = 0; si < NS; si++) { /* Loop1 */
  for (kk = 0; kk < NZ; kk++) { /* Loop2 */
    for (jj = 0; jj < NY; jj++) { /* Loop3 */
      for (ii = 0; ii < NX; ii++) { /* Loop4 */
        r[ii + jj*Jr + kk*Kr] -= A[ii + jj*JA + kk*KA + SA[si]]
                               * x[ii + jj*Jx + kk*Kx + Sx[si]];
      } /* Loop4 */
    } /* Loop3 */
  } /* Loop2 */
} /* Loop1 */
[1] After transformation, including interchange, unrolling, and prefetching
for (kk = 0; kk < NZ; kk++) { /* Loop2 */
  for (jj = 0; jj < NY; jj++) { /* Loop3 */
    for (si = 0; si < NS; si++) { /* Loop1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) { /* core Loop4 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core Loop4 */
      for ( ; ii < NX; ii++) { /* fringe Loop4 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe Loop4 */
    } /* Loop1 */
  } /* Loop3 */
} /* Loop2 */
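One way to gain confidence in a transformation like this is to run the reference and transformed loop nests on the same small input and compare the results. The sketch below is a simplified, portable stand-in and not the SMG2000 code: it uses a single stride pair (J, K) for all three arrays, omits prefetching, and the function names are hypothetical.

```c
#include <assert.h>

/* Reference loop nest: stencil index si outermost. */
static void residual_ref(double *r, const double *A, const double *x,
                         int NX, int NY, int NZ, int NS,
                         int J, int K, const int *SA, const int *Sx)
{
    for (int si = 0; si < NS; si++)
        for (int kk = 0; kk < NZ; kk++)
            for (int jj = 0; jj < NY; jj++)
                for (int ii = 0; ii < NX; ii++)
                    r[ii + jj*J + kk*K] -= A[ii + jj*J + kk*K + SA[si]]
                                         * x[ii + jj*J + kk*K + Sx[si]];
}

/* Same computation after loop interchange (si moved inside kk, jj),
 * pointer-based addressing, and unrolling the ii loop by 3. */
static void residual_opt(double *r, const double *A, const double *x,
                         int NX, int NY, int NZ, int NS,
                         int J, int K, const int *SA, const int *Sx)
{
    assert(NX >= 0 && NY >= 0 && NZ >= 0 && NS >= 0);
    for (int kk = 0; kk < NZ; kk++) {
        for (int jj = 0; jj < NY; jj++) {
            for (int si = 0; si < NS; si++) {
                double *rp = r + kk*K + jj*J;
                const double *Ap = A + kk*K + jj*J + SA[si];
                const double *xp = x + kk*K + jj*J + Sx[si];
                int ii;
                for (ii = 0; ii <= NX - 3; ii += 3) {   /* core loop */
                    rp[0] -= Ap[0] * xp[0];
                    rp[1] -= Ap[1] * xp[1];
                    rp[2] -= Ap[2] * xp[2];
                    rp += 3; Ap += 3; xp += 3;
                }
                for (; ii < NX; ii++) {                 /* fringe loop */
                    rp[0] -= Ap[0] * xp[0];
                    rp++; Ap++; xp++;
                }
            }
        }
    }
}
```

Each element of r still receives its updates in increasing si order, so the interchange does not reorder the floating-point operations applied to any single element.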
[1] Loop transformations for SMG2000
• 2x speedup on kernel from specialization, loop interchange, unrolling, prefetching
  - But only 1.25x overall: multiple bottlenecks
• Lesson: Need complex sequences of transformations
  - Use profiling to guide
  - Inspect run-time data for specialization
  - Transformations are automatable
• Research topic: Automated specialization of hypre?
[2] Slicing and dicing W3P
• Accelerator design code from SLAC
• calcBasis() very expensive
• Scaling problems as |Eigensystem| grows
• In principle, loop interchange or precomputation via slicing possible
• Challenges in practice
  - “Loop nest” ~ 500+ LOC
  - 150+ LOC to calcBasis()
  - calcBasis() in 6-deep call chain, 4-deep loop nest, 2 conditionals
  - File I/O
  - Changes must be unobtrusive
/* Post-processing phase */
foreach mode in Eigensystem
  foreach elem in Mesh
    // { …
    b = calcBasis (elem)
    // }
    f = calcField (b, mode)
    writeDataToFiles (…);
[2] W3P: Impact and lessons
• 4-5x speedup for post-processing step; 1.5x overall
  - Changes “checked-in”
• Lesson: Need clean source-level transformations
  - To automate, need robust program analysis and developer guidance
• Research: Annotation framework for developers
  - [w/ Quinlan, Schordan, Yi: POHLL’06]
[3] Structure splitting
• Convert (array of structs) into (struct of arrays)
  - Improve spatial locality through increased stride-1 accesses
  - Make code hardware-prefetch and vector/SIMD unit “friendly”
/* Before: array of structs */
struct Type {
  double p;
  double x, y, z;
  double E;
  int k;
} X[N], Y[N];
for (i = 0; i < N; i++)
  Y[i].E += sqrt (Y[X[i].k].p);
/* After: one array per field */
double Xp[N];
double Xx[N], Xy[N], Xz[N];
double XE[N];
int Xk[N];
// … same for Y …
for (i = 0; i < N; i++)
  YE[i] += sqrt (Yp[Xk[i]]);
[3] Structure splitting: Impact and challenges
• 2x speedup on a KULL benchmark (suggested by Brian Miller)
• Implementation challenges
  - Potentially affects entire code
  - Can apply only locally, at a cost
    - Extra storage
    - Overhead of copying
  - Tedious to do by hand
• Lesson: Extensive data restructuring may be necessary
• Research: When and how best to split?
[4] Finding a loop-fusion needle in a haystack
• Interprocedural loop fusion finder [w/ B. White : Cornell U.]
  - Known example had 2x speedup on benchmark (Miller)
  - Built “abstraction-aware” analyzer using ROSE
  - First pass: Associate “loop signatures” with each function
  - Second pass: Propagate signatures through call chains
for (Zone::iterator z = zones.begin (); z != zones.end (); ++z)
  for (Corner::iterator c = (*z).corners().begin (); …)
    for (int s = 0; s < c->sides().size(); s++)
      …
• Found 6 examples of 3- and 4-deep nested loops
• “Analysis-only” tool
  - Finds, though does not verify/transform
• Lesson: “Classical” optimizations relevant to abstraction use
• Research
  - Recognizing and optimizing abstractions [White’s thesis, ongoing]
  - Extending traditional optimizations to abstraction use
[5] Aggregating messages (ongoing)
• Idea: Merge sends (suggested by Miller)
/* Before: two separate all-to-all exchanges */
DataType A;
// … operations on A …
A.allToAll();
// …
DataType B;
// … operations on B …
B.allToAll();
/* After: one aggregated exchange */
DataType A;
// … operations on A …
// …
DataType B;
// … operations on B …
bulkAllToAll(A, B);
• Implementing a fully automated translator to find and transform
• Research: When and how best to aggregate?
Summary of application-specific optimizations
• Like library-based approach, exploit knowledge for big gains
  - Guidance from developer
  - Use run-time information
• Would benefit from automated transformation tools
  - Real code is hard to process
  - Changes may become part of software re-engineering
  - Need robust analysis and transformation infrastructure
  - Range of tools possible: analysis and/or transformation
• No silver bullets or magic compilers
Outline
• Motivation
• OSKI: An autotuned sparse kernel library
• Application-specific optimization “in the wild”
• Toward end-to-end application autotuning
• Summary and future work
A framework for performance tuning
Source: SciDAC Performance Engineering Research Institute (PERI)
OSKI’s place in the tuning framework
An empirical tuning framework using ROSE
[Figure: empirical tuning framework using ROSE. Components shown: gprof, HPCToolkit, Open SpeedShop, POET, search engine]
An end-to-end autotuning framework using ROSE
• Guiding philosophy
  - Leverage external stand-alone components
  - Provide open components and tools for community
• User or “system” profiles to collect data and/or analyses
• In ROSE
  - Mark-up AST with data/analysis, to identify optimizable target(s)
  - Outline target into stand-alone dynamically loadable library routine
  - Make “benchmark” by inserting checkpoint library calls into app
  - Generate parameterized representation of target
• Independent search engine performs search
Interfaces to performance tools
• Mark-up AST with data, analysis, to identify optimizable target(s)
  - gprof
  - HPCToolkit [Mellor-Crummey : Rice]
  - VizzAnalyzer / Vizz3D [Panas : LLNL]
  - In progress: Open SpeedShop [Schulz : LLNL]
• Needed: Analysis to identify targets
Outlining
• Outline target into dynamically loadable library routine
  - Extends initial implementations by Liao [U. Houston], Jula [TAMU]
• Handles many details of C & C++
  - Wraps up variables, inserts declarations, generates call
  - Produces suitable interfaces for dynamic loading
  - Handles non-local control flow
void OUT_38725__ (double* r, int JR, int KR,
                  const double* A, …) {
  int si, j, k, i;
  for (si = 0; si < NS; si++)
    …
    r[i + j*JR + k*KR] -= A[i + …
Making a benchmark
• Make “benchmark” by inserting checkpoint library calls
  - Measure application behavior “in context”
  - Use ckpt (user-level) [Zander : U. Wisc.]
  - Insert timing code (cycle counter)
  - May insert arbitrary code to distinguish calling contexts
• Reasonably fast in practice
  - Checkpoint read/write bandwidth: 500 MB/s on my Pentium-M
  - For SMG2000: Problem consuming ~500 MB footprint takes ~30s to run
• Needed
  - Best procedure to get accurate and fair comparisons?
  - Do restarts resume in comparable states?
  - More portable checkpoint library
Example of “benchmark” (pseudo)code
static int num_calls = 0; // no. of invocations of outlined code
if (!num_calls)
{
  ckpt (); // Checkpoint/resume
  OUT_38725__ = dlsym (…); // Load an implementation
  startTimer ();
}
OUT_38725__ (…); // outlined call-site
if (++num_calls == CALL_LIMIT)
{ // Measured CALL_LIMIT calls
  stopTimer ();
  outputTime ();
  exit (0);
}
Generating parameterized representations
• Generate parameterized representation of target
  - POET: Embedded scripting language for expressing parameterized code variations [see POHLL’07]
  - Loop optimizer will generate POET for each target
• Hand-coded POET for SMG2000
  - Interchange
  - Machine-specific: Unrolling, prefetching
  - Source-specific: register & restrict keywords, C pointer idiom
• New parameterization for loop fusion [Zhao, Kennedy : Rice, Yi : UTSA]
SMG2000 kernel POET instantiation
for (kk = 0; kk < NZ; kk++) { /* L4 */
  for (jj = 0; jj < NY; jj++) { /* L3 */
    for (si = 0; si < NS; si++) { /* L1 */
      double* rp = r + kk*Kr + jj*Jr;
      const double* Ap = A + kk*KA + jj*JA + SA[si];
      const double* xp = x + kk*Kx + jj*Jx + Sx[si];
      for (ii = 0; ii <= NX-3; ii += 3) { /* core L2 */
        _mm_prefetch (Ap + PFD_A, _MM_HINT_NTA);
        _mm_prefetch (xp + PFD_X, _MM_HINT_NTA);
        rp[0] -= Ap[0] * xp[0];
        rp[1] -= Ap[1] * xp[1];
        rp[2] -= Ap[2] * xp[2];
        rp += 3; Ap += 3; xp += 3;
      } /* core L2 */
      for ( ; ii < NX; ii++) { /* fringe L2 */
        rp[0] -= Ap[0] * xp[0];
        rp++; Ap++; xp++;
      } /* fringe L2 */
    } /* L1 */
  } /* L3 */
} /* L4 */
Search
• We are search-engine agnostic
• Many possible hybrid modeling/search techniques
Summary of autotuning compiler approach
• End-to-end framework leverages existing work
• ROSE provides a heavy-duty (robust) source-level infrastructure
• Assemble stand-alone components
• Current and future work
  - Assembling a more complete end-to-end example
  - Interfaces between components?
  - Extending basic ROSE infrastructure, particularly program analysis
Current and future research directions
• Autotuning
  - End-to-end autotuning compiler framework
  - Tuning for novel architectures (e.g., multicore)
  - Tools for generating domain-specific libraries
• Performance modeling
  - Kernel- and machine-specific analytical and statistical models
  - Hybrid symbolic/empirical modeling
  - Implications for applications and architectures?
• Tools for debugging massively parallel applications
  - JitterBug [w/ Schulz, Quinlan, de Supinski, Saebjoernsen]
  - Static/dynamic analyses for debugging MPI
End
What is ROSE?
• Research: Develop techniques to optimize applications that rely heavily on high-level abstractions
  - Target scientific computing apps relevant to DOE/LLNL
  - Domain-specific analysis and optimization
  - Optimize use of object-oriented abstractions
  - Performance portability via empirical tuning
• Infrastructure: Tool for building source-to-source optimizers
  - Full compiler: basic program analysis, loop optimizer, OpenMP [UH]
  - Support for C, C++; Fortran90 in progress
  - Target “non-compiler audience”
  - Open-source
Bug hunting in MPI programs
• Motivation: MPI is a large, complex API
• Bug pattern detectors
  - Check basic API usage
  - Adapt existing tools: MPI-CHECK; FindBugs; Farchi, et al. VC’05
• Tasks requiring deeper program analysis
  - Properly matched sends/receives, barriers, collectives
  - Buffer errors, e.g., overruns, read before non-blocking op completes
  - Temporal usage properties
  - See error survey by DeSouza, Kuhn, & de Supinski ’05
  - Extend existing analyses by Shires, et al., PDPTA’99; Strout, et al. ICPP’06
Compiler-based testing tools
• Instrumentation and dynamic analysis to measure coverage [IBM]
• Measurement-unit validation via Osprey [Jiang and Su, UC Davis]
• Numerical interval/bounds analysis [Sun]
• Interface to MOPS model-checker [Collingbourne, Imperial College]
• Interactive program visualization via VizzAnalyzer [Panas, LLNL]
Better, worse, or about the same?
Pentium 4, 1.5 GHz → Xeon, 3.2 GHz
* Faster, but relative improvement increases (20% → ~50%) *
Problem-Specific Performance
Tuning
BCSR Captures Regularly Aligned Blocks
• n = 21216
• nnz = 1.5 M
• Source: NASA structural analysis problem
• 8x8 dense substructure
• Reduces storage
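To make the register-blocking idea concrete, here is a minimal sketch of SpMV for a fixed 2x2 block CSR (BCSR) layout. It is an illustration under stated assumptions, not OSKI's generated code: the function name, the row-major ordering of values inside each block, and the convention that `bind[k]` stores the first column of block k are all choices made for this example.

```c
#include <assert.h>

/* y <- y + A*x with A in 2x2 block CSR (BCSR).
 * bptr has num_block_rows+1 entries; bind[k] is the first column covered by
 * block k; bval stores each block's 4 values contiguously, row-major.
 * Hypothetical layout chosen for illustration. */
static void spmv_bcsr_2x2(int num_block_rows, const int *bptr, const int *bind,
                          const double *bval, const double *x, double *y)
{
    for (int I = 0; I < num_block_rows; I++) {
        assert(bptr[I] <= bptr[I + 1]);
        double y0 = y[2*I], y1 = y[2*I + 1];   /* destination kept in registers */
        for (int k = bptr[I]; k < bptr[I + 1]; k++) {
            const double *b = bval + 4*k;
            const double x0 = x[bind[k]];      /* one index lookup per block */
            const double x1 = x[bind[k] + 1];
            y0 += b[0]*x0 + b[1]*x1;
            y1 += b[2]*x0 + b[3]*x1;
        }
        y[2*I] = y0;
        y[2*I + 1] = y1;
    }
}
```

Compared with CSR, this stores one column index per four values and reuses each loaded x element twice. The cost is that any zero inside a stored block must be kept explicitly, which is the fill discussed on the following slides.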
Problem: Forced Alignment Implies UBCSR
• BCSR(2x2)
  - Stored / true nz = 1.24
• BCSR(3x3)
  - Stored / true nz = 1.46
  - Forces i mod 3 = j mod 3 = 0
• Unaligned BCSR (UBCSR) format:
  - Store row indices
The Speedup Gap: BCSR vs. CSR
[Figure: speedup of BCSR over CSR, by machine, showing a 1.1x to 1.5x gap]
Approach: Splitting + Relaxed Block Alignment
• Goal: Close the gap between FEM classes
• Our approach: Capture actual structure more precisely
  - Split: A = A1 + A2 + … + As
  - Store each Ai in unaligned BCSR (UBCSR) format
  - Relax both row and column alignment
    - Buttari, et al. (2005) show improvements from relaxed column alignment
• 2.1x over no blocking, 1.8x over blocking
• When not faster than BCSR, may still reduce storage
Variable Block Row (VBR) Analysis
• Partition by grouping consecutive rows/columns having same pattern
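The row half of this grouping can be sketched in a few lines of C. This is only an illustration of exact-match grouping: it assumes CSR input with column indices sorted within each row, it groups rows only (a full VBR analysis would also group columns), and the function names are hypothetical.

```c
#include <assert.h>

/* Returns 1 if rows i and j of a CSR pattern have identical column indices.
 * Assumes indices are sorted within each row. */
static int same_pattern(const int *ptr, const int *ind, int i, int j)
{
    int len = ptr[i + 1] - ptr[i];
    if (ptr[j + 1] - ptr[j] != len)
        return 0;
    for (int k = 0; k < len; k++)
        if (ind[ptr[i] + k] != ind[ptr[j] + k])
            return 0;
    return 1;
}

/* Partition rows into maximal runs of consecutive rows with the same pattern.
 * On return, group g spans rows group_start[g] .. group_start[g+1]-1.
 * group_start must have room for num_rows+1 entries. */
static int group_rows(int num_rows, const int *ptr, const int *ind,
                      int *group_start)
{
    assert(num_rows >= 0);
    int ngroups = 0;
    for (int i = 0; i < num_rows; i++)
        if (i == 0 || !same_pattern(ptr, ind, i - 1, i))
            group_start[ngroups++] = i;
    group_start[ngroups] = num_rows;
    return ngroups;
}
```

The sizes of the resulting groups give candidate row block sizes.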
From VBR, Identify Multiple Natural Block Sizes
VBR with Fill
• Can also pad by matching rows/columns with nearly similar patterns
• Define VBR(q) = VBR where consecutive rows grouped when “similarity” ≥ q
  - 0 ≤ q ≤ 1
[Figure: VBR with fill, q = 1 vs. q = 0.7 (fill of 1%)]
A Complex Tuning Problem
• Many parameters need “tuning”
  - Fill threshold, 0.5 ≤ q ≤ 1
  - Number of splittings, 2 ≤ s ≤ 4
  - Ordering of block sizes, r_i x c_i; r_s x c_s = 1x1
• See paper in HPCC 2005 for proof-of-concept experiments based on a semi-exhaustive search
  - Heuristic in progress (uses Buttari, et al. (2005) work)
Matrices (FEM 2)
Matrix       | Description       | Dimension | # non-zeros | Dominant blocks
10-ct20stif  | Engine block      | 52k       | 2.7M        | 6x6 (39%), 3x3 (15%)
12-raefsky4  | Buckling          | 20k       | 1.3M        | 3x3 (96%)
13-ex11      | Fluid flow        | 16k       | 1.1M        | 1x1 (38%), 3x3 (23%)
15-Vavasis3  | 2D PDE            | 41k       | 1.7M        | 2x1 (81%), 2x2 (19%)
17-rim       | Fluid flow        | 23k       | 1.0M        | 1x1 (75%), 3x1 (12%)
A-bmw7st_1   | Car chassis       | 141k      | 7.3M        | 6x6 (82%)
B-cop20k_m   | Accel. Cavity     | 121k      | 4.8M        | 2x1 (26%), 1x2 (26%), 1x1 (26%), 2x2 (22%)
C-pwtk       | Wind tunnel       | 218k      | 11.6M       | 6x6 (94%)
D-rma10      | Charleston Harbor | 47k       | 2.4M        | 2x2 (17%), 3x2 (15%), 2x3 (15%), 4x2 (9%), 2x4 (9%)
E-s3dkqm4    |                   | 90k       | 4.8M        | 6x6 (99%)
Power 4 Performance
Storage Savings
Traveling Salesman Problem-based Reordering
• Application: Stanford accelerator design problem (Omega3P)
• Reorder by approximately solving TSP [Pinar & Heath ’97]
  - Nodes = columns of A
  - Weights(u, v) = no. of nz u, v have in common
  - Tour = ordering of columns
  - Choose maximum weight tour
• Also: symmetric storage, register blocking
• Manually selected optimizations
• Problem: High cost of computing approximate solution to TSP
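The edge weight in this TSP formulation can be computed with a sorted-list intersection. The sketch below reads “no. of nz u, v have in common” as the number of rows in which both columns have a nonzero, which is an interpretation and not stated on the slide. It assumes the pattern is available in compressed sparse column (CSC) form with sorted row indices; the function name is hypothetical.

```c
#include <assert.h>

/* Weight(u, v): number of rows in which columns u and v both hold a nonzero.
 * colptr/rowind describe the CSC pattern; row indices are sorted per column. */
static int column_overlap(const int *colptr, const int *rowind, int u, int v)
{
    assert(u >= 0 && v >= 0);
    int a = colptr[u], a_end = colptr[u + 1];
    int b = colptr[v], b_end = colptr[v + 1];
    int w = 0;
    while (a < a_end && b < b_end) {      /* merge-style intersection */
        if (rowind[a] < rowind[b])
            a++;
        else if (rowind[a] > rowind[b])
            b++;
        else {
            w++; a++; b++;
        }
    }
    return w;
}
```

Placing heavily overlapping columns next to each other is what creates the dense blocks that register blocking can then exploit.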
100x100 Submatrix Along Diagonal
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red
After: Green + Blue
Inter-Iteration Sparse Tiling
• Idea: Strout, et al., ICCS 2001
• Let A be 5x5 tridiagonal
• Consider y = A^2*x
  - t = A*x, y = A*t
• Nodes: vector elements (x1 to x5, t1 to t5, y1 to y5)
• Edges: matrix elements a_ij
• Orange = everything needed to compute y1
  - Reuse a11, a12
• Grey = y2, y3
  - Reuse a23, a33, a43
Serial Sparse Tiling Performance (Itanium 2)
OSKI Software Architecture and API
Empirical Model Evaluation
• Tuning loop
  - Compute a “tuning time budget” based on workload
  - While (time remains and no tuning chosen)
    - Try a heuristic
• Heuristic for blocked SpMV: Choose r x c to minimize
    predicted time(A,r,c) = estimated flops(A,r,c) / benchmark Mflop/s(r,c)
• Tuning for workloads
  - Weighted sums of empirical models
  - Dynamic programming for alternatives
    - Example: Combined y = A^T*A*x vs. separate (w = A*x, y = A^T*w)
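The block-size heuristic can be sketched as a small selection loop. This is an illustration, not OSKI's code. It assumes the estimated flop count is 2 * nnz * fill, where fill is the stored/true nonzero ratio for that block size, and that the benchmark rate for each r x c was measured at install time. The struct, the function name, and any Mflop/s values fed to it are hypothetical.

```c
#include <assert.h>

/* One candidate block size with its estimated fill ratio for this matrix
 * and the install-time benchmark rate for a dense matrix in that format. */
typedef struct {
    int r, c;
    double fill;     /* stored nonzeros / true nonzeros, >= 1 */
    double mflops;   /* benchmark Mflop/s for r x c blocking */
} block_candidate;

/* Returns the index of the candidate with the smallest predicted time:
 *   predicted time = estimated flops / benchmark rate. */
static int choose_block_size(const block_candidate *cand, int n, double nnz)
{
    assert(n > 0);
    int best = 0;
    double best_time = -1.0;
    for (int i = 0; i < n; i++) {
        assert(cand[i].mflops > 0.0);
        double est_flops = 2.0 * nnz * cand[i].fill;   /* includes filled-in zeros */
        double t = est_flops / (cand[i].mflops * 1e6);
        if (best_time < 0.0 || t < best_time) {
            best_time = t;
            best = i;
        }
    }
    return best;
}
```

A larger block can lose despite a higher benchmark rate if its fill grows faster than its speed, which is the trade-off the model is meant to capture.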
Cost of Tuning
• Non-trivial run-time cost: up to ~40 mat-vecs
  - Dominated by conversion time (~80%)
• Design point: user calls “tune” routine explicitly
  - Exposes cost
  - Tuning time limited using estimated workload
    - Provided by user or inferred by library
• User may save tuning results
  - To apply on future runs with similar matrix
  - Stored in “human-readable” format
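A rough break-even estimate follows from these numbers under a simple model that is an assumption here, not something stated on the slide: tuning costs C untuned mat-vecs up front, and each tuned mat-vec then takes 1/s of the untuned time. Tuning pays off once n > C * s / (s - 1) multiplies are performed. The function name below is hypothetical.

```c
#include <assert.h>

/* Number of SpMV calls after which tuning has paid for itself, given a
 * one-time tuning cost (measured in untuned mat-vecs) and a speedup s > 1.
 * Derived from C + n/s < n, i.e. n > C*s/(s-1). */
static double breakeven_calls(double tuning_cost_in_matvecs, double speedup)
{
    assert(tuning_cost_in_matvecs >= 0.0);
    assert(speedup > 1.0);
    return tuning_cost_in_matvecs * speedup / (speedup - 1.0);
}
```

For a cost of 40 mat-vecs, a 2x speedup breaks even at 80 calls, and a 4x speedup at roughly 53 calls. This is one reason a workload estimate such as the 500-call hint in the interface example matters.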
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
  my_matmult( ptr, ind, val, alpha, x, beta, y );
r = ddot (x, y); /* Some dense BLAS op on vectors */
Interface supports legacy app migration
int* ptr = …, *ind = …; double* val = …; /* Matrix A, in CSR format */
double* x = …, *y = …; /* Vectors */
/* Step 1: Create OSKI wrappers */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                            SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Step 2: Call tune (with optional hints) */
oski_SetHintMatMult (A_tunable, …, 500);
oski_TuneMat (A_tunable);
/* Compute y = beta·y + alpha·A·x, 500 times */
for( i = 0; i < 500; i++ )
  oski_MatMult (A_tunable, OP_NORMAL, alpha, x_view, beta, y_view); /* Step 3 */
r = ddot (x, y);
Quick-and-dirty Parallelism: OSKI-PETSc
• Extend PETSc’s distributed memory SpMV (MATMPIAIJ)
• PETSc
  - Each process stores diag (all-local) and off-diag submatrices
• OSKI-PETSc:
  - Add OSKI wrappers
  - Each submatrix tuned independently
[Figure: matrix rows distributed across processes p0, p1, p2, p3]
OSKI-PETSc Proof-of-Concept Results
• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
  - N ~ 1 M, ~40 M non-zeros
  - 2x2 dense block substructure
  - Symmetric
• Matrix 2: Linear programming (Italian Railways)
  - Short-and-fat: 4k x 1M, ~11M non-zeros
  - Highly unstructured
  - Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
  - Peak: 4.8 Gflop/s per node
[Figure: Accelerator cavity matrix from SLAC’s T3P code]
Additional Features: OSKI-Lua
• Embedded scripting language for selecting customized, complex transformations
  - Mechanism to save/restore transformations
/* In "my_app.c" */
fp = fopen("my_xform.txt", "rt");
fgets(buffer, BUFSIZE, fp);
oski_ApplyMatTransform(A_tunable, buffer);
oski_MatMult(A_tunable, …);
# In file, "my_xform.txt"
# Compute Afast = P*A*P^T using Pinar’s reordering algorithm
A_fast, P = reorder_TSP(InputMat);
# Split Afast = A1 + A2, where A1 in 2x2 block format, A2 in CSR
A1, A2 = A_fast.extract_blocks(2, 2);
return transpose(P)*(A1+A2)*P;
Current Work and Future Directions
Current and Future Work on OSKI
• OSKI 1.0.1 at bebop.cs.berkeley.edu/oski
  - “Pre-alpha” version of OSKI-PETSc available; “Beta” for Kokkos (Trilinos)
• Future work
  - Evaluation on full solves/apps
    - Bay area lithography shop: 2x speedup in full solve
  - Code generators
  - Studying use of higher-level OSKI kernels
  - Port to additional architectures (e.g., vectors, SMPs)
  - Additional heuristics [Buttari, et al. (2005)]
• Many BeBOP projects ongoing
  - SpMV benchmark for HPC-Challenge [Gahvari & Hoemmen]
  - Evaluation of Cell [Williams]
  - Higher-level kernels, solvers [Hoemmen, Nishtala]
  - Tuning collective communications [Nishtala]
  - Cache-oblivious stencils [Kamil]
ROSE: A Compiler-Based Approach to Tuning General Applications
• ROSE: Tool for building customized source-to-source tools (Quinlan, et al.)
  - Full support for C and C++; Fortran90 in development
  - Targets users with little or no compiler background
• Focus on performance optimization for scientific computing
  - Domain-specific analysis and optimizations
  - Object-oriented abstraction recognition
  - Rich loop-transformation support
  - Annotation language support
  - Additional infrastructure support for s/w assurance, testing, and debugging
• Toward an end-to-end empirical tuning compiler
  - Combines profiling, checkpointing, analysis, parameterized code generation, search
  - Joint work with Qing Yi (University of Texas at San Antonio)
• Sponsored by U.S. Department of Energy
ROSE Architecture
[Figure: ROSE architecture. Inputs: application, library interface, annotations. Front-end (EDG-based) produces the AST. Mid-end tools work on AST and source fragments: abstraction recognition, abstraction-aware analysis, abstraction elimination, extended traditional optimizations, source+AST transformations. Back-end emits the transformed application source.]