Automatic Performance Tuning
and
Sparse-Matrix-Vector-Multiplication (SpMV)
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr11
Outline
• Motivation for Automatic Performance Tuning
• Results for sparse matrix kernels
• OSKI = Optimized Sparse Kernel Interface
• Tuning Higher Level Algorithms
• Future Work, Class Projects
• BeBOP: Berkeley Benchmarking and Optimization Group
– Many results shown from current and former members
– Meet weekly T 12:30-2, in 511 Soda
Motivation for Automatic Performance Tuning
• Writing high performance software is hard
– Make programming easier while getting high speed
• Ideal: program in your favorite high level language
(Matlab, Python, PETSc…) and get a high fraction of
peak performance
• Reality: Best algorithm (and its implementation) can
depend strongly on the problem, computer
architecture, compiler,…
– Best choice can depend on knowing a lot of applied
mathematics and computer science
• How much of this can we teach?
• How much of this can we automate?
Examples of Automatic Performance Tuning (1)
• Dense BLAS
– Sequential
– PHiPAC (UCB), then ATLAS (UTK) (used in Matlab)
– math-atlas.sourceforge.net/
– Internal vendor tools
• Fast Fourier Transform (FFT) & variations
– Sequential and Parallel
– FFTW (MIT)
– www.fftw.org
• Digital Signal Processing
– SPIRAL: www.spiral.net (CMU)
• Communication Collectives (UCB, UTK)
• Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), …
• More projects, conferences, government reports, …
Examples of Automatic Performance Tuning (2)
• What do dense BLAS, FFTs, signal processing, MPI reductions
have in common?
– Can do the tuning off-line: once per architecture, algorithm
– Can take as much time as necessary (hours, a week…)
– At run-time, algorithm choice may depend only on few parameters
• Matrix dimension, size of FFT, etc.
Tuning Register Tile Sizes (Dense Matrix Multiply)
• 333 MHz Sun Ultra 2i
• 2-D slice of 3-D space; implementations color-coded by performance in Mflop/s
• 16 registers, but 2-by-3 tile size fastest
• Needle in a haystack
Example: Select a Matmul Implementation
Example: Support Vector Classification
Machine Learning in Automatic Performance Tuning
• References
– Statistical Models for Empirical Search-Based
Performance Tuning
(International Journal of High Performance Computing
Applications, 18 (1), pp. 65-94, February 2004)
Richard Vuduc, J. Demmel, and Jeff A. Bilmes.
– Predicting and Optimizing System Utilization and
Performance via Statistical Machine Learning
(Computer Science PhD Thesis, University of California,
Berkeley. UCB//EECS-2009-181 ) Archana Ganapathi
Examples of Automatic Performance Tuning (3)
• What do dense BLAS, FFTs, signal processing, MPI reductions
have in common?
– Can do the tuning off-line: once per architecture, algorithm
– Can take as much time as necessary (hours, a week…)
– At run-time, algorithm choice may depend only on few parameters
• Matrix dimension, size of FFT, etc.
• Can’t always do off-line tuning
– Algorithm and implementation may strongly depend on
data only known at run-time
– Ex: Sparse matrix nonzero pattern determines both best
data structure and implementation of sparse-matrix-vector multiplication (SpMV)
– Part of the search for the best algorithm must be done (very
quickly!) at run-time
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Linear Programming Matrix
…
A Sparse Matrix You Encounter Every Day
SpMV with Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)
for each row i
  for k = ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]
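For concreteness, here is a minimal C sketch of the CSR kernel above; the array names (ptr, ind, val) follow the slide, and the function name csr_spmv is just an illustrative choice, not part of any library.

#include <stddef.h>

/* y = y + A*x, with A in CSR format:
   ptr[i]..ptr[i+1]-1 index the nonzeros of row i,
   ind[k] is the column of the k-th nonzero, val[k] its value. */
void csr_spmv(int n, const int *ptr, const int *ind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            yi += val[k] * x[ind[k]];   /* 2 flops per nonzero */
        y[i] = yi;
    }
}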
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
• 8x8 dense
substructure
Taking advantage of block structure in SpMV
• Bottleneck is time to get matrix from memory
– Only 2 flops for each nonzero in matrix
• Don’t store each nonzero with index, instead store
each nonzero r-by-c block with index
– Storage drops by up to 2x, if rc >> 1, all 32-bit quantities
– Time to fetch matrix from memory decreases
• Change both data structure and algorithm
– Need to pick r and c
– Need to change algorithm accordingly
• In example, is r=c=8 best choice?
– Minimizes storage, so looks like a good idea…
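As a sketch of the r-by-c register-blocked format (often called BCSR), the loop below fixes the block dimensions at compile time (r = c = 2 here purely for illustration) and assumes the number of rows is a multiple of R; brow_ptr, bind and bval are hypothetical names for the block-row pointers, block column indices, and dense r-by-c blocks stored contiguously.

#include <stddef.h>

#define R 2
#define C 2

/* y += A*x with A stored as dense R-by-C blocks.
   brow_ptr[I]..brow_ptr[I+1]-1 index the blocks of block row I;
   bind[B] is the first column of block B; bval holds R*C values per block. */
void bcsr_spmv(int n_block_rows, const int *brow_ptr, const int *bind,
               const double *bval, const double *x, double *y)
{
    for (int I = 0; I < n_block_rows; I++) {
        double yy[R] = {0.0};                  /* R partial sums kept in registers */
        for (int B = brow_ptr[I]; B < brow_ptr[I+1]; B++) {
            const double *blk = &bval[(size_t)B * R * C];
            const double *xx  = &x[bind[B]];
            for (int i = 0; i < R; i++)        /* small fixed loops, easily unrolled */
                for (int j = 0; j < C; j++)
                    yy[i] += blk[i*C + j] * xx[j];
        }
        for (int i = 0; i < R; i++)
            y[I*R + i] += yy[i];
    }
}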
Speedups on Itanium 2: The Need for Search
[Figure: Mflop/s for every r x c register blocking of one matrix; best block size 4x2 vs. the reference CSR code]
Register Profile: Itanium 2
[Figure: register-blocking profile; best 1190 Mflop/s vs. 190 Mflop/s reference]
SpMV Performance (Matrix #2): Generation 1
[Figure: Power3 - 13% of peak, Power4 - 14%, Itanium 1 - 7%, Itanium 2 - 31%; best about 1.1 Gflop/s on Itanium 2]
Register Profiles: IBM and Intel IA-64
[Figure: Power3 - 17%, Power4 - 16%, Itanium 1 - 8%, Itanium 2 - 33%; best about 1.2 Gflop/s on Itanium 2]
Register Profiles: Sun and Intel x86
[Figure: Ultra 2i - 11%, Ultra 3 - 5%, Pentium III - 21%, Pentium III-M - 15%]
Another example of tuning challenges
• More complicated
non-zero structure
in general
• N = 16614
• NNZ = 1.1M
Zoom in to top corner
• More complicated
non-zero structure
in general
• N = 16614
• NNZ = 1.1M
3x3 blocks look natural, but…
• More complicated non-zero
structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of “fill-in”
Extra Work Can Improve Efficiency!
• More complicated non-zero
structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill-in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
– Actual Mflop rate is 1.5² = 2.25x higher
Automatic Register Block Size Selection
• Selecting the r x c block size
– Off-line benchmark
• Precompute Mflops(r,c) using dense A for each r x c
• Once per machine/architecture
– Run-time “search”
• Sample A to estimate Fill(r,c) for each r x c
– Run-time heuristic model
• Choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
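A minimal sketch of this heuristic, assuming the two inputs described above are already available: a benchmarked dense-mode profile mflops[r][c] (measured once per machine) and a sampled fill estimate fill[r][c] (estimated at run time). The array bounds and names are illustrative, not OSKI's.

#define MAX_R 8
#define MAX_C 8

/* Pick the register block size r x c that minimizes estimated time,
   time(r,c) ~ Fill(r,c) / Mflops(r,c).  mflops[][] comes from an off-line
   dense benchmark; fill[][] from run-time sampling of the matrix. */
void choose_block_size(const double mflops[MAX_R][MAX_C],
                       const double fill[MAX_R][MAX_C],
                       int *best_r, int *best_c)
{
    double best_time = -1.0;
    for (int r = 1; r <= MAX_R; r++)
        for (int c = 1; c <= MAX_C; c++) {
            double t = fill[r-1][c-1] / mflops[r-1][c-1];
            if (best_time < 0.0 || t < best_time) {
                best_time = t;
                *best_r = r;
                *best_c = c;
            }
        }
}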
Accurate and Efficient Adaptive Fill Estimation
• Idea: Sample matrix
– Fraction of matrix to sample: s ∈ [0,1]
– Cost ~ O(s * nnz)
– Control cost by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical confidence
intervals
– Idea: Monitor variance
• Cost of tuning
– Lower bound: convert matrix in 5 to 40 unblocked SpMVs
– Heuristic: 1 to 11 SpMVs
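The run-time fill estimate itself can be sketched as below: sample roughly a fraction s of the block rows, count how many r x c blocks would be created versus how many true nonzeros they cover, and scale. This illustrates the idea only, not OSKI's actual estimator; the CSR arrays follow the earlier sketch.

#include <stdlib.h>

/* Estimate Fill(r,c) = (entries stored after r x c blocking) / (true nonzeros),
   by examining roughly a fraction s of the block rows of a CSR matrix. */
double estimate_fill(int n, const int *ptr, const int *ind,
                     int r, int c, double s)
{
    int n_block_cols = (n + c - 1) / c;
    char *touched = calloc(n_block_cols, 1);      /* block columns seen in this block row */
    long blocks = 0, nnz_seen = 0;
    int stride = (s > 0.0 && s < 1.0) ? (int)(1.0 / s) : 1;

    for (int I = 0; I < n; I += r * stride) {     /* sampled block rows */
        for (int i = I; i < I + r && i < n; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;
                if (!touched[J]) { touched[J] = 1; blocks++; }
                nnz_seen++;
            }
        for (int i = I; i < I + r && i < n; i++)  /* clear marks for the next block row */
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                touched[ind[k] / c] = 0;
    }
    free(touched);
    return nnz_seen ? (double)(blocks * r * c) / (double)nnz_seen : 1.0;
}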
Accuracy of the Tuning Heuristics (1/4)
See p. 375 of Vuduc’s thesis for matrices
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
Accuracy of the Tuning Heuristics (2/4)
DGEMV
Accuracy of the Tuning Heuristics (2/4)
Upper Bounds on Performance for blocked SpMV
• P = (flops) / (time)
– Flops = 2 * nnz(A)
• Lower bound on time: Two main assumptions
– 1. Count memory ops only (streaming)
– 2. Count only compulsory, capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge minimum access "latency" α_i at the L_i cache, and α_mem at memory
– e.g., Saavedra-Barrera and PMaC MAPS benchmarks
Time = Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem
     = α_1 · Loads + Σ_{i=1..κ-1} (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ
Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne ’00]
Example: Bounds on Itanium 2
Summary of Other Sequential Performance Optimizations
• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Reordering to create dense structure + splitting: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
– A²·x: 2x over CSR, 1.5x over RB
– [A·x, A²·x, A³·x, …, Aᵏ·x]
Example: Sparse Triangular Factor
• Raefsky4 (structural
problem) + SuperLU +
colmmd
• N=19779, nnz=12.6 M
Dense trailing triangle:
dim=2268, 20% of total
nz
Can be as high as 90+%!
1.8x over CSR
Cache Optimizations for A·Aᵀ·x
• Cache-level: Interleave multiplication by A, Aᵀ
– Only fetch A from memory once
A·Aᵀ·x = [a_1 … a_n] · [a_1ᵀ; … ; a_nᵀ] · x = Σ_{i=1..n} a_i (a_iᵀ·x)
– Each a_iᵀ·x is a dot product; each update y += a_i·(a_iᵀ·x) is an "axpy"
• Register-level: a_iᵀ taken to be an r×c block row, or a diagonal row
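A minimal CSR-based sketch of the cache-level interleaving: the code below computes y = AᵀA·x, which has the same access pattern (each stored row is used for a dot product and then an axpy while it is still in cache); running it on the CSR storage of Aᵀ computes A·Aᵀ·x instead. Names follow the earlier CSR sketch; this is an illustration, not OSKI's kernel.

/* y = A^T * (A*x) for A in CSR with n rows and ncols columns;
   row i is used for a dot product (t = a_i^T x) and an axpy (y += t * a_i)
   while it is still in cache, so A is fetched from memory only once. */
void ata_spmv(int n, int ncols, const int *ptr, const int *ind,
              const double *val, const double *x, double *y)
{
    for (int j = 0; j < ncols; j++) y[j] = 0.0;
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* dot product with row i */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* axpy with row i */
            y[ind[k]] += t * val[k];
    }
}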
Example: Combining Optimizations (1/2)
• Register blocking, symmetry, multiple (k) vectors
– Three low-level tuning parameters: r, c, v
[Figure: Y += A·X with r×c register blocks of A and blocks of v out of the k vectors in X and Y]
Example: Combining Optimizations (2/2)
• Register blocking, symmetry, and multiple vectors
[Ben Lee @ UCB]
– Symmetric, blocked, 1 vector
• Up to 2.6x over nonsymmetric, blocked, 1 vector
– Symmetric, blocked, k vectors
• Up to 2.1x over nonsymmetric, blocked, k vectors
• Up to 7.3x over nonsymmetric, nonblocked, 1 vector
– Symmetric Storage: up to 64.7% savings
Potential Impact on Applications: Omega3P
• Application: accelerator cavity design [Ko]
• Relevant optimization techniques
– Symmetric storage
– Register blocking
– Reordering, to create more dense blocks
• Reverse Cuthill-McKee ordering to reduce bandwidth
– Do Breadth-First-Search, number nodes in reverse order visited
• Traveling Salesman Problem-based ordering to create blocks
– Nodes = columns of A
– Weights(u, v) = no. of nz u, v have in common
– Tour = ordering of columns
– Choose maximum weight tour
– See [Pinar & Heath '97]
• 2.1x speedup on Power 4 (caveat: SPMV not bottleneck)
Source: Accelerator Cavity Design Problem (Ko via Husbands)
Post-RCM Reordering
100x100 Submatrix Along Diagonal
“Microscopic” Effect of RCM Reordering
Before: Green + Red
After: Green + Blue
“Microscopic” Effect of Combined RCM+TSP Reordering
Before: Green + Red
After: Green + Blue
(Omega3P)
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for
user’s matrix & machine
– BLAS-style functionality: SpMV, A·x & Aᵀ·y, TrSV
– Hides complexity of run-time tuning
– Includes new, faster locality-aware kernels: AᵀA·x, Aᵏ·x
• Faster than standard implementations
– Up to 4x faster matvec, 1.8x trisolve, 4x AᵀA·x
• For "advanced" users & solver library writers
– Available as stand-alone library (OSKI 1.0.1h, 6/07)
– Available as PETSc extension (OSKI-PETSc .1d, 3/06)
– bebop.cs.berkeley.edu/oski
• Under development: pOSKI for multicore
How the OSKI Tunes (Overview)
[Diagram: At library install-time (offline): 1. Build for target architecture; 2. Benchmark generated code variants, producing benchmark data. At application run-time: the user's matrix, a workload from program monitoring, and history feed 1. Evaluate heuristic models and 2. Select data structure & code; the user gets back a matrix handle for kernel calls.]
Extensibility: Advanced users may write & dynamically add "code variants" and "heuristic models" to the system.
How the OSKI Tunes (Overview)
• At library build/install-time
– Pre-generate and compile code variants into dynamic libraries
– Collect benchmark data
• Measures and records speed of possible sparse data structure and
code variants on target architecture
– Installation process uses standard, portable GNU AutoTools
• At run-time
– Library “tunes” using heuristic models
• Models analyze user’s matrix & benchmark data to choose optimized
data structure and code
– Non-trivial tuning cost: up to ~40 mat-vecs
• Library limits the time it spends tuning based on estimated workload
– provided by user or inferred by library
• User may reduce cost by saving tuning results for application on
future runs with same or similar matrix
Optimizations in OSKI, so far
• Fully automatic heuristics for
– Sparse matrix-vector multiply
• Register-level blocking
• Register-level blocking + symmetry + multiple vectors
• Cache-level blocking
– Sparse triangular solve with register-level blocking and "switch-to-dense" optimization
– Sparse AᵀA·x with register-level blocking
• User may select other optimizations manually
– Diagonal storage optimizations, reordering, splitting; tiled matrix
powers kernel (Aᵏ·x)
– All available in dynamic libraries
– Accessible via high-level embedded script language
• “Plug-in” extensibility
– Very advanced users may write their own heuristics, create new data
structures/code variants and dynamically add them to the system
• pOSKI under construction
– Optimizations for multicore – more later
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
my_matmult( ptr, ind, val, α, x, β, y );
How to Call OSKI: Basic Usage
• May gradually migrate existing apps
– Step 1: “Wrap” existing data structures
– Step 2: Make BLAS-like kernel calls
int* ptr = …, *ind = …; double* val = …; /* Matrix, in CSR format */
double* x = …, *y = …; /* Let x and y be two dense vectors */
/* Step 1: Create OSKI wrappers around this data */
oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows,
num_cols, SHARE_INPUTMAT, …);
oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);
/* Compute y = β·y + α·A·x, 500 times */
for( i = 0; i < 500; i++ )
oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view); /* Step 2 */
How to Call OSKI: Tune with Explicit Hints
• User calls “tune” routine
– May provide explicit tuning hints (OPTIONAL)
oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */
/* Tell OSKI we will call SpMV 500 times (workload hint) */
oski_SetHintMatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view, 500);
/* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);
oski_TuneMat(A_tunable); /* Ask OSKI to tune */
for( i = 0; i < 500; i++ )
oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
How the User Calls OSKI: Implicit Tuning
• Ask library to infer workload
– Library profiles all kernel calls
– May periodically re-tune
oski_matrix_t A_tunable = oski_CreateMatCSR( … );
/* … */
for( i = 0; i < 500; i++ ) {
oski_MatMult(A_tunable, OP_NORMAL, α, x_view, β, y_view);
oski_TuneMat(A_tunable); /* Ask OSKI to tune */
}
Quick-and-dirty Parallelism: OSKI-PETSc
• Extend PETSc's distributed memory SpMV (MATMPIAIJ)
• PETSc
– Each process stores diag (all-local) and off-diag submatrices
• OSKI-PETSc:
– Add OSKI wrappers
– Each submatrix tuned independently
[Figure: matrix rows distributed across processes p0-p3, each owning a diagonal and an off-diagonal submatrix]
OSKI-PETSc Proof-of-Concept Results
• Matrix 1: Accelerator cavity design (R. Lee @ SLAC)
– N ~ 1 M, ~40 M non-zeros
– 2x2 dense block substructure
– Symmetric
• Matrix 2: Linear programming (Italian Railways)
– Short-and-fat: 4k x 1M, ~11M non-zeros
– Highly unstructured
– Big speedup from cache-blocking: no native PETSc format
• Evaluation machine: Xeon cluster
– Peak: 4.8 Gflop/s per node
Accelerator Cavity Matrix
OSKI-PETSc Performance: Accel. Cavity
Linear Programming Matrix
…
OSKI-PETSc Performance: LP Matrix
Tuning SpMV for Multicore
Architectural Review
Cost of Memory Access in Multicore
• When physically partitioned, cache or memory access is non-uniform
(latency and bandwidth to memory/cache addresses varies)
– NUMA = Non-Uniform Memory Access
– NUCA = Non-Uniform Cache Access
• UCA & UMA architecture:
[Diagram: four cores sharing a single cache and a single DRAM]
Source: Sam Williams
Cost of Memory Access in Multicore
• When physically partitioned, cache or memory access is non-uniform
(latency and bandwidth to memory/cache addresses varies)
– NUMA = Non-Uniform Memory Access
– NUCA = Non-Uniform Cache Access
• NUCA & UMA architecture:
[Diagram: four cores, each with its own cache, connected through a memory controller hub to a single DRAM]
Source: Sam Williams
Cost of Memory Access in Multicore
• When physically partitioned, cache or memory access is non-uniform
(latency and bandwidth to memory/cache addresses varies)
– NUMA = Non-Uniform Memory Access
– NUCA = Non-Uniform Cache Access
• NUCA & NUMA architecture:
[Diagram: four cores, each with its own cache, split between two DRAMs]
Source: Sam Williams
Cost of Memory Access in Multicore
• Proper cache locality to maximize performance
[Diagram: each core working on data resident in its own cache]
Source: Sam Williams
Cost of Memory Access in Multicore
• Proper DRAM locality to maximize performance
[Diagram: each core working on data resident in its local DRAM]
Source: Sam Williams
Multicore SMPs of Interest
Multicore SMPs Used
• Intel Xeon E5345 (Clovertown)
• AMD Opteron 2356 (Barcelona)
• Sun T2+ T5140 (Victoria Falls)
• IBM QS20 Cell Blade
Source: Sam Williams
Multicore SMPs Used (conventional cache-based memory hierarchy)
Multicore SMPs Used (local store-based memory hierarchy)
Multicore SMPs Used (CMT = Chip-MultiThreading)
Multicore SMPs Used (Non-Uniform Memory Access - NUMA)
[Each of these figures highlights the subset of the four machines with that property]
Multicore SMPs Used (threads, peak double-precision flops, total DRAM bandwidth; *SPEs only)
Intel Xeon E5345 (Clovertown):     8 threads,  75 GFlop/s,  21 GB/s (read) / 10 GB/s (write)
AMD Opteron 2356 (Barcelona):      8 threads,  74 GFlop/s,  21 GB/s
Sun T2+ T5140 (Victoria Falls):  128 threads,  19 GFlop/s,  42 GB/s (read) / 21 GB/s (write)
IBM QS20 Cell Blade:             16* threads,  29* GFlop/s, 51 GB/s
Source: Sam Williams
Results from
“Auto-tuning Sparse Matrix-Vector
Multiplication (SpMV)”
Samuel Williams, Leonid Oliker, Richard Vuduc,
John Shalf, Katherine Yelick, James Demmel,
"Optimization of Sparse Matrix-Vector
Multiplication on Emerging Multicore Platforms",
Supercomputing (SC), 2007.
Test matrices
• Suite of 14 matrices
• All bigger than the caches of our SMPs
• We'll also include a median performance number
• Dense: 2K x 2K dense matrix stored in sparse format
• Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator
• Poorly structured hodgepodge: Circuit, webbase
• Extreme aspect ratio (linear programming): LP
Source: Sam Williams
SpMV Parallelization
• How do we parallelize a matrix-vector multiplication?
• By row blocks
• No inter-thread data dependencies, but random access to x
[Figure: matrix rows partitioned into contiguous blocks, one per thread (threads 0-3)]
Source: Sam Williams
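A minimal sketch of parallelization by row blocks; the slides use Pthreads, but OpenMP is used here for brevity, and the static contiguous split of rows is only illustrative.

#include <omp.h>

/* y = A*x with A in CSR, rows divided into contiguous blocks, one per thread.
   No inter-thread dependencies on y; x is read (randomly) by all threads. */
void csr_spmv_parallel(int n, const int *ptr, const int *ind,
                       const double *val, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)   /* contiguous row blocks */
    for (int i = 0; i < n; i++) {
        double yi = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            yi += val[k] * x[ind[k]];
        y[i] = yi;
    }
}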
SpMV Performance (simple parallelization)
• Out-of-the-box SpMV performance on a suite of 14 matrices
• Simplest solution = parallelization by rows (solves data dependency challenge)
• Scalability isn't great
• Is this performance good?
[Figure: bars for the naïve serial and naïve Pthreads implementations on each platform]
Source: Sam Williams
NUMA (Data Locality for Matrices)
• On NUMA architectures, all large arrays should be partitioned either
– explicitly (multiple malloc()'s + affinity)
– implicitly (parallelize initialization and rely on first touch)
• You cannot partition on granularities less than the page size
– 512 elements on x86
– 2M elements on Niagara
• For SpMV, partition the matrix and perform multiple malloc()'s
• Pin submatrices so they are co-located with the cores tasked to process them
Source: Sam Williams
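As a sketch of the implicit (first-touch) option above: initialize each array in parallel with the same loop partitioning the SpMV will later use, so each page is first touched, and therefore placed, by the thread that will process it. OpenMP is used for brevity; this is an illustration, not the code from the study.

#include <omp.h>
#include <stdlib.h>

/* Allocate and first-touch a vector so its pages land in the DRAM nearest
   the thread that will later use them (relies on first-touch page placement). */
double *numa_alloc_vector(int n)
{
    double *v = malloc((size_t)n * sizeof(double));
    #pragma omp parallel for schedule(static)   /* same partitioning as the SpMV loop */
    for (int i = 0; i < n; i++)
        v[i] = 0.0;                             /* first touch places the page */
    return v;
}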
Prefetch for SpMV
• SW prefetch injects more MLP into the
memory subsystem.
• Supplement HW prefetchers
• Can try to prefetch the
– values
– indices
– source vector
– or any combination thereof
• In general, should only insert one
prefetch per cache line (works best on
unrolled code)
Source: Sam Williams
for (all block rows r) {            /* 4x1 register-blocked SpMV */
  y0 = 0.0;
  y1 = 0.0;
  y2 = 0.0;
  y3 = 0.0;
  for (all 4x1 tiles i in this block row) {
    PREFETCH(V + i + PFDistance);   /* software prefetch of values, PFDistance ahead */
    y0 += V[i  ] * X[C[i]];
    y1 += V[i+1] * X[C[i]];
    y2 += V[i+2] * X[C[i]];
    y3 += V[i+3] * X[C[i]];
  }
  y[r+0] = y0;
  y[r+1] = y1;
  y[r+2] = y2;
  y[r+3] = y3;
}
SpMV Performance
(NUMA and Software Prefetching)
• NUMA-aware allocation is essential on memory-bound NUMA SMPs.
• Explicit software prefetching can boost bandwidth and change cache replacement policies
• Cell PPEs are likely latency-limited.
• Used exhaustive search for best prefetch distance
Source: Sam Williams
Matrix Compression
• Goal: minimize memory traffic
• Register blocking
– Choose block size to minimize memory traffic
– Only power-of-2 block sizes
– Simplifies search, achieves most of the possible speedup
• Shorter indices
– 32-bit, or 16-bit if possible
• Different sparse matrix formats
– BCSR – Block compressed sparse row
• Like CSR but with register blocks
– BCOO – Block coordinate
• Stores row and column index of each register block
• Better on very sparse sub-blocks (see cache blocking later)
SpMV Performance
(Matrix Compression)
• After maximizing memory bandwidth, the only hope is to minimize memory traffic.
• Exploit:
– register blocking
– other formats
– smaller indices
• Use a traffic minimization heuristic rather than search
• Benefit is clearly matrix-dependent.
• Register blocking enables efficient software prefetching (one per cache line)
Source: Sam Williams
Cache blocking for SpMV
(Data Locality for Vectors)
• Store entire submatrices contiguously
• The columns spanned by each cache block are selected to use the same space
in cache, i.e. access the same number of x(i)
• TLB blocking is a similar concept, but instead of 8-byte granularities
it uses 4KB granularities
[Figure: matrix tiled into cache blocks; rows assigned to threads 0-3]
Source: Sam Williams
Auto-tuned SpMV Performance (cache and TLB blocking)
• Fully auto-tuned SpMV performance across the suite of matrices
• Why do some optimizations work better on some architectures?
– matrices with naturally small working sets
– architectures with giant caches
[Figure legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Source: Sam Williams
Auto-tuned SpMV Performance (architecture specific optimizations)
• Fully auto-tuned SpMV performance across the suite of matrices
• Included SPE/local store optimized version
• Why do some optimizations work better on some architectures?
[Figure legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Source: Sam Williams
Auto-tuned SpMV Performance (max speedup)
• Maximum speedups: 2.7x, 2.9x, 4.0x, 35x (one per platform, as annotated in the figure)
• Fully auto-tuned SpMV performance across the suite of matrices
• Included SPE/local store optimized version
• Why do some optimizations work better on some architectures?
[Figure legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking]
Source: Sam Williams
Tuning Higher Level Algorithms than SpMV
• We almost always do many SpMVs, not just one
– “Krylov Subspace Methods” (KSMs) for Ax=b, Ax = λx
• Conjugate Gradients, GMRES, Lanczos, …
– Do a sequence of k SpMVs to get vectors [x1 , … , xk]
– Find best solution x as linear combination of [x1 , … , xk]
• Main cost is k SpMVs
• Since communication usually dominates, can we do better?
• Goal: make communication cost independent of k
– Parallel case: O(log P) messages, not O(k log P) - optimal
• same bandwidth as before
– Sequential case: O(1) messages and bandwidth, not O(k) - optimal
• Achievable when matrix partitionable with low surface-to-volume ratio
Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Example: A tridiagonal, n=32, k=3
• Works for any "well-partitioned" A
[Figure: rows of points for x, A·x, A²·x, A³·x over columns 1 … 32; each entry depends on its neighbors in the level below]
Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Sequential Algorithm: Steps 1-4, one block of columns at a time
• Example: A tridiagonal, n=32, k=3
[Figure: the same dependency diagram with columns 1 … 32 split into four blocks, Step 1 through Step 4; each step computes all three levels for its block before moving on]
Communication Avoiding Kernels:
The Matrix Powers Kernel: [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel Algorithm: Procs 1-4, one block of columns per processor
• Example: A tridiagonal, n=32, k=3
• Entries in overlapping regions (triangles) computed redundantly
[Figure: the same dependency diagram with columns 1 … 32 split among Proc 1 through Proc 4; each processor computes all three levels for its block, and the overlapping triangles at block boundaries are computed redundantly]
Locally Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: levels x, Ax, A²x, …, A⁸x for Proc 1 and Proc 2; shaded entries can be computed without communication]
Can be computed without communication
k=8 fold reuse of A
Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: levels x, Ax, A²x, …, A⁸x for Proc 1 and Proc 2, showing the extra entries near the processor boundary that depend on remote data]
One message to get data needed to compute remotely dependent entries, not k=8
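A sketch of this idea for the tridiagonal example, in C with MPI: each process fetches k ghost entries from each neighbor once, then computes all k levels on its extended local block with no further communication. The partitioning and names are illustrative only; it assumes nloc >= k.

#include <mpi.h>
#include <stdlib.h>

/* One-message matrix powers kernel for a tridiagonal A applied as the stencil
   (A*x)[i] = l*x[i-1] + d*x[i] + u*x[i+1].
   Each process owns nloc consecutive entries of x; it fetches k ghost entries
   from each neighbor ONCE, then computes A^1 x ... A^k x locally.  Returns a
   (k+1) x (nloc + 2k) level-major array; the owned range of level j starts at
   offset j*(nloc+2k) + k.  Caller free()s the result. */
double *matrix_powers_tridiag(const double *xloc, int nloc, int k,
                              double l, double d, double u, MPI_Comm comm)
{
    int rank, nproc;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nproc);
    int left  = (rank > 0)         ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nproc - 1) ? rank + 1 : MPI_PROC_NULL;

    int w = nloc + 2 * k;                       /* width of the extended block */
    double *lev = calloc((size_t)(k + 1) * w, sizeof(double));

    /* Level 0: owned entries in the middle, ghosts exchanged once. */
    for (int i = 0; i < nloc; i++) lev[k + i] = xloc[i];
    MPI_Sendrecv(xloc + nloc - k, k, MPI_DOUBLE, right, 0,
                 lev,             k, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(xloc,            k, MPI_DOUBLE, left,  1,
                 lev + k + nloc,  k, MPI_DOUBLE, right, 1, comm, MPI_STATUS_IGNORE);

    /* Levels 1..k: the valid region shrinks by one on each interior side per level. */
    for (int j = 1; j <= k; j++) {
        int lo = (left  == MPI_PROC_NULL) ? k        : j;
        int hi = (right == MPI_PROC_NULL) ? k + nloc : w - j;
        for (int i = lo; i < hi; i++) {
            double xm = (i > 0)     ? lev[(j-1)*w + i - 1] : 0.0;
            double xp = (i < w - 1) ? lev[(j-1)*w + i + 1] : 0.0;
            lev[j*w + i] = l * xm + d * lev[(j-1)*w + i] + u * xp;
        }
    }
    return lev;
}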
Fewer Remotely Dependent Entries for [x, Ax, …, A⁸x], A tridiagonal, 2 processors
[Figure: the same diagram, with part of the redundant boundary work shifted back to the owning processor]
Reduce redundant work by half
Remotely Dependent Entries for [x, Ax, A²x, A³x], 2D Laplacian
Remotely Dependent Entries for [x, Ax, A²x, A³x],
A irregular, multiple processors
Performance Results
• Measured Multicore (Clovertown) speedups up to 6.4x
• Measured/Modeled sequential OOC speedup up to 3x
• Modeled parallel Petascale speedup up to 6.9x
• Modeled parallel Grid speedup up to 22x
• Sequential speedup due to bandwidth, works for many
problem sizes
• Parallel speedup due to latency, works for smaller problems
on many processors
Speedups on Intel Clovertown (8 core)
Avoiding Communication in Iterative Linear Algebra
• k-steps of typical iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with starting vector
– Finds “best” solution among all linear combinations of these k+1 vectors
– Many such “Krylov Subspace Methods”
• Conjugate Gradients, GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication in Krylov Subspace Methods
– Assume matrix “well-partitioned,” with modest surface-to-volume ratio
– Parallel implementation
• Conventional: O(k log p) messages, because k calls to SpMV
• New: O(log p) messages - optimal
– Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• New: O(1) moves of data – optimal
• Lots of speed up possible (modeled and measured)
– Price: some redundant computation
• Much prior work
– See bebop.cs.berkeley.edu
– CG: [van Rosendale, 83], [Chronopoulos and Gear, 89]
– GMRES: [Walker, 88], [Joubert and Carey, 92], [Bai et al., 94]
Minimizing Communication of GMRES to solve Ax=b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing || Ax-b ||₂
• Cost of k steps of standard GMRES vs new GMRES
Standard GMRES
for i=1 to k
  w = A · v(i-1)
  MGS(w, v(0),…,v(i-1))
  update v(i), H
endfor
solve LSQ problem with H
Sequential: #words_moved = O(k·nnz) from SpMV + O(k²·n) from MGS
Parallel: #messages = O(k) from SpMV + O(k² · log p) from MGS
Communication-avoiding GMRES
W = [ v, Av, A²v, … , Aᵏv ]
[Q,R] = TSQR(W) … "Tall Skinny QR"
Build H from R, solve LSQ problem
Sequential: #words_moved = O(nnz) from SpMV + O(k·n) from TSQR
Parallel: #messages = O(1) from computing W + O(log p) from TSQR
• Oops – W from power method, precision lost!
"Monomial" basis [Ax, …, Aᵏx] fails to converge
A different polynomial basis does converge
Speed ups of GMRES on 8-core Intel Clovertown
Requires co-tuning kernels [MHDY09]
Extensions
• Other Krylov methods
– Arnoldi, CG, Lanczos, …
• Preconditioning
– Solve MAx=Mb where preconditioning matrix M chosen to make
system “easier”
• M approximates A-1 somehow, but cheaply, to accelerate convergence
– Cheap as long as contributions from “distant” parts of the system
can be compressed
• Sparsity
• Low rank
– No implementations yet (class projects!)
Design Space for [x,Ax,…,Akx] (1/3)
• Mathematical Operation
– How many vectors to keep
• All: Krylov Subspace Methods
• Keep last vector Aᵏx only (Jacobi, Gauss-Seidel)
– Improving conditioning of basis
• W = [x, p1(A)x, p2(A)x,…,pk(A)x]
• pi(A) = degree i polynomial chosen to reduce cond(W)
– Preconditioning (Ay=b → MAy=Mb)
• [x, Ax, MAx, AMAx, MAMAx, …, (MA)ᵏx]
Design Space for [x,Ax,…,Akx] (2/3)
• Representation of sparse A
– Zero pattern may be explicit or implicit
– Nonzero entries may be explicit or implicit
• Implicit  save memory, communication
                    Explicit pattern          Implicit pattern
Explicit nonzeros   General sparse matrix     Image segmentation
Implicit nonzeros   Laplacian(graph),         "Stencil matrix",
                    Multigrid (AMR)           Ex: tridiag(-1,2,-1)
• Representation of dense preconditioners M
– Low rank off-diagonal blocks (semiseparable)
Design Space for [x,Ax,…,Akx] (3/3)
• Parallel implementation
– From simple indexing, with redundant flops proportional to surface/volume ratio
– To complicated indexing, with fewer redundant flops
• Sequential implementation
– Depends on whether vectors fit in fast memory
• Reordering rows, columns of A
– Important in parallel and sequential cases
– Can be reduced to pair of Traveling Salesmen Problems
• Plus all the optimizations for one SpMV!
Summary
• Communication-Avoiding Linear Algebra (CALA)
• Lots of related work
– Some going back to 1960’s
– Reports discuss this comprehensively, not here
• Our contributions
– Several new algorithms, improvements on old ones
• Preconditioning
– Unifying parallel and sequential approaches to avoiding
communication
– Time for these algorithms has come, because of growing
communication costs
– Why avoid communication just for linear algebra motifs?
Possible Class Projects
• Come to BEBOP meetings (T 12:30 – 2, 511 Soda)
• Experiment with SpMV on GPU
– Which optimizations are most effective?
• Try to speed up particular matrices of interest
– Data mining
• Experiment with new [x,Ax,…,Akx] kernel
– GPU, multicore, distributed memory
– On matrices of interest
• “Bottom solver” in multigrid / AMR (Chombo)
– Experiment with solvers using this kernel
• New Krylov subspace methods, preconditioning
• Experiment with new frameworks (SPF)
• See proposals for details
Extra Slides
Optimizing Communication Complexity of
Sparse Solvers
• Need to modify high level algorithms to use new
kernel
• Example: GMRES for Ax=b where A= 2D Laplacian
– x lives on n-by-n mesh
– Partitioned on a p^(1/2)-by-p^(1/2) processor grid
– A has “5 point stencil” (Laplacian)
• (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
– Ex: 18-by-18 mesh on 3-by-3 processor grid
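For reference, here is a minimal serial sketch of applying a 5-point Laplacian stencil on an n-by-n mesh, the operation each processor performs on its local subgrid in this example; the ±1 stencil weights and the treat-outside-as-zero boundary handling are illustrative assumptions.

/* y = A*x for the 2D 5-point Laplacian on an n-by-n mesh (row-major),
   (A*x)(i,j) = 4*x(i,j) - x(i-1,j) - x(i+1,j) - x(i,j-1) - x(i,j+1),
   with entries outside the mesh treated as zero. */
void laplacian_5pt(int n, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double v = 4.0 * x[i*n + j];
            if (i > 0)     v -= x[(i-1)*n + j];
            if (i < n - 1) v -= x[(i+1)*n + j];
            if (j > 0)     v -= x[i*n + j - 1];
            if (j < n - 1) v -= x[i*n + j + 1];
            y[i*n + j] = v;
        }
}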
Minimizing Communication
• What is the cost = (#flops, #words, #mess) of s steps of
standard GMRES?
GMRES, ver.1:
for i=1 to s
  w = A * v(i-1)
  MGS(w, v(0),…,v(i-1))
  update v(i), H
endfor
solve LSQ problem with H
[Figure: each processor owns an (n/p^(1/2))-by-(n/p^(1/2)) block of the mesh]
• Cost(A * v) = s * ( 9n²/p, 4n/p^(1/2), 4 )
• Cost(MGS = Modified Gram-Schmidt) = s²/2 * ( 4n²/p, log p, log p )
• Total cost ~ Cost(A * v) + Cost(MGS)
• Can we reduce the latency?
Minimizing Communication
• Cost(GMRES, ver.1) = Cost(A*v) + Cost(MGS)
  = ( 9sn²/p, 4sn/p^(1/2), 4s ) + ( 2s²n²/p, s² log p / 2, s² log p / 2 )
• How much latency cost from A*v can you avoid? Almost all
GMRES, ver. 2:
W = [ v, Av, A²v, … , Aˢv ]
[Q,R] = MGS(W)
Build H from R, solve LSQ problem
• Cost(W) = ( ~ same, ~ same, 8 )
• Latency cost independent of s – optimal
• Cost(MGS) unchanged
• Can we reduce the latency more?
[Figure: boundary layers exchanged once to compute W, shown for s=3]
Minimizing Communication
• Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
  = ( 9sn²/p, 4sn/p^(1/2), 8 ) + ( 2s²n²/p, s² log p / 2, s² log p / 2 )
• How much latency cost from MGS can you avoid? Almost all
GMRES, ver. 3:
W = [ v, Av, A²v, … , Aˢv ]
[Q,R] = TSQR(W) … "Tall Skinny QR" (See Lecture 11)
Build H from R, solve LSQ problem
[Figure: TSQR reduction tree — W = [W1; W2; W3; W4] → R1, R2, R3, R4 → R12, R34 → R1234]
• Cost(TSQR) = ( ~ same, ~ same, log p )
• Latency cost independent of s – optimal
• Oops – W from power method, precision lost!
Minimizing Communication
• Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
  = ( 9sn²/p, 4sn/p^(1/2), 8 ) + ( 2s²n²/p, s² log p / 2, log p )
• Latency cost independent of s, just log p – optimal
• Oops – W from power method, so precision lost – What to do?
• Use a different polynomial basis
• Not Monomial basis W = [v, Av, A²v, …], instead …
• Newton Basis WN = [v, (A – θ₁I)v, (A – θ₂I)(A – θ₁I)v, …] or
• Chebyshev Basis WC = [v, T₁(v), T₂(v), …]
Tuning Higher Level Algorithms
• So far we have tuned a single sparse matrix kernel
• y = Aᵀ*A*x motivated by higher level algorithm (SVD)
• What can we do by extending tuning to a higher level?
• Consider Krylov subspace methods for Ax=b, Ax = λx
• Conjugate Gradients (CG), GMRES, Lanczos, …
• Inner loop does y=A*x, dot products, saxpys, scalar ops
• Inner loop costs at least O(1) messages
• k iterations cost at least O(k) messages
• Our goal: show how to do k iterations with O(1) messages
• Possible payoff – make Krylov subspace methods much
faster on machines with slow networks
• Memory bandwidth improvements too (not discussed)
• Obstacles: numerical stability, preconditioning, …
Krylov Subspace Methods for Solving Ax=b
• Compute a basis for a subspace V by doing y = A*x
k times
• Find “best” solution in that Krylov subspace V
• Given starting vector x1, V spanned by x2 = A*x1, x3
= A*x2 , … , xk = A*xk-1
• GMRES finds an orthogonal basis of V by
Gram-Schmidt, so it actually does a different set of
SpMVs than in last bullet
Example: Standard GMRES
r = b - A*x1, β = length(r), v1 = r / β        … length(r) = sqrt(Σ ri²)
for m = 1 to k do
  w = A*vm                                      … at least O(1) messages
  for i = 1 to m do                             … Gram-Schmidt
    him = dotproduct(vi, w)                     … at least O(1) messages, or log(p)
    w = w – him * vi
  end for
  hm+1,m = length(w)                            … at least O(1) messages, or log(p)
  vm+1 = w / hm+1,m
end for
find y minimizing length( Hk * y – e1 )         … small, local problem
new x = x1 + Vk * y                             … Vk = [v1, v2, …, vk]
O(k²), or O(k² log p), messages altogether
Example: Computing [Ax, A²x, A³x, …, Aᵏx] for A tridiagonal
Different basis for same Krylov subspace
What can Proc 1 compute without communication?
[Figure: levels x(1:30), (Ax)(1:30), (A²x)(1:30), …, (A⁸x)(1:30) split between Proc 1 and Proc 2]
Example: Computing [Ax, A²x, A³x, …, Aᵏx] for A tridiagonal
Computing missing entries with 1 communication, redundant work
[Figure: same diagram, showing the boundary entries Proc 1 recomputes after one exchange]
Example: Computing [Ax, A²x, A³x, …, Aᵏx] for A tridiagonal
Saving half the redundant work
[Figure: same diagram with half of the redundant boundary work shifted to Proc 2]
Example: Computing [Ax, A²x, A³x, …, Aᵏx] for Laplacian
A = 5pt Laplacian in 2D,
Communicated point for k=3 shown
Latency-Avoiding GMRES (1)
r = b - A*x1, β = length(r), v1 = r / β         … O(log p) messages
Wk+1 = [v1, A*v1, A²*v1, …, Aᵏ*v1]              … O(1) messages
[Q, R] = qr(Wk+1)                                … QR decomposition, O(log p) messages
Hk = R(:, 2:k+1) * (R(1:k,1:k))⁻¹                … small, local problem
find y minimizing length( Hk * y – e1 )          … small, local problem
new x = x1 + Qk * y                              … local problem
O(log p) messages altogether – independent of k
Latency-Avoiding GMRES (2)
[Q, R] = qr(Wk+1)                                … QR decomposition, O(log p) messages
Easy, but not so stable way to do it:
X(myproc) = Wk+1ᵀ(myproc) * Wk+1(myproc)         … local computation
Y = sum_reduction(X(myproc))                     … O(log p) messages; Y = Wk+1ᵀ * Wk+1
R = (cholesky(Y))ᵀ                               … small, local computation
Q(myproc) = Wk+1(myproc) * R⁻¹                   … local computation
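A compact C/MPI sketch of this "CholeskyQR" factorization, kept self-contained by writing the small (k+1)x(k+1) Cholesky and triangular solve as plain loops rather than calling LAPACK; the row-major layouts and names are illustrative.

#include <mpi.h>
#include <math.h>
#include <string.h>

/* [Q,R] = qr(W) via CholeskyQR: Y = W^T W (one all-reduce), R^T R = Y, Q = W R^-1.
   W is mloc x c, row-major, distributed by rows; Q overwrites W; R is c x c,
   upper triangular, row-major.  Assumes c*c <= 64 for this sketch. */
void cholesky_qr(double *W, int mloc, int c, double *R, MPI_Comm comm)
{
    double Y[64];
    memset(Y, 0, sizeof(Y));
    for (int i = 0; i < mloc; i++)          /* local Gram matrix W(myproc)^T * W(myproc) */
        for (int a = 0; a < c; a++)
            for (int b = 0; b < c; b++)
                Y[a*c + b] += W[i*c + a] * W[i*c + b];
    MPI_Allreduce(MPI_IN_PLACE, Y, c*c, MPI_DOUBLE, MPI_SUM, comm);

    /* Upper-triangular R with R^T R = Y (small, done redundantly on every process). */
    memset(R, 0, (size_t)c * c * sizeof(double));
    for (int a = 0; a < c; a++) {
        double s = Y[a*c + a];
        for (int t = 0; t < a; t++) s -= R[t*c + a] * R[t*c + a];
        R[a*c + a] = sqrt(s);
        for (int b = a + 1; b < c; b++) {
            double v = Y[a*c + b];
            for (int t = 0; t < a; t++) v -= R[t*c + a] * R[t*c + b];
            R[a*c + b] = v / R[a*c + a];
        }
    }

    /* Q(myproc) = W(myproc) * R^-1: solve q*R = w for each local row by forward substitution. */
    for (int i = 0; i < mloc; i++)
        for (int a = 0; a < c; a++) {
            double v = W[i*c + a];
            for (int t = 0; t < a; t++) v -= W[i*c + t] * R[t*c + a];
            W[i*c + a] = v / R[a*c + a];
        }
}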
Numerical example (1)
Diagonal matrix with n=1000, Aii from 1 down to 10⁻⁵
Instability as k grows, after many iterations
Numerical Example (2)
Partial remedy: restarting periodically (every 120 iterations)
Other remedies: high precision, different basis than [v, A*v, …, Aᵏ*v]
Operation Counts for [Ax, A²x, A³x, …, Aᵏx] on p procs (per-proc cost, Standard vs Optimized)
1D mesh (tridiagonal):
  #messages:   2k  vs  2
  #words sent: 2k  vs  2k
  #flops:      5kn/p  vs  5kn/p + 5k²
  memory:      (k+4)n/p  vs  (k+4)n/p + 8k
3D mesh (27-pt stencil):
  #messages:   26k  vs  26
  #words sent: 6kn²p^(-2/3) + 12knp^(-1/3) + O(k)  vs  6kn²p^(-2/3) + 12k²np^(-1/3) + O(k³)
  #flops:      53kn³/p  vs  53kn³/p + O(k²n²p^(-2/3))
  memory:      (k+28)n³/p + 6n²p^(-2/3) + O(np^(-1/3))  vs  (k+28)n³/p + 168kn²p^(-2/3) + O(k²np^(-1/3))
Summary and Future Work
• Dense
– LAPACK
– ScaLAPACK
• Communication primitives
• Sparse
– Kernels, Stencils
– Higher level algorithms
• All of the above on new architectures
– Vector, SMPs, multicore, Cell, …
• High level support for tuning
– Specification language
– Integration into compilers
Extra Slides
A Sparse Matrix You Encounter Every Day
Who am I?
I am a
Big Repository
Of useful
And useless
Facts alike.
Who am I?
(Hint: Not your e-mail
inbox.)
What about the Google Matrix?
• Google approach
– Approx. once a month: rank all pages using connectivity structure
• Find dominant eigenvector of a matrix
– At query-time: return list of pages ordered by rank
– Matrix: A = aG + (1-a)(1/n)u·uᵀ
– Markov model: Surfer follows link with probability a, jumps to a
random page with probability 1-a
– G is n x n connectivity matrix [n ≈ billions]
• gij is non-zero if page i links to page j
• Normalized so each column sums to 1
• Very sparse: about 7—8 non-zeros per row (power law dist.)
– u is a vector of all 1 values
– Steady-state probability xi of landing on page i is solution to x = Ax
• Approximate x by power method: x = Akx0
– In practice, k ≈ 25
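As a minimal illustration of the power method x = Aᵏx₀ described above, with A applied implicitly as a·(G·x) plus (1-a) times the average of x, so the dense rank-one part is never formed; gmat_spmv stands in for any SpMV routine (e.g., the CSR kernel sketched earlier) and the names are hypothetical.

/* k sweeps of the power method for A = a*G + (1-a)*(1/n)*u*u^T, applied
   implicitly: x <- a*(G*x) + (1-a)*mean(x).  gmat_spmv(n, x, y) must
   compute y = G*x for the column-normalized link matrix G. */
void pagerank_power(int n, int k, double a,
                    void (*gmat_spmv)(int n, const double *x, double *y),
                    double *x, double *y)
{
    for (int it = 0; it < k; it++) {
        double mean = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        gmat_spmv(n, x, y);                      /* sparse part: G*x */
        for (int i = 0; i < n; i++)
            x[i] = a * y[i] + (1.0 - a) * mean;  /* add the dense rank-one part */
    }
}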
Current Work
• Public software release
• Impact on library designs: Sparse BLAS, Trilinos, PETSc, …
• Integration in large-scale applications
– DOE: Accelerator design; plasma physics
– Geophysical simulation based on Block Lanczos (AᵀA*X; LBL)
• Systematic heuristics for data structure selection?
• Evaluation of emerging architectures
– Revisiting vector micros
• Other sparse kernels
– Matrix triple products, Ak*x
• Parallelism
• Sparse benchmarks (with UTK) [Gahvari & Hoemmen]
• Automatic tuning of MPI collective ops [Nishtala, et al.]
Possible Future Work
• Different Architectures
– New FP instruction sets (SSE2)
– SMP / multicore platforms
– Vector architectures
• Different Kernels
• Higher Level Algorithms
– Parallel versions of kernels, with optimized communication
– Block algorithms (eg Lanczos)
– XBLAS (extra precision)
• Produce Benchmarks
– Augment HPCC Benchmark
• Make it possible to combine optimizations easily for any kernel
• Related tuning activities (LAPACK & ScaLAPACK)
Review of Tuning by Illustration
(Extra Slides)
Splitting for Variable Blocks and Diagonals
• Decompose A = A1 + A2 + … At
– Detect "canonical" structures (sampling)
– Split
– Tune each Ai
– Improve performance and save storage
• New data structures
– Unaligned block CSR
• Relax alignment in rows & columns
– Row-segmented diagonals
Example: Variable Block Row (Matrix #12)
2.1x
over CSR
1.8x
over RB
Example: Row-Segmented Diagonals
2x
over CSR
Mixed Diagonal and Block Structure
Summary
• Automated block size selection
– Empirical modeling and search
– Register blocking for SpMV, triangular solve, ATA*x
• Not fully automated
– Given a matrix, select splittings and transformations
• Lots of combinatorial problems
– TSP reordering to create dense blocks (Pinar ’97; Moon, et
al. ’04)
Extra Slides
Problem Context
• Sparse kernels abound
– Models of buildings, cars, bridges, economies, …
– Google PageRank algorithm
• Historical trends
– Sparse matrix-vector multiply (SpMV): 10% of peak
– 2x faster with "hand-tuning"
– Tuning becoming more difficult over time
– Promise of automatic tuning: PHiPAC/ATLAS, FFTW, …
• Challenges to high-performance
– Not dense linear algebra!
• Complex data structures: indirect, irregular memory access
• Performance depends strongly on run-time inputs
Key Questions, Ideas, Conclusions
• How to tune basic sparse kernels automatically?
– Empirical modeling and search
• Up to 4x speedups for SpMV
• 1.8x for triangular solve
• 4x for AᵀA*x; 2x for A²*x
• 7x for multiple vectors
• What are the fundamental limits on performance?
– Kernel-, machine-, and matrix-specific upper bounds
• Achieve 75% or more for SpMV, limiting low-level tuning
• Consequences for architecture?
• General techniques for empirical search-based tuning?
– Statistical models of performance
Road Map
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
• Upper bounds on performance
• Statistical models of performance
Compressed Sparse Row (CSR) Storage
Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)
for each row i
  for k = ptr[i] to ptr[i+1]-1 do
    y[i] = y[i] + val[k]*x[ind[k]]
Road Map
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
• Upper bounds on performance
• Statistical models of performance
Historical Trends in SpMV Performance
• The Data
– Uniprocessor SpMV performance since 1987
– "Untuned" and "Tuned" implementations
– Cache-based superscalar micros; some vectors
– LINPACK
SpMV Historical Trends: Mflop/s
Example: The Difficulty of Tuning
• n = 21216
• nnz = 1.5 M
• kernel: SpMV
• Source: NASA
structural analysis
problem
Still More Surprises
• More complicated non-zero
structure in general
Still More Surprises
• More complicated non-zero
structure in general
• Example: 3x3 blocking
– Logical grid of 3x3 cells
Historical Trends: Mixed News
• Observations
– Good news: Moore's law like behavior
– Bad news: "Untuned" is 10% peak or less, worsening
– Good news: "Tuned" roughly 2x better today, and improving
– Bad news: Tuning is complex
– (Not really news: SpMV is not LINPACK)
• Questions
– Application: Automatic tuning?
– Architect: What machines are good for SpMV?
Road Map
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
– SpMV [SC’02; IJHPCA ’04b]
– Sparse triangular solve (SpTS) [ICS/POHLL ’02]
– ATA*x [ICCS/WoPLA ’03]
• Upper bounds on performance
• Statistical models of performance
SPARSITY: Framework for Tuning SpMV
• SPARSITY: Automatic tuning for SpMV [Im & Yelick ’99]
– General approach
• Identify and generate implementation space
• Search space using empirical models & experiments
– Prototype library and heuristic for choosing register block size
• Also: cache-level blocking, multiple vectors
• What’s new?
– New block size selection heuristic
• Within 10% of optimal — replaces previous version
– Expanded implementation space
• Variable block splitting, diagonals, combinations
– New kernels: sparse triangular solve, AᵀA*x, Aʳ*x
Automatic Register Block Size Selection
• Selecting the r x c block size
– Off-line benchmark: characterize the machine
• Precompute Mflops(r,c) using dense matrix for each r x c
• Once per machine/architecture
– Run-time “search”: characterize the matrix
• Sample A to estimate Fill(r,c) for each r x c
– Run-time heuristic model
• Choose r, c to maximize Mflops(r,c) / Fill(r,c)
• Run-time costs
– Up to ~40 SpMVs (empirical worst case)
DGEMV
Accuracy of the Tuning Heuristics (1/4)
NOTE: “Fair” flops used (ops on explicit zeros not counted as “work”)
DGEMV
Accuracy of the Tuning Heuristics (2/4)
DGEMV
Accuracy of the Tuning Heuristics (3/4)
DGEMV
Accuracy of the Tuning Heuristics (4/4)
Road Map
•
•
•
•
Sparse matrix-vector multiply (SpMV) in a nutshell
Historical trends and the need for search
Automatic tuning techniques
Upper bounds on performance
– SC’02
• Statistical models of performance
Motivation for Upper Bounds Model
• Questions
– Speedups are good, but what is the speed limit?
• Independent of instruction scheduling, selection
– What machines are “good” for SpMV?
Upper Bounds on Performance: Blocked SpMV
• P = (flops) / (time)
– Flops = 2 * nnz(A)
• Lower bound on time: Two main assumptions
– 1. Count memory ops only (streaming)
– 2. Count only compulsory, capacity misses: ignore conflicts
• Account for line sizes
• Account for matrix size and nnz
• Charge min access "latency" α_i at the L_i cache, and α_mem at memory
– e.g., Saavedra-Barrera and PMaC MAPS benchmarks
Time = Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem
     = α_1 · Loads + Σ_{i=1..κ-1} (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ
Example: Bounds on Itanium 2
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
• Machine balance [Callahan ’88; McCalpin ’95]
– Balance = Peak Flop Rate / Bandwidth (flops / double)
• Ideal balance for mat-vec: ≤ 2 flops / double
– For SpMV, even less
Time = α_1 · Loads + Σ_i (α_{i+1} − α_i) · Misses_i + (α_mem − α_κ) · Misses_κ
• SpMV ~ streaming
– 1 / (avg load time to stream 1 array) ~ (bandwidth)
– “Sustained” balance = peak flops / model bandwidth
Where Does the Time Go?
Time = Σ_{i=1..κ} α_i · Hits_i + α_mem · Hits_mem
• Most time assigned to memory
• Caches “disappear” when line sizes are equal
– Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds
• What is the best we can do for SpMV?
– Limits to low-level tuning of blocked implementations
– Refinements?
• What machines are good for SpMV?
– Partial answer: balance characterization
• Architectural consequences?
– Example: Strictly increasing line sizes
Road Map
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
• Upper bounds on performance
• Tuning other sparse kernels
• Statistical models of performance
– FDO ’00; IJHPCA ’04a
Statistical Models for Automatic Tuning
• Idea 1: Statistical criterion for stopping a search
– A general search model
• Generate implementation
• Measure performance
• Repeat
– Stop when probability of being within ε of optimal falls below
threshold
• Can estimate distribution on-line
• Idea 2: Statistical performance models
– Problem: Choose 1 among m implementations at run-time
– Sample performance off-line, build statistical model
Example: Select a Matmul Implementation
Example: Support Vector Classification
Road Map
• Sparse matrix-vector multiply (SpMV) in a nutshell
• Historical trends and the need for search
• Automatic tuning techniques
• Upper bounds on performance
• Tuning other sparse kernels
• Statistical models of performance
• Summary and Future Work
Summary of High-Level Themes
• “Kernel-centric” optimization
– Vs. basic block, trace, path optimization, for instance
– Aggressive use of domain-specific knowledge
• Performance bounds modeling
– Evaluating software quality
– Architectural characterizations and consequences
• Empirical search
– Hybrid on-line/run-time models
• Statistical performance models
– Exploit information from sampling, measuring
Related Work
• My bibliography: 337 entries so far
• Sample area 1: Code generation
– Generative & generic programming
– Sparse compilers
– Domain-specific generators
• Sample area 2: Empirical search-based tuning
– Kernel-centric
• linear algebra, signal processing, sorting, MPI, …
– Compiler-centric
• profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization
Future Directions (A Bag of Flaky Ideas)
• Composable code generators and search spaces
• New application domains
– PageRank: multilevel block algorithms for topic-sensitive search?
• New kernels: cryptokernels
– rich mathematical structure germane to performance; lots of
hardware
• New tuning environments
– Parallel, Grid, “whole systems”
• Statistical models of application performance
– Statistical learning of concise parametric models from traces for
architectural evaluation
– Compiler/automatic derivation of parametric models
Acknowledgements
• Super-advisors: Jim and Kathy
• Undergraduate R.A.s: Attila, Ben, Jen, Jin, Michael,
Rajesh, Shoaib, Sriram, Tuyet-Linh
• See pages xvi—xvii of dissertation.
TSP-based Reordering: Before
(Pinar ’97;
Moon, et al ‘04)
TSP-based Reordering: After
(Pinar ’97;
Moon, et al ‘04)
Up to 2x
speedups
over CSR
Example: L2 Misses on Itanium 2
Misses measured using PAPI [Browne ’00]
Example: Distribution of Blocked Non-Zeros
Register Profile: Itanium 2
[Figure: register-blocking profile; best 1190 Mflop/s vs. 190 Mflop/s reference]
Register Profiles: Sun and Intel x86
[Figure: Ultra 2i - 11%, Ultra 3 - 5%, Pentium III - 21%, Pentium III-M - 15%]
Register Profiles: IBM and Intel IA-64
[Figure: Power3 - 17%, Power4 - 16%, Itanium 1 - 8%, Itanium 2 - 33%; best about 1.2 Gflop/s on Itanium 2]
Accurate and Efficient Adaptive Fill Estimation
• Idea: Sample matrix
– Fraction of matrix to sample: s ∈ [0,1]
– Cost ~ O(s * nnz)
– Control cost by controlling s
• Search at run-time: the constant matters!
• Control s automatically by computing statistical
confidence intervals
– Idea: Monitor variance
• Cost of tuning
– Lower bound: convert matrix in 5 to 40 unblocked SpMVs
– Heuristic: 1 to 11 SpMVs
Sparse/Dense Partitioning for SpTS
• Partition L into sparse (L1, L2) and dense LD:
  [ L1  0  ] [ x1 ]   [ b1 ]
  [ L2  LD ] [ x2 ] = [ b2 ]
• Perform SpTS in three steps:
  (1) L1 · x1 = b1
  (2) b̂2 = b2 − L2 · x1
  (3) LD · x2 = b̂2
• Sparsity optimizations for (1)—(2); DTRSV for (3)
• Tuning parameters: block size, size of dense triangle
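A small sketch of the three-step solve above, assuming the sparse trapezoid (L1, L2) is in CSR with sorted column indices (so each row's diagonal is its last stored entry) and the dense trailing triangle LD is stored row-major; in practice step (3) would call a tuned DTRSV, but plain loops keep the sketch self-contained. The splitting point ns and the array names are illustrative.

/* Solve L*x = b for lower triangular L split into a sparse trapezoid
   (rows 0..ns-1, plus the sub-block L2 in rows ns..n-1, both in CSR)
   and a dense nd x nd trailing triangle LD, nd = n - ns. */
void spts_hybrid(int n, int ns,
                 const int *ptr, const int *ind, const double *val,
                 const double *LD, const double *b, double *x)
{
    int nd = n - ns;

    /* (1) L1 * x1 = b1: sparse forward substitution on the first ns rows. */
    for (int i = 0; i < ns; i++) {
        double s = b[i];
        int kd = ptr[i+1] - 1;                 /* diagonal is the last entry of the row */
        for (int k = ptr[i]; k < kd; k++)
            s -= val[k] * x[ind[k]];
        x[i] = s / val[kd];
    }

    /* (2) bhat2 = b2 - L2*x1: rows ns..n-1 of the sparse part hold L2 only. */
    for (int i = ns; i < n; i++) {
        double s = b[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            s -= val[k] * x[ind[k]];           /* columns here are all < ns */
        x[i] = s;                              /* store bhat2 in x2 for step (3) */
    }

    /* (3) LD * x2 = bhat2: dense forward substitution on the trailing triangle. */
    for (int i = 0; i < nd; i++) {
        double s = x[ns + i];
        for (int j = 0; j < i; j++)
            s -= LD[i*nd + j] * x[ns + j];
        x[ns + i] = s / LD[i*nd + i];
    }
}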
SpTS Performance: Power3
Summary of SpTS and AAT*x Results
• SpTS — Similar to SpMV
– 1.8x speedups; limited benefit from low-level tuning
• AATx, ATAx
– Cache interleaving only: up to 1.6x speedups
– Reg + cache: up to 4x speedups
• 1.8x speedup over register only
– Similar heuristic; same accuracy (~ 10% optimal)
– Further from upper bounds: 60—80%
• Opportunity for better low-level tuning a la PHiPAC/ATLAS
• Matrix triple products? Ak*x?
– Preliminary work
Register Blocking: Speedup
Register Blocking: Performance
Register Blocking: Fraction of Peak
Example: Confidence Interval Estimation
Costs of Tuning
Splitting + UBCSR: Pentium III
Splitting + UBCSR: Power4
Splitting+UBCSR Storage: Power4
Example: Variable Block Row (Matrix #13)
Dense Tuning is Hard, Too
• Even dense matrix multiply can be notoriously
difficult to tune
Dense matrix multiply: surprising performance as register tile size varies.
Preliminary Results (Matrix Set 2): Itanium 2
[Figure: performance by matrix category — Dense, FEM, FEM (var), Bio, Econ, Stat, LP, Web/IR]
Multiple Vector Performance
What about the Google Matrix?
• Google approach
– Approx. once a month: rank all pages using connectivity structure
• Find dominant eigenvector of a matrix
– At query-time: return list of pages ordered by rank
• Matrix: A = aG + (1-a)(1/n)uuT
– Markov model: Surfer follows link with probability a, jumps to a
random page with probability 1-a
– G is n x n connectivity matrix [n ≈ 3 billion]
• gij is non-zero if page i links to page j
• Normalized so each column sums to 1
• Very sparse: about 7—8 non-zeros per row (power law dist.)
– u is a vector of all 1 values
– Steady-state probability xi of landing on page i is solution to x = Ax
• Approximate x by power method: x = A^k*x0
– In practice, k ≈ 25
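A minimal power-iteration sketch in C exploiting the rank-one structure, so each step costs one SpMV with G plus O(n) extra work: A*x = a*(G*x) + ((1-a)/n)*(sum of x)*u. The CSR layout (row i holding the links into page i, scaled so G is column-stochastic) and the function name are assumptions; no convergence test or renormalization is shown.

#include <stdlib.h>
#include <string.h>

void pagerank_power(int n, const int *rowptr, const int *colind,
                    const double *val, double a, int k, double *x)
{
    double *y = malloc(n * sizeof(double));
    for (int iter = 0; iter < k; iter++) {
        double xsum = 0.0;
        for (int i = 0; i < n; i++) xsum += x[i];
        double jump = (1.0 - a) * xsum / n;      /* the (1-a)(1/n)*u*(uT*x) term */
        for (int i = 0; i < n; i++) {            /* y = a*(G*x) + jump*u         */
            double t = 0.0;
            for (int j = rowptr[i]; j < rowptr[i + 1]; j++)
                t += val[j] * x[colind[j]];
            y[i] = a * t + jump;
        }
        memcpy(x, y, n * sizeof(double));        /* next iterate */
    }
    free(y);
}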
MAPS Benchmark Example: Power4
MAPS Benchmark Example: Itanium 2
Saavedra-Barrera Example: Ultra 2i
Summary of Results: Pentium III
Summary of Results: Pentium III (3/3)
Execution Time Breakdown (PAPI): Matrix 40
Preliminary Results (Matrix Set 1): Itanium 2
(matrix categories: Dense, FEM, FEM (var), Assorted, LP)
Tuning Sparse Triangular Solve (SpTS)
• Compute x = L^(-1)*b, where L is sparse lower triangular and x & b are dense
• L from sparse LU has rich dense substructure
– Dense trailing triangle can account for 20—90% of matrix
non-zeros
• SpTS optimizations
– Split into sparse trapezoid and dense trailing triangle
– Use tuned dense BLAS (DTRSV) on dense triangle
– Use Sparsity register blocking on sparse part
• Tuning parameters
– Size of dense trailing triangle
– Register block size
Sparse Kernels and Optimizations
• Kernels
– Sparse matrix-vector multiply (SpMV): y=A*x
– Sparse triangular solve (SpTS): x=T^(-1)*b
– y=AAT*x, y=ATA*x
– Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
• Optimization techniques (implementation space)
– Register blocking
– Cache blocking
– Multiple dense vectors (x)
– A has special structure (e.g., symmetric, banded, …)
– Hybrid data structures (e.g., splitting, switch-to-dense, …)
– Matrix reordering
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
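As a concrete instance of the register blocking listed above, here is a minimal 2x2 BCSR SpMV sketch in C: each block's four values are stored contiguously and accumulate into two register-resident sums. Sparsity/OSKI generate one such unrolled routine per (r, c); this hand-written version is illustrative only and assumes the matrix dimensions are multiples of 2.

void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const int *bcol_ind,
                   const double *bval, const double *x, double *y)
{
    for (int I = 0; I < nbrows; I++) {
        double y0 = 0.0, y1 = 0.0;                 /* two accumulators in registers */
        for (int k = brow_ptr[I]; k < brow_ptr[I + 1]; k++) {
            const double *b = bval + 4 * k;        /* the 2x2 block, row-major      */
            double x0 = x[2 * bcol_ind[k]];
            double x1 = x[2 * bcol_ind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * I]     = y0;
        y[2 * I + 1] = y1;
    }
}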
Cache Blocked SpMV on LSI Matrix: Ultra 2i
• A: 10k x 255k, 3.7M non-zeros
• Baseline: 16 Mflop/s
• Best block size & performance: 16k x 64k, 28 Mflop/s
Cache Blocking on LSI Matrix: Pentium 4
• A: 10k x 255k, 3.7M non-zeros
• Baseline: 44 Mflop/s
• Best block size & performance: 16k x 16k, 210 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium
• A: 10k x 255k, 3.7M non-zeros
• Baseline: 25 Mflop/s
• Best block size & performance: 16k x 32k, 72 Mflop/s
Cache Blocked SpMV on LSI Matrix: Itanium 2
• A: 10k x 255k, 3.7M non-zeros
• Baseline: 170 Mflop/s
• Best block size & performance: 16k x 65k, 275 Mflop/s
Inter-Iteration Sparse Tiling (1/3)
[Figure: dependence graph with nodes for the vector elements of x, t = Ax, and y = At]
• [Strout, et al., ’01]
• Let A be 6x6 tridiagonal
• Consider y=A^2*x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
Inter-Iteration Sparse Tiling (2/3)
[Figure: same dependence graph, with the orange region highlighted]
• [Strout, et al., ’01]
• Let A be 6x6 tridiagonal
• Consider y=A^2*x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
• Orange = everything needed to compute y1
– Reuse a11, a12
Inter-Iteration Sparse Tiling (3/3)
[Figure: same dependence graph, with orange and grey regions highlighted]
• [Strout, et al., ’01]
• Let A be 6x6 tridiagonal
• Consider y=A^2*x
– t=Ax, y=At
• Nodes: vector elements
• Edges: matrix elements aij
• Orange = everything needed to compute y1
– Reuse a11, a12
• Grey = y2, y3
– Reuse a23, a33, a43
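A minimal C sketch of the fused computation for the tridiagonal example above: t[i] needs x[i-1..i+1] and y[i-1] needs t[i-2..i], so a single pass can produce t one step ahead of y, touching each diagonal entry and each t value while it is still in cache. General sparsity patterns need the tile construction of Strout et al.; the three-diagonal storage (dl, d, du) and the function name are assumptions.

/* y = A*A*x for tridiagonal A (dl = sub-, d = main, du = super-diagonal). */
void tridiag_Asq_x(int n, const double *dl, const double *d, const double *du,
                   const double *x, double *t, double *y)
{
    for (int i = 0; i < n; i++) {
        /* t[i] = (A*x)[i] uses x[i-1], x[i], x[i+1] */
        t[i] = d[i] * x[i];
        if (i > 0)     t[i] += dl[i] * x[i - 1];
        if (i < n - 1) t[i] += du[i] * x[i + 1];

        /* y[i-1] = (A*t)[i-1] is complete as soon as t[i] is known */
        if (i > 0) {
            y[i - 1] = d[i - 1] * t[i - 1] + du[i - 1] * t[i];
            if (i > 1)
                y[i - 1] += dl[i - 1] * t[i - 2];
        }
    }
    /* last entry: y[n-1] needs only t[n-2] and t[n-1] */
    y[n - 1] = d[n - 1] * t[n - 1];
    if (n > 1)
        y[n - 1] += dl[n - 1] * t[n - 2];
}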
Inter-Iteration Sparse Tiling: Issues
[Figure: same dependence graph]
• Tile sizes (colored regions) grow with no. of iterations and increasing out-degree
– G likely to have a few nodes with high out-degree (e.g., Yahoo)
• Mathematical tricks to limit tile size?
– Judicious dropping of edges [Ng ’01]
Summary and Questions
• Need to understand matrix structure and machine
– BeBOP: suite of techniques to deal with different sparse structures
and architectures
• Google matrix problem
– Established techniques within an iteration
– Ideas for inter-iteration optimizations
– Mathematical structure of problem may help
• Questions
– Structure of G?
– What are the computational bottlenecks?
– Enabling future computations?
• E.g., topic-sensitive PageRank ⇒ multiple vector version [Haveliwala ’02]
– See www.cs.berkeley.edu/~richie/bebop/intel/google for more info,
including more complete Itanium 2 results.
Exploiting Matrix Structure
• Symmetry (numerical or structural)
– Reuse matrix entries
– Can combine with register blocking, multiple vectors, …
• Matrix splitting
– Split the matrix, e.g., into r x c and 1 x 1
– No fill overhead
• Large matrices with random structure
– E.g., Latent Semantic Indexing (LSI) matrices
– Technique: cache blocking
• Store matrix as 2^i x 2^j sparse submatrices
• Effective when x vector is large
• Currently, search to find fastest size
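A minimal C sketch of the symmetry optimization listed above: only the lower triangle (including the diagonal) is stored in CSR, and each stored off-diagonal entry a_ij updates both y[i] and y[j], roughly halving matrix traffic. Illustrative only; the tuned versions combine this with register blocking and multiple vectors.

void spmv_sym_lower(int n, const int *rowptr, const int *colind,
                    const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) y[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int j = colind[k];
            double a = val[k];
            y[i] += a * x[j];
            if (j != i)            /* transpose contribution of the stored entry */
                y[j] += a * x[i];
        }
    }
}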
Symmetric SpMV Performance: Pentium 4
SpMV with Split Matrices: Ultra 2i
Cache Blocking on Random Matrices: Itanium
Speedup on four banded
random matrices.
Sparse Kernels and Optimizations
• Kernels
– Sparse matrix-vector multiply (SpMV): y=A*x
– Sparse triangular solve (SpTS): x=T^(-1)*b
– y=AAT*x, y=ATA*x
– Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
• Optimization techniques (implementation space)
– Register blocking
– Cache blocking
– Multiple dense vectors (x)
– A has special structure (e.g., symmetric, banded, …)
– Hybrid data structures (e.g., splitting, switch-to-dense, …)
– Matrix reordering
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
Register Blocked SpMV: Pentium III
Register Blocked SpMV: Ultra 2i
Register Blocked SpMV: Power3
Register Blocked SpMV: Itanium
Possible Optimization Techniques
• Within an iteration, i.e., computing (G+uu^T)*x once
– Cache block G*x
• On linear programming matrices and matrices with random
structure (e.g., LSI), 1.5—4x speedups
• Best block size is matrix and machine dependent
– Reordering and/or splitting of G to separate dense structure
(rows, columns, blocks)
• Between iterations, e.g., (G+uu^T)^2*x
– (G+uu^T)^2*x = G^2*x + (G*u)*(u^T*x) + u*(u^T*G*x) + u*(u^T*u)*(u^T*x)
• Compute G*u, u^T*G, u^T*u once for all iterations
• G^2*x: Inter-iteration tiling to read G only once
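A minimal C sketch of the between-iteration algebra above: w = G*u, z = G^T*u (the vector u^T*G written as a column) and c = u^T*u are computed once outside the iteration loop, so each iteration costs two SpMVs with G plus a few dot products and vector updates. All names are illustrative.

/* plain CSR SpMV: y = G*x */
static void spmv_csr(int n, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            t += val[k] * x[colind[k]];
        y[i] = t;
    }
}

/* y = (G + u*u^T)^2 * x, using the expansion on the slide above. */
void rank1_squared_apply(int n, const int *rowptr, const int *colind,
                         const double *val,                 /* G in CSR          */
                         const double *u, const double *w,  /* w = G*u           */
                         const double *z, double c,         /* z = G^T*u, c=u^Tu */
                         const double *x, double *t, double *y)
{
    double ux = 0.0, zx = 0.0;
    for (int i = 0; i < n; i++) { ux += u[i] * x[i]; zx += z[i] * x[i]; }

    spmv_csr(n, rowptr, colind, val, x, t);   /* t = G*x   */
    spmv_csr(n, rowptr, colind, val, t, y);   /* y = G^2*x */

    for (int i = 0; i < n; i++)               /* add the precomputed rank-one terms */
        y[i] += w[i] * ux + u[i] * (zx + c * ux);
}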
Multiple Vector Performance: Itanium
Sparse Kernels and Optimizations
• Kernels
– Sparse matrix-vector multiply (SpMV): y=A*x
– Sparse triangular solve (SpTS): x=T^(-1)*b
– y=AAT*x, y=ATA*x
– Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
• Optimization techniques (implementation space)
– Register blocking
– Cache blocking
– Multiple dense vectors (x)
– A has special structure (e.g., symmetric, banded, …)
– Hybrid data structures (e.g., splitting, switch-to-dense, …)
– Matrix reordering
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
SpTS Performance: Itanium
(See POHLL ’02 workshop paper, at ICS ’02.)
Sparse Kernels and Optimizations
• Kernels
– Sparse matrix-vector multiply (SpMV): y=A*x
– Sparse triangular solve (SpTS): x=T^(-1)*b
– y=AAT*x, y=ATA*x
– Powers (y=A^k*x), sparse triple-product (R*A*R^T), …
• Optimization techniques (implementation space)
– Register blocking
– Cache blocking
– Multiple dense vectors (x)
– A has special structure (e.g., symmetric, banded, …)
– Hybrid data structures (e.g., splitting, switch-to-dense, …)
– Matrix reordering
• How and when do we search?
– Off-line: Benchmark implementations
– Run-time: Estimate matrix properties, evaluate performance models based on benchmark data
Optimizing AAT*x
• Kernel: y=AAT*x, where A is sparse, x & y dense
– Arises in linear programming, computation of SVD
– Conventional implementation: compute z=AT*x, y=A*z
• Elements of A can be reused:
y = [a1 … an] [a1^T; … ; an^T] x = Σ_{k=1..n} ak (ak^T x)
• When ak represent blocks of columns, can apply register
blocking.
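A minimal C sketch of the column-reuse idea above, with A stored by columns (CSC): each column ak is read once and used immediately for both the dot product ak^T*x and the update y += (ak^T*x)*ak, instead of being read once for A^T*x and again for A*z. The column-block register blocking the slide mentions is not shown; names are illustrative.

/* y = A*A^T*x, A given in CSC form (colptr, rowind, val). */
void aat_x(int nrows, int ncols, const int *colptr, const int *rowind,
           const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) y[i] = 0.0;
    for (int k = 0; k < ncols; k++) {
        double dot = 0.0;
        for (int p = colptr[k]; p < colptr[k + 1]; p++)   /* dot = ak^T * x */
            dot += val[p] * x[rowind[p]];
        for (int p = colptr[k]; p < colptr[k + 1]; p++)   /* y += dot * ak  */
            y[rowind[p]] += dot * val[p];
    }
}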
Optimized AAT*x Performance: Pentium III
Current Directions
• Applying new optimizations
– Other split data structures (variable block, diagonal, …)
– Matrix reordering to create block structure
– Structural symmetry
• New kernels (triple product R*A*R^T, powers A^k, …)
• Tuning parameter selection
• Building an automatically tuned sparse matrix library
– Extending the Sparse BLAS
– Leverage existing sparse compilers as code generation
infrastructure
– More thoughts on this topic tomorrow
Related Work
• Automatic performance tuning systems
– PHiPAC [Bilmes, et al., ’97], ATLAS [Whaley & Dongarra ’98]
– FFTW [Frigo & Johnson ’98], SPIRAL [Pueschel, et al., ’00],
UHFFT [Mirkovic and Johnsson ’00]
– MPI collective operations [Vadhiyar & Dongarra ’01]
• Code generation
– FLAME [Gunnels & van de Geijn, ’01]
– Sparse compilers: [Bik ’99], Bernoulli [Pingali, et al., ’97]
– Generic programming: Blitz++ [Veldhuizen ’98], MTL [Siek &
Lumsdaine ’98], GMCL [Czarnecki, et al. ’98], …
• Sparse performance modeling
– [Temam & Jalby ’92], [White & Saddayappan ’97], [Navarro,
et al., ’96], [Heras, et al., ’99], [Fraguela, et al., ’99], …
More Related Work
• Compiler analysis, models
– CROPS [Carter, Ferrante, et al.]; Serial sparse tiling [Strout
’01]
– TUNE [Chatterjee, et al.]
– Iterative compilation [O’Boyle, et al., ’98]
– Broadway compiler [Guyer & Lin, ’99]
– [Brewer ’95], ADAPT [Voss ’00]
• Sparse BLAS interfaces
– BLAST Forum (Chapter 3)
– NIST Sparse BLAS [Remington & Pozo ’94]; SparseLib++
– SPARSKIT [Saad ’94]
– Parallel Sparse BLAS [Filippone, et al. ’96]
Context: Creating High-Performance Libraries
• Application performance dominated by a few
computational kernels
• Today: Kernels hand-tuned by vendor or user
• Performance tuning challenges
– Performance is a complicated function of kernel,
architecture, compiler, and workload
– Tedious and time-consuming
• Successful automated approaches
– Dense linear algebra: ATLAS/PHiPAC
– Signal processing: FFTW/SPIRAL/UHFFT
Cache Blocked SpMV on LSI Matrix: Itanium
Sustainable Memory Bandwidth
Multiple Vector Performance: Pentium 4
Multiple Vector Performance: Itanium
Multiple Vector Performance: Pentium 4
Optimized AAT*x Performance: Ultra 2i
Optimized AAT*x Performance: Pentium 4
Tuning Pays Off—PHiPAC
Tuning Pays Off – ATLAS
Extends applicability of PHiPAC; Incorporated in Matlab (with rest of LAPACK)
Register Tile Sizes (Dense Matrix Multiply)
333 MHz Sun Ultra 2i
2-D slice of 3-D space;
implementations color-coded by performance in
Mflop/s
16 registers, but 2-by-3 tile
size fastest
High Precision GEMV (XBLAS)
High Precision Algorithms (XBLAS)
• Double-double (High precision word represented as pair of doubles)
– Many variations on these algorithms; we currently use Bailey’s
• Exploiting Extra-wide Registers
– Suppose s(1), …, s(n) have f-bit fractions, SUM has F>f bit fraction
– Consider following algorithm for S = Σ_{i=1..n} s(i)
• Sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)|
• SUM = 0, for i = 1 to n, SUM = SUM + s(i), end for, sum = SUM
– Theorem (D., Hida): Suppose F < 2f (less than double precision)
• If n ≤ 2^(F-f) + 1, then error ≤ 1.5 ulps
• If n = 2^(F-f) + 2, then error ≤ 2^(2f-F) ulps (can be ≥ 1)
• If n ≥ 2^(F-f) + 3, then error can be arbitrary (S ≠ 0 but sum = 0)
– Examples
• s(i) double (f=53), SUM double extended (F=64)
– accurate if n ≤ 2^11 + 1 = 2049
• Dot product of single precision x(i) and y(i)
– s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64)
– accurate if n ≤ 2^16 + 1 = 65537
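A minimal C sketch of the sorted, extra-wide-accumulator summation above, assuming an x86-style long double with a 64-bit fraction (F = 64) against double's f = 53, so by the theorem the result is accurate to about 1.5 ulps for n ≤ 2^11 + 1 = 2049. Illustrative only.

#include <stdlib.h>
#include <math.h>

static int cmp_desc_mag(const void *pa, const void *pb)
{
    double a = fabs(*(const double *)pa), b = fabs(*(const double *)pb);
    return (a < b) - (a > b);                   /* sort by decreasing magnitude */
}

double sorted_wide_sum(double *s, size_t n)
{
    qsort(s, n, sizeof(double), cmp_desc_mag);  /* |s[0]| >= |s[1]| >= ...      */
    long double sum = 0.0L;                     /* extra-wide accumulator       */
    for (size_t i = 0; i < n; i++)
        sum += s[i];
    return (double)sum;                         /* round once at the end        */
}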
More Extra Slides
Opteron (1.4 GHz, 2.8 Gflop/s peak)
Nonsymmetric peak = 504 Mflop/s
Symmetric peak = 612 Mflop/s
Beat ATLAS DGEMV’s 365 Mflop/s
Awards
• Best Paper, Intern. Conf. Parallel Processing, 2004
– “Performance models for evaluation and automatic performance tuning of
symmetric sparse matrix-vector multiply”
• Best Student Paper, Intern. Conf. Supercomputing, Workshop
on Performance Optimization via High-Level Languages and
Libraries, 2003
– Best Student Presentation too, to Richard Vuduc
– “Automatic performance tuning and analysis of sparse triangular solve”
• Finalist, Best Student Paper, Supercomputing 2002
– To Richard Vuduc
– “Performance Optimization and Bounds for Sparse Matrix-vector Multiply”
• Best Presentation Prize, MICRO-33: 3rd ACM Workshop on
Feedback-Directed Dynamic Optimization, 2000
– To Richard Vuduc
– “Statistical Modeling of Feedback Data in an Automatic Tuning System”
Accuracy of the Tuning Heuristics (4/4)
DGEMV
Can Match DGEMV Performance
Fraction of Upper Bound Across Platforms
Achieved Performance and Machine Balance
• Machine balance [Callahan ’88; McCalpin ’95]
– Balance = Peak Flop Rate / Bandwidth (flops / double)
– Lower is better (i.e., can hope to get higher fraction of peak
flop rate)
• Ideal balance for mat-vec: ≤ 2 flops / double
– For SpMV, even less
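As a worked example with hypothetical round numbers (not measurements from these slides), a machine with a 1 Gflop/s peak and 1 GB/s of sustainable bandwidth streams 125 Mdoubles/s, so

\text{Balance} = \frac{1000\ \text{Mflop/s}}{125\ \text{Mdoubles/s}} = 8\ \text{flops/double},
\qquad
2\ \tfrac{\text{flops}}{\text{double}} \times 125\ \tfrac{\text{Mdoubles}}{\text{s}} = 250\ \text{Mflop/s} = 25\%\ \text{of peak}.

In other words, a streaming kernel that does only 2 flops per double, like mat-vec, tops out around a quarter of peak on such a machine, and SpMV sits even lower because of its index overhead.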
Where Does the Time Go?
Time = Σ_i (α_i × Hits_i) + α_mem × Hits_mem   (sum over cache levels i)
• Most time assigned to memory
• Caches “disappear” when line sizes are equal
– Strictly increasing line sizes
Execution Time Breakdown: Matrix 40
Execution Time Breakdown (PAPI): Matrix 40
Speedups with Increasing Line Size
Summary: Performance Upper Bounds
• What is the best we can do for SpMV?
– Limits to low-level tuning of blocked implementations
– Refinements?
• What machines are good for SpMV?
– Partial answer: balance characterization
• Architectural consequences?
– Help to have strictly increasing line sizes
Evaluating algorithms and machines for SpMV
• Metrics
– Speedups
– Mflop/s (“fair” flops)
– Fraction of peak
• Questions
– Speedups are good, but what is “the best?”
• Independent of instruction scheduling, selection
• Can SpMV be further improved or not?
– What machines are “good” for SpMV?
Tuning Dense BLAS —PHiPAC
Tuning Dense BLAS – ATLAS
Extends applicability of PHiPAC; Incorporated in Matlab (with rest of LAPACK)
Statistical Models for Automatic Tuning
• Idea 1: Statistical criterion for stopping a search
– A general search model
• Generate implementation
• Measure performance
• Repeat
– Stop when probability of being within ε of optimal falls below
threshold
• Can estimate distribution on-line
• Idea 2: Statistical performance models
– Problem: Choose 1 among m implementations at run-time
– Sample performance off-line, build statistical model
SpMV Historical Trends: Fraction of Peak
Motivation for Automatic Performance Tuning of SpMV
• Historical trends
– Sparse matrix-vector multiply (SpMV): 10% of peak or less
• Performance depends on machine, kernel, matrix
– Matrix known at run-time
– Best data structure + implementation can be surprising
• Our approach: empirical performance modeling and
algorithm search
Accuracy of the Tuning Heuristics (3/4)
DGEMV
Accuracy of the Tuning Heuristics (3/4)
Sparse CALA Summary
• Optimizing one SpMV too limiting
• Take k steps of Krylov subspace method
– GMRES, CG, Lanczos, Arnoldi
– Assume matrix “well-partitioned,” with modest surface-to-volume
ratio
– Parallel implementation
• Conventional: O(k log p) messages
• CALA: O(log p) messages - optimal
– Serial implementation
• Conventional: O(k) moves of data from slow to fast memory
• CALA: O(1) moves of data – optimal
– Can incorporate some preconditioners
• Need to be able to “compress” interactions between distant i, j
• Hierarchical, semiseparable matrices …
– Lots of speedup possible (measured and modeled)
• Price: some redundant computation
Potential Impact on Applications: T3P
• Application: accelerator design [Ko]
• 80% of time spent in SpMV
• Relevant optimization techniques
– Symmetric storage
– Register blocking
• On Single Processor Itanium 2
– 1.68x speedup
• 532 Mflops, or 15% of 3.6 GFlop peak
– 4.4x speedup with multiple (8) vectors
• 1380 Mflops, or 38% of peak