Tuning Sparse Matrix Vector Multiplication for Multi-core SMPs

Samuel Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2),
John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)

(1) University of California, Berkeley
(2) Lawrence Berkeley National Laboratory
(3) Georgia Institute of Technology

samw@cs.berkeley.edu

Overview

- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix Vector Multiplication (SpMV) kernel
  - Important HPC kernel
  - Memory intensive
  - Challenging for multicore
- Present two autotuned threaded implementations:
  - Pthread, cache-based implementation
  - Cell local-store-based implementation
- Benchmarked performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2
  - IBM Cell Broadband Engine
- Compare with the leading MPI implementation (PETSc) combined with an autotuned serial kernel (OSKI)

Sparse Matrix Vector Multiplication

- Evaluate y = Ax
  - A is a sparse matrix: most entries are 0.0
  - x and y are dense vectors
  - Performance advantage in storing and operating on only the nonzeros
  - But this requires significant metadata
- Challenges
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity, often more than 6 bytes of memory traffic per flop (see the estimate below)

[Figure: y = A·x, with sparse matrix A and dense vectors x and y]

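As a rough estimate (assuming plain CSR with 64-bit values and 32-bit column indices): each nonzero costs 8 + 4 = 12 bytes of matrix data and contributes 2 flops (one multiply, one add), i.e. 12 / 2 = 6 bytes/flop of compulsory traffic before any source- or destination-vector traffic is counted.
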
Test Suite
Dataset (Matrices)
Multicore SMPs

Matrices Used

- Pruned the original SPARSITY suite down to 14 matrices
  - None should fit in cache
  - Rank ranges from 2K to 1M
- Subdivided them into 4 categories:
  - Dense: 2K x 2K dense matrix stored in sparse format
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  - Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
  - Extreme aspect ratio (linear programming): LP

Multicore SMP Systems

[Figure: block diagrams of the four test systems]
- Intel Clovertown: 8 Core2 cores (two quad-core sockets); each pair of cores shares a 4MB L2; two front-side buses at 10.6 GB/s each into a chipset with 4x64b memory controllers; fully buffered DDR2 DRAM at 21.3 GB/s (read) and 10.6 GB/s (write)
- AMD Opteron: 4 cores (two dual-core sockets); 1MB victim cache per core; on-chip memory controller / HyperTransport per socket; 10.6 GB/s DDR2 DRAM per socket, 4 GB/s (each direction) between sockets
- Sun Niagara2: 8 multithreaded UltraSPARC cores (one FPU and an 8KB D$ each) behind a crossbar switch; 4MB shared L2 (16-way) with 90 GB/s write-through and 179 GB/s fill bandwidth; 4x128b FBDIMM memory controllers, 42.7 GB/s (read) and 21.3 GB/s (write)
- IBM Cell Blade: two chips, each with a PPE (512KB L2) and 8 SPEs (256KB local store plus MFC each) on the EIB ring network; 25.6 GB/s XDR DRAM per chip; chips connected by the BIF at <20 GB/s in each direction

Multicore SMP Systems (memory hierarchy)

[Same block diagrams, this time highlighting the memory hierarchy of each system]

Multicore SMP Systems (cache)

[Same diagrams, annotated with aggregate on-chip cache/local-store capacity: Clovertown 16MB, Opteron 4MB, Niagara2 4MB, Cell 4MB of local store — in each case large enough that the vectors fit]

Multicore SMP Systems (peak flops)

[Same diagrams, annotated with peak flop rates: Clovertown 75 Gflop/s (w/SIMD), Opteron 17 Gflop/s, Niagara2 11 Gflop/s, Cell 29 Gflop/s (w/SIMD)]

Multicore SMP Systems (peak read bandwidth)

[Same diagrams, annotated with peak DRAM read bandwidth: Clovertown 21 GB/s, Opteron 21 GB/s, Niagara2 43 GB/s, Cell 51 GB/s]

Multicore SMP Systems (NUMA)

[Same diagrams, highlighting NUMA: the Opteron sockets and the two Cell chips each have their own DRAM, so bandwidth depends on where data is placed]

Naïve Implementation
For cache-based machines
Included a median performance number

Vanilla C Performance

- Vanilla C implementation
- Matrix stored in CSR (compressed sparse row) format (a reference CSR kernel is sketched below)
- Explored compiler options; only the best is presented here

[Figure: naive serial SpMV performance per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2; y-axes run 0 to 1.0 GFlop/s]

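For reference, a minimal sketch of the kind of CSR kernel the vanilla C baseline corresponds to (the actual benchmark source is not reproduced here; array names are illustrative):

```c
/* Minimal CSR (compressed sparse row) SpMV: y = A*x.
 * row_ptr has nrows+1 entries; col_idx and vals hold the nonzeros. */
void spmv_csr(int nrows,
              const int *row_ptr, const int *col_idx, const double *vals,
              const double *x, double *y)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}
```
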
Pthread Implementation
- Optimized for multicore/threading
- A variety of shared-memory programming models would be acceptable (not just Pthreads)
- More colors = more optimizations = more work

Parallelization

- Matrix partitioned by rows and balanced by the number of nonzeros
- SPMD-like approach
- A barrier() is called before and after the SpMV kernel
- Each submatrix is stored separately in CSR
- Load balancing can be challenging
- Number of threads explored in powers of 2 (in the paper)
- A sketch of the partitioning and threading scheme follows below

[Figure: matrix A partitioned into horizontal row blocks, one per thread; all threads read the shared source vector x and write disjoint pieces of y]

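A minimal sketch of the row partitioning and SPMD threading described above, assuming the CSR arrays from the earlier sketch; NTHREADS and the struct layout are illustrative, and the real implementation stores each thread's submatrix separately:

```c
#include <pthread.h>

#define NTHREADS 8                      /* illustrative thread count */

static pthread_barrier_t bar;           /* pthread_barrier_init(&bar, NULL, NTHREADS) at startup */

typedef struct {
    int r0, r1;                         /* this thread's row range [r0, r1) */
    /* ... pointers to the thread's CSR submatrix, x, and y ... */
} part_t;

/* Partition rows so each thread owns roughly nnz/NTHREADS nonzeros. */
void partition_by_nnz(int nrows, const int *row_ptr, part_t *parts)
{
    long nnz = row_ptr[nrows];
    long target = (nnz + NTHREADS - 1) / NTHREADS;
    int r = 0;
    for (int t = 0; t < NTHREADS; t++) {
        parts[t].r0 = r;
        long count = 0;
        while (r < nrows && count < target) {
            count += row_ptr[r + 1] - row_ptr[r];
            r++;
        }
        parts[t].r1 = r;
    }
    parts[NTHREADS - 1].r1 = nrows;     /* last thread absorbs any remainder */
}

/* Thread body: barrier, multiply the local row block, barrier. */
void *spmv_thread(void *arg)
{
    part_t *p = (part_t *)arg;
    (void)p;                            /* submatrix pointers omitted in this sketch */
    pthread_barrier_wait(&bar);
    /* spmv_csr(...) over rows [p->r0, p->r1) of the local submatrix */
    pthread_barrier_wait(&bar);
    return NULL;
}
```
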
Naïve Parallel Performance

[Figure: naive parallel (pthreads) vs. naive single-thread SpMV performance per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2; y-axes run 0 to 3.0 GFlop/s]
- Sun Niagara2: 64x threads = 41x performance (25% of peak flops, 39% of bandwidth)
- Intel Clovertown: 8x cores = 1.9x performance (1.4% of peak flops, 29% of bandwidth)
- AMD Opteron: 4x cores = 1.5x performance (4% of peak flops, 20% of bandwidth)

Case for Autotuning

- How do we deliver good performance across all these architectures and all these matrices without exhaustively hand-optimizing every combination?
- Autotuning
  - Write a Perl script that generates all possible optimizations
  - Heuristically or exhaustively search the generated optimizations
  - Existing serial SpMV solution: OSKI (developed at UCB)
- This work:
  - Optimizations geared for multicore / multithreading
  - Generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.
  - A "prototype for parallel OSKI"
- A sketch of the search loop appears below

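In spirit, the search phase of such an autotuner reduces to benchmarking every generated variant and keeping the fastest; the sketch below uses hypothetical names (kernel_fn, time_spmv) standing in for the generated kernels and timing harness:

```c
/* Sketch of an exhaustive autotuning search over generated kernel variants. */
typedef void (*kernel_fn)(const void *A, const double *x, double *y);

kernel_fn pick_best(kernel_fn *variants, int nvariants,
                    const void *A, const double *x, double *y,
                    double (*time_spmv)(kernel_fn, const void *,
                                        const double *, double *))
{
    kernel_fn best = variants[0];
    double best_time = time_spmv(best, A, x, y);
    for (int i = 1; i < nvariants; i++) {
        double t = time_spmv(variants[i], A, x, y);
        if (t < best_time) {
            best_time = t;
            best = variants[i];
        }
    }
    return best;                        /* fastest variant for this matrix/machine */
}
```
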
Exploiting NUMA, Affinity

- Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data

[Figure: three data-placement scenarios on the dual-socket Opteron — a single thread, multiple threads served by one memory controller, and multiple threads using both memory controllers]

- Bind each submatrix and the thread that processes it together
- Explored libnuma, Linux, and Solaris routines
- Adjacent blocks bound to adjacent cores
- A sketch of the binding appears below

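A minimal sketch of the binding idea on Linux, using libnuma and pthread affinity (the core-to-block mapping and error handling are omitted; Solaris uses different routines):

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sched.h>
#include <pthread.h>
#include <numa.h>            /* link with -lnuma */

/* Allocate a thread's submatrix on the DRAM attached to the core
 * that will process it, so its streaming traffic stays local. */
void *alloc_near_core(size_t bytes, int core)
{
    return numa_alloc_onnode(bytes, numa_node_of_cpu(core));
}

/* Pin the calling thread to its core so the placement above holds. */
void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```
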
Performance (+NUMA)

[Figure: SpMV performance per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked bars show naive single thread, naive pthreads, and +NUMA/affinity; y-axes run 0 to 3.0 GFlop/s]

Performance (+SW Prefetching)

[Figure: as above, with the +software prefetching optimization added to the stacked bars]

Matrix Compression

- For memory-bound kernels, minimizing memory traffic should maximize performance
- Compress the metadata
- Exploit structure to eliminate metadata
- Heuristic: select the compression that minimizes the matrix size (see the size estimate sketched below):
  - Power-of-2 register blocking
  - CSR/COO format
  - 16b/32b indices
  - etc.
- Side effect: the matrix may shrink to the point where it fits entirely in cache

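A sketch of the size estimate behind the heuristic: for each candidate register-block shape and index width, compute the bytes the blocked matrix would occupy and keep the smallest (names are illustrative; nblocks must already account for the explicit zeros that blocking fills in):

```c
#include <stddef.h>
#include <stdint.h>

/* Storage footprint of an r x c register-blocked CSR (BCSR) matrix:
 * r*c doubles per block (including filled-in zeros), one block-column
 * index of idx_bytes (2 for 16b, 4 for 32b), and 32b block-row pointers. */
size_t bcsr_bytes(size_t nblocks, size_t nblockrows,
                  int r, int c, size_t idx_bytes)
{
    return nblocks * (size_t)(r * c) * sizeof(double)  /* block values       */
         + nblocks * idx_bytes                         /* block col indices  */
         + (nblockrows + 1) * sizeof(uint32_t);        /* block row pointers */
}
```

Since SpMV here is memory bound, minimizing this footprint is a reasonable proxy for minimizing runtime; the same accounting extends to the COO variant.
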
Performance (+Matrix Compression)

[Figure: as above, with the +matrix compression optimization added; y-axes now run 0 to 6.0 GFlop/s]

Cache and TLB Blocking

- Accesses to the matrix and destination vector are streaming
- But access to the source vector can be random
- Reorganize the matrix (and thus the access pattern) to maximize reuse
- Applies equally to TLB blocking (caching PTEs)
- Heuristic: block the destination, then keep adding columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity; apply all previous optimizations individually to each cache block
- Search: neither, cache, cache & TLB
- Better locality, at the expense of confusing the hardware prefetchers
- A sketch of the column-span selection step follows below

[Figure: matrix A tiled into cache blocks; each block touches only a bounded span of the source vector x]

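A sketch of the column-span selection step: for a fixed block of destination rows, sweep the column boundary rightward and stop just before the number of distinct source-vector cache lines touched exceeds a capacity budget. Here col_used[] is assumed to be precomputed from the row block's nonzeros, the line size and budget are illustrative parameters, and the TLB variant counts pages instead of lines:

```c
#define DOUBLES_PER_LINE 8     /* e.g. a 64-byte cache line holds 8 doubles */

/* col_used[c] != 0 iff some nonzero of the current row block lies in
 * column c.  Returns the exclusive upper column bound of the cache block. */
int choose_block_width(const unsigned char *col_used, int ncols, int budget)
{
    int lines = 0, last_line = -1;
    for (int c = 0; c < ncols; c++) {
        if (!col_used[c])
            continue;
        int line = c / DOUBLES_PER_LINE;
        if (line != last_line) {                 /* first touch of this x line */
            if (lines + 1 > budget)
                return line * DOUBLES_PER_LINE;  /* stop at the line boundary */
            lines++;
            last_line = line;
        }
    }
    return ncols;                                /* whole row block fits in budget */
}
```
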
Performance (+Cache/TLB Blocking)

[Figure: as above, with the +cache/TLB blocking optimization added]

Banks, Ranks, and DIMMs

- In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
- Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
- Addressing/bank conflicts become increasingly likely
- Adding more DIMMs and reconfiguring ranks can help
- The Clovertown system was already fully populated

Performance (more DIMMs, rank configuration, etc.)

[Figure: fully autotuned pthreads SpMV performance per matrix on Intel Clovertown, AMD Opteron, and Sun Niagara2; stacked bars show naive single thread, naive pthreads, +NUMA/affinity, +software prefetching, +matrix compression, +cache/TLB blocking, and +more DIMMs/rank configuration; y-axes run 0 to 6.0 GFlop/s]
- Intel Clovertown: 4% of peak flops, 52% of bandwidth
- AMD Opteron: 20% of peak flops, 66% of bandwidth
- Sun Niagara2: 52% of peak flops, 54% of bandwidth
- Depending on the architecture, 2, 3, or 4 of the optimizations above proved essential

Cell Implementation
Comments
Performance

Cell Implementation

- No vanilla C implementation (aside from the PPE)
  - Even SIMDized double precision is extremely weak
  - Scalar double precision is unbearable
- Minimum register blocking is 2x1 (SIMDizable); a plain-C sketch appears below
  - Can increase memory traffic by 66%
- The cache blocking optimization becomes local-store blocking
  - Spatial and temporal locality is captured by software when the matrix is optimized
  - In essence, the high bits of the column indices are grouped into DMA lists
- No branch prediction
  - Replace branches with conditional operations
- In some cases, optimizations that were optional on cache-based machines are requirements for correctness on Cell
- Despite the performance, Cell is still handicapped by double precision

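A plain-C rendering of the 2x1 register-blocking idea (a sketch only: the actual Cell kernel expresses this with SPE SIMD intrinsics, double-buffered DMA into the local store, and branchless control):

```c
/* BCSR with 2x1 blocks: two vertically adjacent nonzeros share one column
 * index, so a single x element feeds a SIMD-width (2-wide) multiply-add. */
void spmv_bcsr_2x1(int nblockrows,
                   const int *brow_ptr,     /* nblockrows+1 entries       */
                   const int *bcol_idx,     /* one column index per block */
                   const double *bvals,     /* two values per block       */
                   const double *x, double *y)
{
    for (int br = 0; br < nblockrows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = brow_ptr[br]; k < brow_ptr[br + 1]; k++) {
            double xv = x[bcol_idx[k]];     /* one gather, used twice */
            y0 += bvals[2 * k]     * xv;
            y1 += bvals[2 * k + 1] * xv;
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```
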
Performance

[Figure: fully autotuned SpMV performance per matrix on Intel Clovertown, AMD Opteron, Sun Niagara2, and the IBM Cell Broadband Engine (1, 8, and 16 SPEs); y-axes run 0 to 12.0 GFlop/s]
- IBM Cell: 39% of peak flops, 89% of bandwidth

Multicore MPI Implementation
This is the default approach to programming multicore

Multicore MPI Implementation

- Used PETSc with shared-memory MPICH
- Used OSKI (developed at UCB) to optimize the kernel within each MPI task
- The result is a highly optimized MPI baseline

[Figure: SpMV performance per matrix on Intel Clovertown and AMD Opteron, comparing naive single thread, autotuned MPI (PETSc+OSKI), and autotuned Pthreads; y-axes run 0 to 4.0 GFlop/s]

Summary

Median Performance & Efficiency

- Used a digital power meter to measure sustained system power
- FBDIMM drives up Clovertown and Niagara2 power
- Power efficiency is sustained MFlop/s divided by sustained Watts
- The default approach (MPI) achieves very low performance and efficiency

[Figure: median SpMV performance (GFlop/s, 0-7) and median power efficiency (MFlop/s/Watt, 0-24) for Intel Clovertown, AMD Opteron, Sun Niagara2, and IBM Cell, comparing Naïve, PETSc+OSKI (autotuned MPI), and Pthreads (autotuned)]

Summary

- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
- Most machines achieved no more than 50-60% of DRAM bandwidth
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency
  - 90% of memory bandwidth
  - High power efficiency
  - Easily understood performance
  - Extra traffic = lower performance (future work can address this)
- The multicore-specific autotuned implementation significantly outperformed a state-of-the-art MPI implementation
  - Matrix compression geared towards multicore
  - NUMA-aware allocation
  - Prefetching

Acknowledgments

- UC Berkeley
  - RADLab cluster (Opterons)
  - PSI cluster (Clovertowns)
- Sun Microsystems
  - Niagara2
- Forschungszentrum Jülich
  - Cell blade cluster

Questions?