Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms

Samuel Williams1,2, Jonathan Carter2, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2
1University of California, Berkeley
2Lawrence Berkeley National Laboratory
samw@eecs.berkeley.edu
Motivation
 Multicore is the de facto solution for improving peak performance for the next decade
 How do we ensure this applies to sustained performance as well?
 Processor architectures are extremely diverse, and compilers can rarely fully exploit them
 Require a HW/SW solution that guarantees performance without completely sacrificing productivity
Overview
 Examined the Lattice-Boltzmann Magneto-hydrodynamic (LBMHD) application
 Present and analyze two threaded & auto-tuned implementations
 Benchmarked performance across 5 diverse multicore microarchitectures:
   - Intel Xeon (Clovertown)
   - AMD Opteron (rev.F)
   - Sun Niagara2 (Huron)
   - IBM QS20 Cell Blade (PPEs)
   - IBM QS20 Cell Blade (SPEs)
 We show:
   - Auto-tuning can significantly improve application performance
   - Cell consistently delivers good performance and efficiency
   - Niagara2 delivers good performance and productivity
Multicore SMPs used
Multicore SMP Systems

[Block diagrams of the four SMPs:
 - Intel Xeon (Clovertown): 2 sockets x 4 Core2 cores, 4MB shared L2 per core pair, two FSBs (10.66 GB/s each) into a chipset with 4x64b memory controllers, 667MHz FBDIMMs (21.3 GB/s read, 10.6 GB/s write)
 - AMD Opteron (rev.F): 2 sockets x 2 cores, 1MB victim cache per core, SRI/crossbar, one 128b memory controller per socket to 667MHz DDR2 DIMMs (10.6 GB/s each), HyperTransport at 4 GB/s per direction between sockets
 - Sun Niagara2 (Huron): 8 multithreaded SPARC cores with 8K L1s, crossbar switch (179 GB/s fill, 90 GB/s write-through), 4MB 16-way shared L2 interleaved across 8x64B banks, 4x128b memory controllers (2 banks each) to 667MHz FBDIMMs (42.66 GB/s read, 21.33 GB/s write)
 - IBM QS20 Cell Blade: 2 sockets, each a PPE with 512KB L2 plus 8 SPEs with 256K local stores and MFCs on the EIB ring network, 25.6 GB/s XDR DRAM (512MB) per socket, BIF at <20 GB/s per direction between sockets]
Multicore SMP Systems
(memory hierarchy)

[Same block diagrams, highlighting the memory hierarchies: the 4MB shared L2s on Clovertown, the per-core 1MB victim caches on Opteron, the banked 4MB shared L2 on Niagara2, the 512KB PPE L2 and 256K SPE local stores on Cell, and their FBDIMM / DDR2 / XDR memory interfaces]
Multicore SMP Systems
(peak flops)

[Same block diagrams, annotated with peak double-precision flop rates: Intel Xeon (Clovertown) 75 Gflop/s, AMD Opteron (rev.F) 17 Gflop/s, Sun Niagara2 (Huron) 11 Gflop/s, IBM QS20 Cell Blade 13 Gflop/s (PPEs) / 29 Gflop/s (SPEs)]
Multicore SMP Systems
(peak DRAM bandwidth)

[Same block diagrams, annotated with peak DRAM bandwidth: Intel Xeon (Clovertown) 21 GB/s read + 10 GB/s write, AMD Opteron (rev.F) 21 GB/s, Sun Niagara2 (Huron) 42 GB/s read + 21 GB/s write, IBM QS20 Cell Blade 51 GB/s]
Auto-tuning
Auto-tuning

 Hand-optimizing each architecture/dataset combination is not feasible
 Our auto-tuning approach finds a good-performance solution by a combination of heuristics and exhaustive search (a sketch of the timing/search loop follows):
   - A Perl script generates many possible kernels (including SIMD-optimized kernels)
   - An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/…
   - Performance depends on the optimizations generated
   - Heuristics are often desirable when the search space isn't tractable
 Proven value in Dense Linear Algebra (ATLAS), Spectral methods (FFTW, SPIRAL), and Sparse methods (OSKI)
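The following is a minimal sketch, not the authors' harness, of the auto-tuning benchmark loop described above: time each generated kernel variant on the current machine and keep the fastest. The names (kernel_fn, time_kernel, pick_best) are illustrative stand-ins for the Perl-generated variants.

    #include <time.h>

    typedef void (*kernel_fn)(void);           /* one entry per generated code variant */

    static double time_kernel(kernel_fn k, int trials) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < trials; i++) k();  /* run the candidate kernel             */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    /* Exhaustively search the generated variants; heuristics would prune this
       list when the search space is not tractable. */
    int pick_best(kernel_fn *candidates, int n) {
        int best = 0;
        double best_t = time_kernel(candidates[0], 3);
        for (int i = 1; i < n; i++) {
            double t = time_kernel(candidates[i], 3);
            if (t < best_t) { best_t = t; best = i; }
        }
        return best;                           /* index of the variant to use here     */
    }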
Introduction to LBMHD
Introduction to Lattice Methods

 Structured grid code, with a series of time steps
 Popular in CFD (allows for complex boundary conditions)
 Overlay a higher-dimensional phase space:
   - Simplified kinetic model that maintains the macroscopic quantities
   - Distribution functions (e.g. 5-27 velocities per point in space) are used to reconstruct macroscopic quantities
   - Significant memory capacity requirements

[Figure: 27-point 3D velocity lattice, components numbered 0-26, shown along the +X, +Y, +Z axes]
LBMHD
(general characteristics)

 Plasma turbulence simulation
 Couples CFD with Maxwell's equations
 Two distributions:
   - momentum distribution (27 scalar velocities)
   - magnetic distribution (15 vector velocities)
 Three macroscopic quantities:
   - Density
   - Momentum (vector)
   - Magnetic Field (vector)

[Figure: the momentum distribution (27 scalar velocities), the magnetic distribution (15 vector velocities), and the macroscopic variables, each on the 3D lattice]
LBMHD
(flops and bytes)

 Must read 73 doubles and update 79 doubles per point in space (minimum ~1200 bytes)
 Requires about 1300 floating-point operations per point in space
 Flop:byte ratio (the arithmetic is sketched below):
   - 0.71 (write-allocate architectures)
   - 1.07 (ideal)
 Rule of thumb for LBMHD: architectures with more flops than bandwidth are likely memory bound (e.g. Clovertown)
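As a small sanity check of the ratios above, a hedged worked calculation (the counts are taken from this slide; the program itself is illustrative): 73 + 79 doubles is 1216 bytes of compulsory traffic per point, and write-allocate caches read each written line once more before storing it.

    #include <stdio.h>

    int main(void) {
        double read_bytes  = 73 * 8.0;   /* doubles read per point              */
        double write_bytes = 79 * 8.0;   /* doubles updated per point           */
        double flops       = 1300.0;     /* ~flops per point                    */

        /* ideal: every byte crosses the memory bus exactly once (~1.07)        */
        double ideal = flops / (read_bytes + write_bytes);

        /* write-allocate: written lines are also read before being stored,
           so write traffic is effectively doubled (~0.71)                      */
        double write_allocate = flops / (read_bytes + 2.0 * write_bytes);

        printf("ideal          flop:byte = %.2f\n", ideal);
        printf("write-allocate flop:byte = %.2f\n", write_allocate);
        return 0;
    }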
LBMHD
(implementation details)

 Data structure choices (both layouts are sketched below):
   - Array of Structures: no spatial locality, strided access
   - Structure of Arrays: huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well
 Parallelization:
   - The Fortran version used MPI to communicate between tasks = bad match for multicore
   - The version in this work uses pthreads for multicore, and MPI for inter-node
   - MPI is not used when auto-tuning
 Two problem sizes:
   - 64^3 (~330MB)
   - 128^3 (~2.5GB)
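A hedged sketch of the two layouts discussed above, for the 27-velocity momentum distribution; the type names and the 64^3 grid constants are illustrative, not taken from the authors' code.

    #define NX 64
    #define NY 64
    #define NZ 64

    /* Array of Structures: all 27 velocities of one point are adjacent, so
       sweeping any single velocity across the grid is strided by the struct
       size -- little spatial locality. */
    struct point_aos { double f[27]; };
    typedef struct point_aos aos_lattice[NZ][NY][NX];

    /* Structure of Arrays: one contiguous grid per velocity component --
       unit-stride, SIMD-friendly accesses, at the cost of dozens of
       concurrent memory streams per thread. */
    typedef double soa_lattice[27][NZ][NY][NX];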
Stencil for Lattice Methods

 Very different from the canonical heat-equation stencil:
   - There are multiple read and write arrays
   - There is no reuse (a minimal streaming sketch follows)

[Figure: distinct read_lattice[ ][ ] and write_lattice[ ][ ] arrays]
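A minimal sketch of the streaming structure implied above, with illustrative names and no boundary handling: every velocity component is read from one lattice and written to a distinct one, so nothing written is reused within the sweep.

    /* Shift each velocity component by its lattice offset from the read
       lattice into the write lattice; off[v] is an illustrative linearized
       displacement for velocity v, and halo handling is omitted. */
    void stream(int n, int nv, const int off[],
                double *const read_lattice[], double *const write_lattice[]) {
        for (int v = 0; v < nv; v++)
            for (int i = 0; i < n; i++)
                write_lattice[v][i] = read_lattice[v][i + off[v]];
    }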
Side Note on Performance Graphs

 Threads are mapped first to cores, then sockets (i.e. multithreading, then multicore, then multisocket)
 Niagara2 always used 8 threads/core
 We show two problem sizes
 We'll step through performance as optimizations/features are enabled within the auto-tuner
 More colors implies more optimizations were necessary
 This allows us to compare architecture performance while keeping programmer effort (productivity) constant
Performance and Analysis of Pthreads Implementation
Pthread Implementation

 Not naïve:
   - fully unrolled loops
   - NUMA-aware
   - 1D parallelization
 Always used 8 threads per core on Niagara2
 1P Niagara2 is faster than the 2P x86 machines

[Bar charts: GFlop/s vs. concurrency (1-8 cores) for the 64^3 and 128^3 problems on Intel Xeon (Clovertown), AMD Opteron (rev.F), Sun Niagara2 (Huron), and IBM Cell Blade (PPEs)]
Pthread Implementation

 Not naïve:
   - fully unrolled loops
   - NUMA-aware
   - 1D parallelization
 Always used 8 threads per core on Niagara2
 1P Niagara2 is faster than the 2P x86 machines

[Same charts, annotated with sustained fractions of peak: Clovertown 4.8% of peak flops / 17% of bandwidth; Opteron (rev.F) 14% of peak flops / 17% of bandwidth; Niagara2 54% of peak flops / 14% of bandwidth; Cell PPEs 1% of peak flops / 0.3% of bandwidth]
Initial Pthread Implementation

 Not naïve:
   - fully unrolled loops
   - NUMA-aware
   - 1D parallelization
 Always used 8 threads per core on Niagara2
 1P Niagara2 is faster than the 2P x86 machines
 Performance degradation despite improved surface-to-volume ratio

[Same charts; callouts mark the panels where performance degrades despite the improved surface-to-volume ratio]
Cache effects

 Want to maintain a working set of velocities in the L1 cache
 150 arrays, each trying to keep at least 1 cache line
 Impossible with Niagara2's 1KB/thread L1 working set = capacity misses
 On the other architectures, the combination of:
   - low-associativity L1 caches (2-way on Opteron)
   - large numbers of arrays
   - near-power-of-2 problem sizes
   can result in large numbers of conflict misses
 Solution: apply a lattice-offset-aware padding heuristic to the velocity arrays to avoid/minimize conflict misses (sketched below)
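A hedged sketch of such a padding heuristic, not the authors' exact code: each velocity component gets a small pad that depends on the component index and its stencil offset, so the ~150 streams map to different L1 sets instead of colliding at near-power-of-two problem sizes. All names and constants here are illustrative.

    #include <stdlib.h>

    typedef struct {
        double *base;   /* pointer returned by malloc (kept for free())       */
        double *data;   /* padded start address the kernel indexes from       */
    } component_t;

    component_t alloc_component(size_t n_points, int component, long stencil_offset) {
        /* stagger components by whole cache lines (8 doubles = 64 B),
           wrapping within 1 KB so the pad stays small                        */
        size_t pad = (size_t)(((component * 8L - stencil_offset) % 128 + 128) % 128);
        component_t c;
        c.base = malloc((n_points + pad) * sizeof(double));
        c.data = c.base ? c.base + pad : NULL;
        return c;
    }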
Auto-tuned Performance
(+Stencil-aware Padding)

 This lattice method is essentially 79 simultaneous 72-point stencils
 Can cause conflict misses even with highly associative L1 caches (not to mention Opteron's 2-way)
 Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache

[Bar charts: GFlop/s vs. concurrency for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and Cell PPEs; legend: Naïve+NUMA, +Padding]
Blocking for the TLB

 Touching 150 different arrays will thrash TLBs with fewer than 128 entries
 Try to maximize TLB page locality
 Solution: borrow a technique from compilers for vector machines (sketched below):
   - Fuse spatial loops
   - Strip-mine into vectors of size vector length (VL)
   - Interchange spatial and velocity loops
 Can be generalized by varying:
   - the number of velocities simultaneously accessed
   - the number of macroscopics / velocities simultaneously updated
 Has the side benefit of expressing more ILP and DLP (SIMDization) and a cleaner loop structure, at the cost of increased L1 cache traffic
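A minimal sketch of the restructured loop nest described above, assuming a hypothetical SoA lattice with nv velocity components and n fused spatial points; tmp[] is an illustrative VL-point working buffer.

    #define VL 64   /* vector length: the actual value is what the auto-tuner searches for */

    void collision_vectorized(int n, int nv, const double *const f[],
                              double tmp[][VL]) {
        for (int i0 = 0; i0 < n; i0 += VL) {       /* strip-mined, fused spatial loop */
            int len = (i0 + VL < n) ? VL : n - i0;
            for (int v = 0; v < nv; v++)           /* velocity loop outside the strip */
                for (int i = 0; i < len; i++)      /* unit-stride, page-friendly      */
                    tmp[v][i] = f[v][i0 + i];      /* gather one velocity's strip     */
            /* ... perform the collision() update on the VL-point working set in tmp ... */
        }
    }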
Multicore SMP Systems
(TLB organization)

[Same block diagrams, annotated with data-TLB organization: Intel Xeon (Clovertown) 16 entries, 4KB pages; AMD Opteron (rev.F) 32 entries, 4KB pages; Sun Niagara2 (Huron) 128 entries, 4MB pages; IBM QS20 Cell Blade: PPEs 1024 entries, SPEs 256 entries, 4KB pages]
Cache / TLB Tug-of-War

 For cache locality we want a small VL
 For TLB page locality we want a large VL
 Each architecture has a different balance between these two forces

[Diagram: VL on a line between L1 size / 1200 (the cache-friendly upper bound) and page size / 8 (the TLB-friendly lower bound), with the L1 miss penalty and the TLB miss penalty pulling from either side]

 Solution: auto-tune to find the optimal VL (bounds sketched below)
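A small sketch of the bounds implied by the diagram above; the L1 and page sizes are example inputs, not measurements. The two constraints generally conflict, which is why the vector length is left to the auto-tuner.

    #include <stdio.h>

    int main(void) {
        long l1_bytes   = 32 * 1024;       /* example L1 size                      */
        long page_bytes = 4 * 1024;        /* example page size                    */

        long vl_cache = l1_bytes / 1200;   /* cache locality: ~1200 B per point    */
        long vl_tlb   = page_bytes / 8;    /* TLB locality: 8 B per double         */

        /* Typically vl_tlb > vl_cache, so no VL satisfies both; the auto-tuner
           sweeps power-of-two VLs between the two bounds and keeps the fastest. */
        printf("cache-friendly VL <= %ld, TLB-friendly VL >= %ld\n", vl_cache, vl_tlb);
        return 0;
    }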
Auto-tuned Performance
(+Vectorization)

 Each update requires touching ~150 components, each likely to be on a different page
 TLB misses can significantly impact performance
 Solution: vectorization
   - Fuse spatial loops, strip-mine into vectors of size VL, and interchange with the phase-dimensional loops
 Auto-tune: search for the optimal vector length
 Significant benefit on some architectures
 Becomes irrelevant when bandwidth dominates performance

[Bar charts: GFlop/s vs. concurrency for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and Cell PPEs; legend: Naïve+NUMA, +Padding, +Vectorization]
Auto-tuned Performance
(+Explicit Unrolling/Reordering)

 Give the compilers a helping hand with the complex loops
 Code generator: Perl script generates all power-of-2 possibilities (an illustrative unrolled loop follows)
 Auto-tune: search for the best unrolling and expression of data-level parallelism
 Is essential when using SIMD intrinsics

[Bar charts: GFlop/s vs. concurrency for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and Cell PPEs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling]
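An illustrative example, not taken from the generated code, of the kind of inner loop the Perl generator might emit for one power-of-two unrolling: 2-way unrolling with independent accumulators exposes the ILP/DLP the compilers otherwise miss. The names w[], f[], and feq[] are hypothetical.

    /* 2-way unrolled reduction over the 27 momentum velocities for one strip
       of vl points (vl assumed even). */
    void reduce_unroll2(int vl, const double w[27], double *const f[27], double *feq) {
        for (int i = 0; i < vl; i += 2) {
            double a0 = 0.0, a1 = 0.0;          /* independent dependency chains */
            for (int v = 0; v < 27; v++) {
                a0 += w[v] * f[v][i];
                a1 += w[v] * f[v][i + 1];
            }
            feq[i]     = a0;
            feq[i + 1] = a1;
        }
    }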
Auto-tuned Performance
(+Software prefetching)

 Expanded the code generator to insert software prefetches in case the compiler doesn't
 Auto-tune:
   - no prefetch
   - prefetch 1 line ahead
   - prefetch 1 vector ahead
 Relatively little benefit for relatively little work

[Bar charts: GFlop/s vs. concurrency for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and Cell PPEs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching]
Auto-tuned Performance
(+Software prefetching)

 Expanded the code generator to insert software prefetches in case the compiler doesn't
 Auto-tune: no prefetch, prefetch 1 line ahead, or prefetch 1 vector ahead (a prefetch sketch follows)
 Relatively little benefit for relatively little work

[Same charts, annotated with sustained fractions of peak: Clovertown 6% of peak flops / 22% of bandwidth; Opteron (rev.F) 32% of peak flops / 40% of bandwidth; Niagara2 59% of peak flops / 15% of bandwidth; Cell PPEs 10% of peak flops / 3.7% of bandwidth]
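A hedged sketch of the prefetch variants the tuner chooses among (none, one cache line ahead, one vector ahead); the copy loop and the distance parameter are illustrative, not the authors' generated code.

    #include <xmmintrin.h>

    /* dist is the prefetch distance in doubles: 0 = no prefetch, 8 = one 64 B
       line ahead, VL = one vector ahead. */
    void copy_strip_with_prefetch(int vl, int dist, const double *src, double *dst) {
        for (int i = 0; i < vl; i++) {
            if (dist) _mm_prefetch((const char *)&src[i + dist], _MM_HINT_T0);
            dst[i] = src[i];                    /* the real data stream */
        }
    }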
Auto-tuned Performance
(+SIMDization, including non-temporal stores)

 Compilers (gcc & icc) failed at exploiting SIMD
 Expanded the code generator to use SIMD intrinsics
 Explicit unrolling/reordering was extremely valuable here
 Exploited movntpd to minimize memory traffic (the only hope if memory bound; a sketch follows)
 Significant benefit for significant work

[Bar charts: GFlop/s vs. concurrency for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and Cell PPEs; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization]
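A minimal SSE2 sketch of the technique named above: hand-SIMDized arithmetic with movntpd (non-temporal) stores so the destination is not read in through the cache. The function and its arguments are illustrative; both pointers are assumed 16-byte aligned and n even.

    #include <emmintrin.h>

    void scale_stream(int n, double alpha, const double *src, double *dst) {
        __m128d va = _mm_set1_pd(alpha);
        for (int i = 0; i < n; i += 2) {
            __m128d v = _mm_mul_pd(va, _mm_load_pd(&src[i]));
            _mm_stream_pd(&dst[i], v);   /* movntpd: bypass the cache and avoid the
                                            write-allocate read of the destination */
        }
        _mm_sfence();                    /* order the streamed stores              */
    }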
Auto-tuned Performance
(+SIMDization, including non-temporal stores)

 Compilers (gcc & icc) failed at exploiting SIMD
 Expanded the code generator to use SIMD intrinsics
 Explicit unrolling/reordering was extremely valuable here
 Exploited movntpd to minimize memory traffic (the only hope if memory bound)
 Significant benefit for significant work

[Same charts, annotated with sustained fractions of peak: Clovertown 7.5% of peak flops / 18% of bandwidth; Opteron (rev.F) 42% of peak flops / 35% of bandwidth; Niagara2 59% of peak flops / 15% of bandwidth; Cell PPEs 10% of peak flops / 3.7% of bandwidth]
Auto-tuned Performance
(+SIMDization, including non-temporal stores)

 Compilers (gcc & icc) failed at exploiting SIMD
 Expanded the code generator to use SIMD intrinsics
 Explicit unrolling/reordering was extremely valuable here
 Exploited movntpd to minimize memory traffic (the only hope if memory bound)
 Significant benefit for significant work

[Same charts, annotated with speedup from auto-tuning over the Naïve+NUMA baseline: Clovertown 1.6x, Opteron (rev.F) 4.3x, Niagara2 1.5x, Cell PPEs 10x]
Performance and Analysis of Cell Implementation
Cell Implementation

 Double-precision implementation
 DP will severely hamper performance
 Vectorized, double-buffered, but not auto-tuned
 No NUMA optimizations
 No unrolling
 VL is fixed
 Straight to SIMD intrinsics
 Prefetching replaced by DMA list commands (double buffering sketched below)
 Only collision() was implemented
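A hedged sketch of SPE double buffering with MFC DMA, the mechanism referred to above (the real implementation uses DMA list commands); the chunk size, tag usage, and the commented-out process_chunk() are illustrative assumptions.

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096                                 /* bytes per DMA transfer        */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    void stream_in(uint64_t ea, int nchunks) {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prime the first transfer      */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                       /* start the next DMA early      */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);              /* wait only for the current tag */
            mfc_read_tag_status_all();
            /* process_chunk(buf[cur]);  compute overlaps with the in-flight DMA        */
            cur = nxt;
        }
    }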
Auto-tuned Performance
(Local Store Implementation)

 First attempt at a Cell implementation
 VL, unrolling, and reordering fixed
 No NUMA
 Exploits DMA and double buffering to load vectors
 Straight to SIMD intrinsics
 Despite the relative performance, Cell's DP implementation severely impairs performance

[Bar charts: GFlop/s vs. concurrency (up to 16) for the 64^3 and 128^3 problems on Clovertown, Opteron (rev.F), Niagara2, and IBM Cell Blade (SPEs)*; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization]
*collision() only
Auto-tuned Performance
(Local Store Implementation)

 First attempt at a Cell implementation
 VL, unrolling, and reordering fixed
 No NUMA
 Exploits DMA and double buffering to load vectors
 Straight to SIMD intrinsics
 Despite the relative performance, Cell's DP implementation severely impairs performance

[Same charts, annotated with sustained fractions of peak: Clovertown 7.5% of peak flops / 18% of bandwidth; Opteron (rev.F) 42% of peak flops / 35% of bandwidth; Niagara2 59% of peak flops / 15% of bandwidth; Cell SPEs 57% of peak flops / 33% of bandwidth]
*collision() only
Speedup from Heterogeneity

 First attempt at a Cell implementation
 VL, unrolling, and reordering fixed
 Exploits DMA and double buffering to load vectors
 Straight to SIMD intrinsics
 Despite the relative performance, Cell's DP implementation severely impairs performance

[Same charts, annotated with speedups: Clovertown 1.6x, Opteron (rev.F) 4.3x, Niagara2 1.5x, and Cell SPEs 13x over the auto-tuned PPE implementation]
*collision() only
Speedup over naive

 First attempt at a Cell implementation
 VL, unrolling, and reordering fixed
 Exploits DMA and double buffering to load vectors
 Straight to SIMD intrinsics
 Despite the relative performance, Cell's DP implementation severely impairs performance

[Same charts, annotated with speedups over the naïve implementation: Clovertown 1.6x, Opteron (rev.F) 4.3x, Niagara2 1.5x, Cell SPEs 130x]
*collision() only
Summary
Aggregate Performance
(Fully optimized)

 Cell SPEs deliver the best full-system performance
   - although Niagara2 delivers nearly comparable per-socket performance
 The dual-core Opteron delivers far better performance (bandwidth) than Clovertown
 Clovertown has far too little effective FSB bandwidth

[Chart: LBMHD (64^3) GFlop/s vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade]
Parallel Efficiency
(average performance per thread, fully optimized)

 Aggregate MFlop/s / #cores
 Niagara2 & Cell showed very good multicore scaling
 Clovertown showed very poor multicore scaling (FSB-limited)

[Chart: LBMHD (64^3) GFlop/s per core vs. number of cores (1-16) for Opteron, Clovertown, Niagara2 (Huron), and the Cell Blade]
Power Efficiency
(Fully optimized)

 Used a digital power meter to measure sustained power
 Calculate power efficiency as: sustained performance / sustained power
 All cache-based machines delivered similar power efficiency
 FBDIMMs (~12W each) drive up sustained power:
   - 8 DIMMs on the Clovertown machine (total of ~330W)
   - 16 DIMMs on the N2 machine (total of ~450W)

[Chart: LBMHD (64^3) MFlop/s per watt for Clovertown, Opteron, Niagara2 (Huron), and the Cell Blade]
Productivity

 Niagara2 required significantly less work to deliver good performance (just vectorization for large problems)
 Clovertown, Opteron, and Cell all required SIMD (hampers productivity) for best performance
 Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
Summary

 Niagara2 delivered both very good performance and productivity
 Cell delivered very good performance and efficiency (processor and power)
 On the memory-bound Clovertown, parallelism wins out over optimization and auto-tuning
 Our multicore auto-tuned LBMHD implementation significantly outperformed the already optimized serial implementation
 Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio)
 Architectural transparency is invaluable in optimizing code
Multi-core arms race
New Multicores

[Bar charts: GFlop/s (0-20) vs. concurrency for the 64^3 and 128^3 problems on the 2.2GHz Opteron (rev.F) and the 1.40GHz Niagara2]
New Multicores

 Barcelona is a quad-core Opteron
 Victoria Falls is a dual-socket (128 threads) Niagara2
 Both have the same total bandwidth

[Bar charts: GFlop/s (0-20) vs. concurrency for the 64^3 and 128^3 problems on the 2.2GHz Opteron (rev.F), 2.3GHz Opteron (Barcelona), 1.40GHz Niagara2, and 1.16GHz Victoria Falls; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +SIMDization, +Smaller pages]
Speedup from multicore/socket

 Barcelona is a quad-core Opteron
 Victoria Falls is a dual-socket (128 threads) Niagara2
 Both have the same total bandwidth

[Same charts, annotated with speedups over their predecessors: Barcelona 1.9x over the rev.F Opteron (1.8x frequency-normalized); Victoria Falls 1.6x over Niagara2 (1.9x frequency-normalized)]
Speedup from Auto-tuning

 Barcelona is a quad-core Opteron
 Victoria Falls is a dual-socket (128 threads) Niagara2
 Both have the same total bandwidth

[Same charts, annotated with speedup from auto-tuning over the Naïve+NUMA baseline: Opteron (rev.F) 4.3x, Opteron (Barcelona) 3.9x, Niagara2 1.5x, Victoria Falls 16x]
Questions?
Acknowledgements

 UC Berkeley
   - RADLab cluster (Opterons)
   - PSI cluster (Clovertowns)
 Sun Microsystems
   - Niagara2 donations
 Forschungszentrum Jülich
   - Cell blade cluster access
 George Vahala, et al.
   - Original version of LBMHD
 ASCR Office in the DOE Office of Science
   - Contract DE-AC02-05CH11231