
IBM Software Group
IBM XL - Compiling for CELL
Mark Mendell
March 24, 2008
© 2008 IBM Corporation
Topics
- Quick IBM XL Static Compiler Overview
- CELL Broadband Engine
- Generating Single Instruction Multiple Data (SIMD)
- CELL "Single Source" compiler
ECE540
The XL Compiler Architecture
[Figure: XL compiler architecture. Compile step: the FORTRAN, C++, and C front ends each emit Wcode; TPO optimizes it (Wcode+, IPA objects); TOBEY turns Wcode into optimized objects. Link step: IPA objects, libraries, and other objects flow through TPO and TOBEY as partitions (SPU and PPU), then through the system linker into an EXE or DLL; PDF info gathered from instrumented runs feeds back into optimization.]
TOBEY
(Toronto Optimizing Back-End with Yorktown)
- Low-level (machine-level) optimizer
- Traditional optimizations: value numbering, constant propagation, commoning, unrolling, inlining, strength reduction (reassociation), dead code, dead store, and many others
- Scheduling: global and local (before and after register allocation), superblock, swing modulo
- Register allocation
- Prologue/epilogue generation
- Assembler or object output
- Object listing
TPO (Toronto Portable Optimizer)
- High-level optimizer
- Works on Wcode (input and output)
- IPA (Interprocedural Analysis): optimizes across an entire application rather than one file at a time
- PDF (Profile-Directed Feedback): gathers information from sample runs and retunes optimization accordingly
- SMP (Symmetric Multiprocessing): optimization including automatically parallelizing single-threaded applications
- HOT (High-Order Transformations): loop optimizations to improve cache utilization
- SIMD (Single Instruction, Multiple Data): replaces scalar code with SIMD code
More TPO Optimizations
- Traditional data-flow optimizations
- Loop analysis and transformation: parallelization (OpenMP), SIMD exploitation
- Whole-program optimization
- Data reorganization: data shape and affinity analysis; splitting/grouping/compression/interleaving
- Profile-directed optimization
- Value Range Propagation (VRP): keeps track of relative values of expressions in a program, for example "x<y+1" or "x!=0"
- Automatic parallelization
Cell Broadband Engine
- Multiprocessor on a chip
- Power Processor Element (PPE): general purpose, runs full-fledged OSs
- Synergistic Processor Element (SPE): optimized for compute density
- Performance is achieved by parallelizing across all the heterogeneous processing elements
Cell Broadband Engine
[Figure: 8 SPEs with local memories and the PPE (with L1/L2) connected by the Element Interconnect Bus (96 bytes/cycle), with links to external memory and external I/O; port widths of 16 bytes/cycle (one dir) per SPE, 8 bytes/cycle (per dir), and 128 bytes/cycle (one dir) aggregate.]
- Heterogeneous, multi-core engine: 1 multi-threaded Power processor; up to 8 compute-intensive-ISA engines
- Local memories: fast access to 256KB local memories; globally coherent DMA to transfer data
- Pervasive SIMD: the PPE has VMX; the SPEs are SIMD-only engines
- High bandwidth: fast internal bus (200GB/s); dual XDR controller (25.6GB/s); two configurable interfaces (76.8GB/s); numbers based on a 3.2GHz clock rate
Outline
[Roadmap figure. Part 1: automatic SPE tuning, from multiple-ISA hand-tuned programs to automatic tuning for each ISA. Part 2: automatic simdization, from explicit SIMD coding through SIMD/alignment directives to fully automatic simdization. Part 3: shared memory & single-program abstraction, from explicit parallelization with local memories to automatic parallelization.]
SPE Features Optimized for by the Compiler
[Figure: Synergistic Processing Element (SPE) block diagram: even pipe (floating/fixed point) and odd pipe (branch, memory, permute) fed by dual-issue instruction logic; instruction buffer (3.5 x 32 instr); register file (128 x 16-byte registers); single-ported 256KB local store; globally coherent DMA. Local-store ports: 16 bytes (one dir) for load/store, 8 bytes (per dir), 128 bytes (one dir) for instruction fetch and DMA.]
- SIMD-only functional units: 16-byte register/memory accesses; data must be parallel & properly aligned
- Dual-issue for instructions: full dependence check in hardware
- Simplified branch architecture: no hardware branch predictor; compiler-managed hint/predication
- Single-ported local memory: aligned accesses only; contentions alleviated by the compiler
Feature #1: SPE's Functional Units are SIMD Only
- Functional units are SIMD only
- All transfers are 16 bytes wide, including register file and memory
- How do we handle scalar code?
[SPE block diagram repeated.]
Single Instruction Multiple Data (SIMD)
Meant to process multiple "b[i]+c[i]" data per operation.
[Figure: memory streams b0..b10 and c0..c10, marked off at 16-byte boundaries; registers R1 = (b0,b1,b2,b3) and R2 = (c0,c1,c2,c3); VADD computes R3 = (b0+c0, b1+c1, b2+c2, b3+c3).]
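The VADD above can be sketched in portable C with GCC/Clang vector extensions. This is a stand-in for illustration only: the names `v4f` and `vadd` are invented here, and real SPE code would use the `spu_intrinsics.h` types and intrinsics instead.

```c
/* Four floats packed into one 16-byte register, as on the SPE. */
typedef float v4f __attribute__((vector_size(16)));

/* One "VADD": a single operation computes four b[i]+c[i] sums. */
static v4f vadd(v4f b, v4f c) {
    return b + c;   /* element-wise add across all four slots */
}
```

With R1 = (1,2,3,4) and R2 = (10,20,30,40), `vadd` yields (11,22,33,44), matching the figure slot by slot.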
Scalar code on Scalar Functional Units
- Example: a[2] = b[1] + c[3]
[Figure: LOAD b[1] puts b1 into r1; LOAD c[3] puts c3 into r2; ADD computes b1+c3 into r3; STORE a[2] writes b1+c3 into a[2] and touches nothing else.]
Scalar Code on SIMD Functional Units
- Example: a[2] = b[1] + c[3]
[Figure: a 16-byte LOAD at b[1] brings in the whole aligned line (b0,b1,b2,b3) as r1; a 16-byte LOAD at c[3] brings in (c0,c1,c2,c3) as r2; ADD produces the slot-by-slot sums (b0+c0, b1+c1, b2+c2, b3+c3) in r3; a 16-byte STORE at a[2] writes the entire line over a[0..3].]
Problem #1: memory alignment defines the data's location in the register.
Problem #2: adding the aligned values yields the wrong result (b1 and c3 sit in different slots).
Problem #3: the vector store clobbers neighboring values.
Scalar Load Handling
- Use a load-rotate sequence
[Figure: LOAD b[1] brings (b0,b1,b2,b3) into r1; ROTATE &b[1] rotates it into r1' = (b1,b2,b3,b0).]
- Overhead (1 op): one quad-word byte rotate
- Outcome: the desired scalar value is always in the first slot of the register; this addresses Problems 1 & 2
Scalar Store Handling
- Use a read-modify-write sequence
[Figure: LOAD a[2] brings in the 16-byte line containing a[2]; CWD &a[2] generates the proper insertion mask for &a[2]; SHUFFLE inserts the computed value b1+c3 into the a[2] slot; STORE a[2] writes the line back with only a[2] changed.]
- Overhead (1 to 3 ops): one shuffle byte, one mask formation (may be reused), one load (may be reused)
- Outcome: the SIMD store does not clobber memory (this addresses Problem 3)
Optimizations for Scalar on SIMD
- The significant overhead of scalar loads/stores can be lowered
- For vectorizable code: generate SIMD code directly to fully utilize the SIMD units (done by expert programmers or compilers)
- For scalar variables: allocate each scalar in the first slot of its own 16-byte line, by itself [ i | * | * | * ]
  - eliminates the rotate when loading: data is guaranteed to be in the first slot (Problems 1 & 2)
  - eliminates the read-modify-write when storing: the other data in the 16-byte line is garbage (Problem 3)
  - wastes space!
Feature #2: Software-Assisted Branch Architecture
- Branch architecture: no hardware branch predictor, but:
  - compare/select ops for predication
  - a software-managed branch hint (one hint active at a time)
- Lowering overhead by:
  - predicating small if-then-else sequences
  - hinting predictably taken branches
[SPE block diagram repeated.]
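Predicating a small if-then-else means computing both sides and picking one with a select, instead of branching. A minimal sketch in portable C, using a bitmask in place of the SPE's compare/select ops (the name `select_min` is invented for this example):

```c
/* Branch-free "if (a < b) return a; else return b;":
   the comparison produces an all-ones or all-zeros mask, and the mask
   selects between the two precomputed results, so no branch is needed. */
static int select_min(int a, int b) {
    int take_a = -(a < b);               /* all ones if a < b, else 0 */
    return (a & take_a) | (b & ~take_a); /* select a or b by mask     */
}
```

This is exactly the shape the compiler emits when it predicates a small if-then-else rather than risk an unhinted branch.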
Feature #3: Software-Assisted Instruction Issue
- Dual-issue for instructions:
  - can dual-issue parallel instructions
  - code layout constrains dual issuing
  - full dependence check in hardware
- Alleviate the constraints by making the scheduler aware of code-layout issues
[SPE block diagram repeated.]
Alleviating Issue Restriction
- Scheduling finds the best possible schedule: the dependence graph is modified to account for the latency of false dependences
- Bundling ensures the code-layout restrictions are met: keep track of even/odd code layout at all times; swap parallel ops when needed; insert (even or odd) nops when needed
- Engineering issues: each function must start at a known even/odd code-layout boundary; no instructions can be added after the last scheduling phase, since that would change the code layout and thus the dual-issue constraints
Feature #4: Single-Ported Local Memory
- The local store is single-ported:
  - denser hardware
  - asymmetric port: 16 bytes for load/store ops, 128 bytes for IFETCH/DMA
  - static priority: DMA > MEM > IFETCH
- If we are not careful, we may starve for instructions
[SPE block diagram repeated.]
Hinting Branches & Instruction Starvation Prevention
- The SPE provides a HINT operation:
  - fetches the branch target into the HINT buffer
  - no penalty for correctly predicted branches
  - fetching ops from the target needs a minimum of 15 cycles and 8 intervening ops before the branch
- The compiler inserts hints when beneficial
- Impact on instruction starvation: after a correctly hinted branch, the IFETCH window is smaller
[Figure: the dual-issue instruction logic consumes from the instruction buffers and the HINT buffer; "HINT br, target" prefetches the target ops so that "BRANCH if true target" does not pay the IFETCH-window refill latency.]
SPE Optimization Results (Kernels)
[Chart: relative reductions in execution time (scale 0.4 to 1.0) for Original, +Bundle, +Branch Hint, and +Ifetch across kernels including Saxpy, Mat Mult, Convolution, Linpack, VLD, LU, FFT, Huffman, and their average; average 1.00 → 0.78.]
Single SPE performance, optimized, simdized code.
Outline
[Roadmap figure repeated; next up is Part 2: automatic simdization.]
Successful Simdization
Extract parallelism and satisfy the constraints.
- Extract parallelism at the:
  - loop level: for (i=0; i<256; i++) a[i] = ...
  - basic-block level: a[i+0] = ...; a[i+1] = ...; a[i+2] = ...; a[i+3] = ...
  - entire-short-loop level: for (i=0; i<8; i++) a[i] = ...
- Satisfy the constraints:
  - alignment constraints: vload b[1] and vload b[5] fetch the aligned lines (b0..b3) and (b4..b7) on 16-byte boundaries; a vpermute produces the misaligned (b1,b2,b3,b4)
  - data size conversion: load shorts, unpack into two int vectors, add, store both
  - multiple targets: GENERIC, VMX, SPE
Example of SIMD-Parallelism Extraction
- Loop level (for (i=0; i<256; i++) a[i] = ...): SIMD for a single statement across consecutive iterations; successful at:
  - efficiently handling misaligned data
  - pattern recognition (reduction, linear recursion)
  - leveraging the loop transformations in most compilers
- Also extracted at the basic-block level (a[i+0] = ... through a[i+3] = ...) and for entire short loops (for (i=0; i<8; i++) a[i] = ...)
[Bik et al., IJPP 2002] [VAST compiler, 2004] [Eichenberger et al., PLDI 2004] [Wu et al., CGO 2005] [Naishlos, GCC Developer's Summit 2004]
Example of SIMD Constraints
- Alignment in SIMD units matters: consider "b[i+1] + c[i+0]"
[Figure: vload b[1] returns the aligned line R1 = (b0,b1,b2,b3) and vload c[0] returns R2 = (c0,c1,c2,c3); a slot-by-slot add yields (b0+c0, b1+c1, b2+c2, b3+c3), which is not b[1] + c[0].]
Example of SIMD Constraints (cont.)
- Alignment in SIMD units matters: when the alignments of the inputs do not match, the data must be realigned
[Figure: vload b[1] and vload b[5] fetch the aligned lines (b0..b3) and (b4..b7); a vpermute (shuffle) produces R1 = (b1,b2,b3,b4), which added to R2 = (c0,c1,c2,c3) yields the desired (b1+c0, b2+c1, b3+c2, b4+c3).]
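The two-aligned-loads-plus-permute realignment can be sketched in portable C, with scalar copies standing in for the vector loads and the vpermute. The function name `simd_add_misaligned` is invented for this sketch; a real simdizer emits actual vload/vpermute instructions.

```c
/* Compute out[i] = b[i+1] + c[i] for one quadword, the way a simdizer
   would when b is misaligned by one element: two aligned loads from b,
   a "vpermute" that shifts across them, then a slot-by-slot add. */
static void simd_add_misaligned(const float *b, const float *c, float *out) {
    float lo[4], hi[4], r1[4];
    for (int i = 0; i < 4; i++) lo[i] = b[i];      /* vload b[0..3]       */
    for (int i = 0; i < 4; i++) hi[i] = b[4 + i];  /* vload b[4..7]       */
    for (int i = 0; i < 3; i++) r1[i] = lo[i + 1]; /* vpermute: shift by  */
    r1[3] = hi[0];                                 /* one across the pair */
    for (int i = 0; i < 4; i++) out[i] = r1[i] + c[i]; /* b[i+1] + c[i]   */
}
```

Both loads stay on 16-byte boundaries; only the permute pays for the misalignment.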
Automatic Simdization for Cell
- Integrated approach:
  - extract parallelism at multiple levels (loop, basic block, entire short loop)
  - satisfy all SIMD constraints (alignment, data size conversion, multiple targets: GENERIC, VMX, SPU, BG/L)
  - use a "virtual SIMD vector" as glue
- Minimize alignment overhead:
  - lazily insert data reorganization
  - handle compile-time & runtime alignment
  - simdize the prologue/epilogue for SPEs (memory accesses are always safe on the SPE)
- Full-throughput computations:
  - even in the presence of data conversions
  - manually unrolled loops...
A Unified Simdization Framework
- Global information gathering: pointer analysis, alignment analysis, constant propagation, ...
- General transformations for SIMD: dependence elimination, data layout optimization, idiom recognition (with diagnostic output)
- Simdization (architecture independent): straight-line-code simdization, loop-level simdization
- SIMD intrinsic generator (architecture specific): BG, VMX, CELL
SPE Simdization Results (Kernels)
[Chart: speedup factors ranging from 2.4 to 26.2 for Linpack, Swim-l2, Dot Product, Checksum, Alpha Blending, FIR, Autcor, Saxpy, Mat Mult, and their average.]
Single SPE, optimized, automatic simdization vs. scalar code.
Example Program – SIMD (noopt)

float a[1000], b[1000], c[1000];
int main() {
  int i;
  for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
}

Compile: spuxlc -S t.c

Generated loop:

.LC__3:
        ila     $2,b
        lqd     $3,32($1)
        shli    $4,$3,2
        lqx     $2,$2,$4
        rotqby  $2,$2,$4
        ila     $3,c
        lqx     $3,$3,$4
        rotqby  $3,$3,$4
        fa      $2,$2,$3
        ila     $3,a
        lqx     $5,$3,$4
        cwx     $6,$4,$3
        shufb   $2,$2,$5,$6
        stqx    $2,$3,$4
        lqd     $2,32($1)
        ai      $3,$2,1
        lqd     $2,32($1)
        cwd     $4,0($1)
        shufb   $2,$3,$2,$4
        stqd    $2,32($1)
        il      $2,1000
        cgt     $2,$2,$3
        brnz    $2,.LC__3
Example Program – SIMD (O2)

float a[1000], b[1000], c[1000];
int main() {
  int i;
  for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
}

Compile: spuxlc -S -O2 t.c

Generated loop:

.LC__3:
        ai      $5,$5,-1
        lqx     $8,$2,$7
        lqx     $9,$4,$7
        lqx     $10,$6,$7
        cwx     $11,$6,$7
        rotqby  $8,$8,$7
        rotqby  $9,$9,$7
        fa      $8,$8,$9
        shufb   $8,$8,$10,$11
        stqx    $8,$6,$7
        ai      $7,$7,4
        brnz    $5,.LC__3
Example Program – SIMD (O3 -qhot=SIMD)

float a[1000], b[1000], c[1000];
int main() {
  int i;
  for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
}

Compile: spuxlc -S -O3 -qhot t.c
(Unrolling & modulo scheduling disabled.)

Generated code:

        il      $5,250
        hbrr    .LC__20,.LC__3
        ila     $2,a
        ila     $4,b
        ila     $6,c
        il      $9,0
        lnop
.LC__3:
        ai      $5,$5,-1
        lqx     $7,$4,$9
        lqx     $8,$6,$9
        fa      $7,$7,$8
        nop     $1
        stqx    $7,$2,$9
        ai      $9,$9,16
.LC__20:
        brnz    $5,.LC__3
SIMD Report
- -qreport

Examine loop <1> on line 4 in file "t.c"
Peeling scheme: peel simd statements for single align
with the following characteristics:
Prologue
  0 blocked loops
  with max trip count of 0
Main loop
  orig ub is 1000u
  new ub is 1000u
Epilogue
  0 blocked loops
  with max trip count of 0
(simdizable) []
SIMD report (cont'd)

float a[1000], b[1001], c[1000];
int main() {
  int i;
  for (i = 0; i < 1000; i++)
    a[i] = b[i+1] + c[i];
}

...
(simdizable) [misalign() shift(1 compile-time)]
SIMD report (cont'd)

float a[1001], c[1000];
int main() {
  int i;
  for (i = 0; i < 1000; i++)
    a[i] = a[i-1] + c[i];
}

...
recurrence on self: a[]0{6}:(flow):(1 )
5 | a[]0[$.CIV0] = a[]0[$.CIV0 - 1] + c[]0[$.CIV0];
(non_simdizable)
Single Source Compiler
Outline
[Roadmap figure repeated; next up is Part 3: shared memory & single-program abstraction.]
Cell Memory & DMA Architecture
[Figure: SPEs #1..#8, each with an SPU, MFC registers, TLBs, an MMU, and a local store that is aliased into main memory; the PPE (L1/L2), I/O devices, QofS/L3 (external), and memory requests all share the global address space.]
- Local stores are mapped into the global address space:
  - the PPE can access/DMA that memory and set access rights
- An SPE can initiate DMAs to any global address, including the local stores of other SPEs; translation is done by the MMU
- Note: all elements may be masters; there are no designated slaves
Dual Source Compilation of a Cell Program
Manual compiling & binding:
[Figure: SPE sources go through the SPE compiler and the SPE linker (with SPE libraries) to an SPE executable; the SPE embedder wraps that executable into a PPE object. PPE sources go through the PPE compiler to PPE objects; the PPE linker combines all PPE objects with the PPE libraries into a single executable whose memory image contains the PPE program, the SPE code, and data.]
Anatomy of a Cell Program
[Figure: the executable's memory image (PPE program, SPE code, data) deployed onto a Cell: 8 SPEs and the PPE on the Element Interconnect Bus (96 bytes/cycle), connected to external memory.]
Program: for (i=0; i<10K; i++) A[i] = B[i] + C[i]
A. PPE program:
  A1. Invoke the thread lib to start threads
  A2. Load SPE code "loop1" and initiate
  A3. Wait for the SPE to finish
B. SPE code "loop(lb, ub)":
  B1. dma_get B,C[lb : ub];
  B2. for (i=lb; i<ub; i++) A[i] = B[i] + C[i];
  B3. dma_put A[lb : ub];
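Steps B1-B3 can be sketched in portable, host-side C, with `memcpy` standing in for `dma_get`/`dma_put` (on a real SPE these are MFC DMA commands). The name `spe_loop` and the chunk size are invented for this sketch.

```c
#include <string.h>

#define CHUNK 100

/* Sketch of the SPE side: stage a chunk of B and C into "local store",
   compute on the local copies, then write the A chunk back out. */
static void spe_loop(float *A, const float *B, const float *C,
                     int lb, int ub) {
    float Bl[CHUNK], Cl[CHUNK], Al[CHUNK];
    for (int i = lb; i < ub; i += CHUNK) {
        int n = (ub - i < CHUNK) ? ub - i : CHUNK;
        memcpy(Bl, B + i, n * sizeof *Bl);   /* B1: dma_get B[lb:ub] */
        memcpy(Cl, C + i, n * sizeof *Cl);   /* B1: dma_get C[lb:ub] */
        for (int j = 0; j < n; j++)          /* B2: compute locally  */
            Al[j] = Bl[j] + Cl[j];
        memcpy(A + i, Al, n * sizeof *Al);   /* B3: dma_put A[lb:ub] */
    }
}
```

The PPE side (steps A1-A3) would start one such loop per SPE, each with its own [lb, ub) slice.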
"Single Source" Compiler
- The user prepares an application as a collection of one or more source files containing OpenMP pragmas
- The compiler uses the pragmas to partition code between the PPE and the SPEs
- The compiler handles data transfers:
  - identifies accesses in SPE functions that refer to data in system-memory locations
  - uses static buffers or a software cache to transfer this data to/from the SPE local stores
- The compiler handles code size:
  - uses code partitioning for Single Source
  - automatic partitioning based on relationships and size
“Single Source” Compilation of a Cell Program
Single Source Prog.
PPE
Source
PPE Compiler
PPE
Source
PPE
program
PPE
Object
PPE Linker
SPE
Libraries
SPE
Exec
SPE Embedder
SPE
Object
SPE Linker
SPE
Source
SPE
Object
Executable
SPE
code
Data
PPE
Object
PPE
Object
Memory
Image
PPE
Libraries
Single
Source
44
SPE
Source
SPE Compiler
Architecture-Independent Compiler
#pragma OMP parallel for
for( i =0; i<10000; i++)
A[i] = B[i] + C[i];
Single
Source
Automatic Compiling & Binding
ECE540
© 2008 IBM Corporation
Compiling a single source file for the Cell

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Generated PPE code:
  foo1();
  Runtime distribution of work: invoke foo3_SPU, for i=[0,N)
  Runtime barrier
  foo2();

Outlined for the PPE, foo3(LB,UB):
  for (i=LB; i < UB; i++)
    A[i] = x * B[i];
  Runtime barrier

Outlined for the SPE, foo3_SPU(LB,UB):
  for (i=LB; i < UB; i++)
    A[i] = x * B[i];
  Runtime barrier

In SPE code: A, B, and x are shared.
Compiling a single source file for the Cell (cont.)

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Generated PPE code:
  foo1();
  Runtime distribution of work: invoke foo3_SPU, for i=[0,N)
  Runtime barrier
  foo2();

foo3(LB,UB):
  for (i=LB; i < UB; i++)
    A[i] = x * B[i];
  Runtime barrier

foo3_SPU(LB,UB):
  /** buffers A'[M], B'[M] **/
  for (k=LB; k < UB; k+=M) {
    DMA M elements of B into B'
    for (j=0; j<M; j++)
      A'[j] = cache_lookup(x) * B'[j];
    DMA M elements of A out of A'
  }
  Runtime barrier
Using OpenMP to partition/parallelize across Cell
- A single source program contains C, C++, or Fortran with OpenMP user directives or pragmas
- The compiler "outlines" all code within the pragmas into separate functions compiled for the SPE
- It replaces the outlined code with a call into the parallel runtime and compiles this code for the PPE
- The master thread executes on the PPE
- PPE runtime:
  - places outlined functions on a work queue containing information about the number of iterations to execute, or the 'chunk' size for each SPE
  - creates up to 16 SPE threads that pull work items (outlined parallel functions) from the queue and execute them on the SPEs
  - may wait for SPE completion, or proceed with other PPE statement execution
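From the programmer's side, the whole mechanism is driven by one pragma. A minimal sketch (the function name `vector_add` is invented here): the single-source compiler outlines the loop body for the SPEs, while a compiler without OpenMP support simply ignores the pragma and runs the loop sequentially.

```c
/* The loop below is what gets "outlined": the compiler compiles its body
   for the SPE and replaces it on the PPE with a call into the parallel
   runtime, which distributes [0,n) across the SPE threads. */
void vector_add(float *A, const float *B, const float *C, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        A[i] = B[i] + C[i];
}
```

The same source compiles unchanged for a plain SMP target, which is the point of the single-source approach.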
IBM Software Group
Why OpenMP directives?
 Reasonable acceptance in the industry – growing with the
increasing ubiquity of multi-core System on a Chip (SOC)
 Allows us to sidestep the issues of auto-parallelization
detection – for now
 Simplifies memory consistency issues – adhere to OpenMP
shared memory, relaxed consistency model
 May be extensible to address future accelerator specific
features
 Provides a path to fully automatic approach based on
underlying compiler support
PPE Runtime
- The first OMP construct initializes the runtime system:
  - creates the SPE threads and loads the SPE runtime
  - creates the work queue and gets the DMA queue addresses
  - sends the address of the work queue to each SPE
  - sets global options
- Sends a "setup_done" to the SPEs after partitioning/scheduling the work items
- Parallel regions all run on the SPEs

Master thread:
  ...
  omp_rte_init();
  omp_rte_do_par(ol$1);
  ...

Outlined function (also runnable on optional PPE worker threads):
  void ol$1_PPE(LB, UB)
    for (i=LB; i<UB; i++)
      A[i] = B[i] + D[ C[i] ];
SPE Runtime
- An infinite loop waiting for signals from the PPE runtime, continuously looking for work
- DMA fetches work items from the work queue in system memory
- Depending on the work type:
  - translates the address of the SPE outlined procedure from the PPE outlined procedure
  - invokes the SPE outlined procedure
Runtime Interaction
[Figure: the master thread (... omp_rte_init(); omp_rte_do_par(ol$1); ...) drives the PPE runtime (partitioning, scheduling, synchronization, communication), which communicates through system memory, and a software cache, with the SPE runtime (performs work items, communication).]
SPE side:
  void ol$1_SPE(LB, UB)
    for (k=LB; k<UB; k+=100) {
      DMA 100 B,C elements into B',C'
      for (i=0; i<100; i++)
        A'[i] = B'[i] + cache_lookup(D[ C'[i] ]);
      DMA 100 A elements out of A'
    }
Competing for the SPE Local Store
The local store is fast, but needs support when full (it holds code, regular data, and irregular data). Provide compiler support:
- SPE code too large: the compiler partitions the code, and a partition manager pulls in code as needed
- Data with regular accesses too large: the compiler stages data in & out using static buffering; latencies can be hidden by double buffering
- Data with irregular accesses present (e.g. indirection, runtime pointers...): use a software-cache approach to pull the data in & out (a last-resort solution)
Hiding Communication using Double Buffering

Original code:
  for (i=0; i<100000; i++)
    A[i] = B[i] + C[i];

Single buffering (communication is blocked, 100 elements at a time):
  for (i=0; i<100000; i+=100) {
    dma_get(B', B[i], 400);
    dma_get(C', C[i], 400);
    for (ii=0; ii<100; ii++)
      A'[ii] = B'[ii] + C'[ii];
    dma_put(A[i], A', 400);
  }

Double buffering (computation and communication overlap, as their phases are software pipelined):
  dma_get(B', B[0], 400);
  dma_get(C', C[0], 400);
  for (i=0; i<99800; i+=200) {
    dma_get(B", B[i+100], 400);
    dma_get(C", C[i+100], 400);
    for (ii=0; ii<100; ii++)
      A'[ii] = B'[ii] + C'[ii];
    dma_put(A[i], A', 400);
    dma_get(B', B[i+200], 400);
    dma_get(C', C[i+200], 400);
    for (ii=100; ii<200; ii++)
      A"[ii] = B"[ii] + C"[ii];
    dma_put(A[i+100], A", 400);
  }
  for (ii=0; ii<100; ii++)
    A'[ii] = B'[ii] + C'[ii];
  dma_put(A[99900], A', 400);
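The same idea can be sketched as runnable, portable C, with synchronous `memcpy` standing in for the asynchronous DMA (so the overlap is only structural here, not actually concurrent). The name `add_double_buffered` and the assumption that `n` is a multiple of the chunk size are this sketch's, not the slide's.

```c
#include <string.h>

#define CHUNK 100

/* Double buffering: while chunk k is being computed out of one buffer
   pair, chunk k+1 is already being fetched into the other pair. */
static void add_double_buffered(float *A, const float *B,
                                const float *C, int n) {
    float b[2][CHUNK], c[2][CHUNK], a[CHUNK];
    int cur = 0;
    memcpy(b[0], B, sizeof b[0]);                 /* prefetch chunk 0   */
    memcpy(c[0], C, sizeof c[0]);
    for (int i = 0; i < n; i += CHUNK) {
        if (i + CHUNK < n) {                      /* fetch next chunk   */
            memcpy(b[1 - cur], B + i + CHUNK, sizeof b[0]);
            memcpy(c[1 - cur], C + i + CHUNK, sizeof c[0]);
        }
        for (int ii = 0; ii < CHUNK; ii++)        /* compute current    */
            a[ii] = b[cur][ii] + c[cur][ii];
        memcpy(A + i, a, sizeof a);               /* write result back  */
        cur = 1 - cur;                            /* swap buffer pairs  */
    }
}
```

On a real SPE the `dma_get` calls return immediately and the tag-status wait happens just before the buffers are consumed, which is what buys the overlap.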
Handling Irregular Accesses using a Software Cache

Original code:
  for (i=0; i<100000; i++)
    ... = ... D[ C[i] ];

Code with explicit cache lookup:
  for (i=0; i<100000; i++) {
    t = cache_lookup( D[ C[i] ] );
    ... = ... t;
  }

Cache lookup sequence:
  inline vector cache_lookup(addr)
    if (cache_directory[addr & key_mask] != (addr & tag_mask))
      miss_handler(addr);
    return cache_data[addr & key_mask][addr & offset_mask];

- The miss handler DMAs in the required data, plus some suitable quantity of surrounding data
- Higher degrees of associativity can be supported at little extra cost on a SIMD processor
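The lookup sequence above can be made concrete as a direct-mapped software cache in portable C, with `memcpy` standing in for the miss handler's DMA. The line size, set count, and the exact masking scheme are choices made for this sketch, not the XL runtime's actual parameters.

```c
#include <string.h>
#include <stdint.h>

#define LINE  16                      /* bytes per cache line            */
#define NSETS 64                      /* direct-mapped: 1KB of data      */

static uintptr_t     sc_dir[NSETS];           /* tag (line address)/set  */
static unsigned char sc_data[NSETS][LINE];    /* cached lines            */

/* Direct-mapped software-cache lookup: if the directory entry for the
   set does not hold this line's tag, "DMA" the line in, then index into
   the cached copy with the offset bits. */
static unsigned char cache_lookup(const unsigned char *addr) {
    uintptr_t a    = (uintptr_t)addr;
    uintptr_t line = a & ~(uintptr_t)(LINE - 1);  /* tag_mask            */
    unsigned  set  = (unsigned)((line / LINE) % NSETS); /* key_mask      */
    if (sc_dir[set] != line) {                    /* miss                */
        memcpy(sc_data[set], (const void *)line, LINE); /* miss_handler  */
        sc_dir[set] = line;
    }
    return sc_data[set][a & (LINE - 1)];          /* offset_mask         */
}
```

Repeated lookups into the same line hit in the directory and never touch "system memory" again until the set is evicted.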
XL C/C++ Single Source Compiler – Primary Usage Scenario
- Support existing OpenMP programs on CELL with little/no source changes
- Allow performance tuning in parallel regions by calling out to SPU routines
- SPU routines need to be aware of OpenMP and PPU/SPU addresses
- Users need to DMA to/from PPU memory if passed a PPU address
- Users need to be aware of the software cache (flushes)
- Users can use __ea pointers to access the software cache from SPU code
- __ea is only supported in C
Single Source Compiler Results
- Results for Swim, Mgrid, & some of their kernels
[Chart: speedup with 8 SPEs (scale 0 to 12), softcache vs. optimized, for swim, calc1, calc2, calc3, mgrid, resid, psinv, rprj3; baseline: execution on one single PPE.]
IBM Software Group
Questions?
57
ECE540
© 2008 IBM Corporation
Special Notices -- Trademarks
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in
other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM
offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained
in this document.
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions
on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give
you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY
10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives
only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or
guarantees either expressed or implied.
All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the
results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations
and conditions.
IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions
worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment
type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal
without notice.
IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.
All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
Many of the features described in this document are operating system dependent and may not be available on Linux. For more information,
please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html
Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are
dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this
document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally
available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document
should verify the applicable data for their specific environment.
Revised January 19, 2006
Special Notices (Cont.) -- Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter,
Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo),
IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage,
VideoCharger, Virtualization Engine.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.
Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other
countries, or both.
Rambus is a registered trademark of Rambus, Inc.
XDR and FlexIO are trademarks of Rambus, Inc.
UNIX is a registered trademark of The Open Group in the United States, other countries or both.
Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Fedora is a trademark of Red Hat, Inc.
Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.
Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.
AMD Opteron is a trademark of Advanced Micro Devices, Inc.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and
SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
AltiVec is a trademark of Freescale Semiconductor, Inc.
PCI-X and PCI Express are registered trademarks of PCI SIG.
InfiniBand™ is a trademark of the InfiniBand® Trade Association.
Other company, product and service names may be trademarks or service marks of others.
Revised July 23, 2006
Special Notices - Copyrights
(c) Copyright International Business Machines Corporation 2007.
All Rights Reserved. Printed in the United States January 2007.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM
IBM Logo
Power Architecture
Other company, product and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document are NOT
intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in
death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM
product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under
the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon
for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.
IBM Microelectronics Division
1580 Route 52, Bldg. 504
Hopewell Junction, NY 12533-6351
The IBM home page is http://www.ibm.com
The IBM Microelectronics Division home page is http://www.chips.ibm.com
Backup
History of IBM XL Compiler
 Joint effort between the IBM Toronto Lab and the Yorktown Heights Research Lab
 First available in 1990 on AIX (Power1-chip-based systems)
 Released on Linux for pSeries and iSeries in Feb. 2003
 Supported under the OS/400 PASE environment on iSeries
 CELL support started at IBM Watson in 2001
 alphaWorks CELL compiler delivered in November 2005
Multiple-Platform C/C++/Fortran
 Multiple platforms including AIX, Mac OS X, OS/400, z/OS, z/VM, Linux for iSeries and pSeries, PASE (AS/400)
 Modular structure with common backend optimizers
 Compliant with ISO C 1989, ISO C 1999, ISO C++ 1998, Fortran 77/90/95, and the OpenMP industry standard (V2.0)
 High degree of option compatibility across PowerPC platforms
 Widely accepted by scientific and technical communities
 gcc compatibility, e.g. supports almost all gcc language extensions
Common Optimization Options
 -O0 (-qnoopt)
Some trivial optimizations done to improve compile time (!)
 -O/-O2
Most common TOBEY optimizations
Seems to be the most commonly used optimization level
 -O3
Turns on all TOBEY optimizations (some take more time)
Implies -qhot=level=0 (basic loop optimizations)
Implies -qnostrict (FP operations may be reordered, etc.)
 -qhot
High Order Transformations (Loop, SIMD)
 -O4 (PPU only)
Implies -qarch=auto -qtune=auto -qcache=auto -qipa=level=1 -qhot
 -O5 (PPU only)
Implies the above and -qipa=level=2
Whole program analysis
 Specify optimization options at link time as well as compile time for whole-program optimization (-O4/-O5)
More Options
 -qpdf1 (PPU only)
Generate Profile-Directed Feedback collection code
Afterwards, run the program with representative data
 -qpdf2 (PPU only)
Recompile using PDF data to do more optimizations
Can generate slower code if training run was not similar to final runs
 -Q/-qinline
Control inlining
 -qarch/-qtune
Generate code for a particular machine or family (such as ppc64)
Tune code for a machine or family
 -qcompact
Try to minimize code growth during optimizations
 Limits some inlining, unrolling, etc.
Instruction Starvation Situation
 There are 2 instruction buffers
 up to 64 ops along the fall-through path
 First buffer is half-empty
 can initiate refill
 When MEM port is continuously used
 starvation occurs (no ops left in buffers)
[Figure: dual-issue instruction logic issuing FP and MEM ops from two instruction buffers; a refill is initiated after a buffer is half empty]
Instruction Starvation Prevention
 SPE has an explicit IFETCH op
 which initiates an instruction fetch before it is too late to hide the refill latency
 Scheduler monitors the starvation situation
 when the MEM port is continuously used
 insert an IFETCH op within the red window (the last stretch of cycles in which a refill can still arrive before the buffer drains)
 Compiler design
 scheduler must keep track of code layout
 Hardware design
 IFETCH op is not needed if the memory port is idle for one or more cycles within the red window
[Figure: dual-issue instruction logic with instruction buffer; refill initiated after half empty; red window = refill + IFETCH latency]
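The starvation condition can be sketched with a toy simulation. This is illustrative only, not the SPE scheduler: the buffer sizes and refill latency are hypothetical round numbers, not hardware specifications.

```python
# Toy model of SPE instruction starvation (illustrative only).
# Hypothetical parameters: two instruction buffers holding 64 ops total,
# a refill that delivers 32 ops after REFILL_LATENCY cycles, and one op
# issued from the buffers per cycle. A refill can start only on a cycle
# where the MEM port is free (a non-MEM op or an explicit IFETCH).

BUFFER_OPS = 64       # ops held across both instruction buffers
REFILL_SIZE = 32      # ops delivered by one refill
REFILL_LATENCY = 15   # hypothetical cycles before a refill lands

def starves(ops):
    """Return True if issuing `ops` (a list of 'FP'/'MEM'/'IFETCH'
    mnemonics, one per cycle) drains the instruction buffers."""
    remaining = BUFFER_OPS
    refill_done = None                 # cycle when a pending refill lands
    for cycle, op in enumerate(ops):
        if remaining == 0:
            return True                # nothing left to issue: starvation
        remaining -= 1                 # issue one op from the buffers
        if refill_done is not None and cycle >= refill_done:
            remaining += REFILL_SIZE   # pending refill arrives
            refill_done = None
        # start a refill when the buffers are half empty and the MEM
        # port is not occupied by a load/store this cycle
        if refill_done is None and op != 'MEM' and remaining <= BUFFER_OPS // 2:
            refill_done = cycle + REFILL_LATENCY
    return False

# A continuous run of MEM ops blocks every refill and starves the SPE:
assert starves(['MEM'] * 100)
# One explicit IFETCH inside the window lets a refill land in time:
assert not starves(['MEM'] * 40 + ['IFETCH'] + ['MEM'] * 50)
```

The point the slides make is exactly the second case: a long run of loads and stores never leaves the memory port free, so the compiler must give up an issue slot to an explicit IFETCH, and must do so early enough that the refill latency is hidden.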
Engineering Issues for Dual-Issue & Starvation Prevention
 Initially, the scheduling and bundling phases were separate:
Code (not sched) → Sched → Code (not bundled) → Bundle → Code (sched & bundled)
 Sched finds the best schedule, using latencies, issue, & resource constraints
 Bundle satisfies dual-issue & instruction-starvation constraints by adding nops
 Problem: the bundler adds an IFETCH to prevent starvation. A better schedule could have been found if the scheduler had known that, but the schedule is already “finalized”.
Engineering Issues for Dual-Issue & Starvation Prevention
 We integrate Scheduling and Bundling tightly, on a cycle-by-cycle basis:
Code (not sched) → Sched + Bundle → Code (sched & bundled)
 find the best schedule, using latencies, issue, & resource constraints
 satisfy dual-issue & instruction-starvation constraints by adding nops
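As a rough sketch of the "find best schedule" half, a greedy cycle-by-cycle list scheduler respecting latencies and the one-FP-plus-one-MEM dual-issue constraint might look like the following. This is not IBM's scheduler; the op names and latencies are made up for illustration.

```python
# Greedy cycle-by-cycle list scheduler (illustrative sketch only).
# Each op is (name, pipe, latency, deps). Dual issue: at most one 'FP'
# and one 'MEM' op per cycle; an op is ready once every dependency's
# result is available (its issue cycle + latency has passed).

def schedule(ops):
    ready_at = {}      # name -> cycle its result becomes available
    pending = list(ops)
    cycles = []        # cycles[i] = {pipe: name} issued at cycle i
    cycle = 0
    while pending:
        issued = {}
        for op in list(pending):
            name, pipe, latency, deps = op
            if pipe in issued:
                continue   # this pipe already issued an op this cycle
            if all(d in ready_at and ready_at[d] <= cycle for d in deps):
                issued[pipe] = name
                ready_at[name] = cycle + latency
                pending.remove(op)
        cycles.append(issued)
        cycle += 1
    return cycles

# Hypothetical ops with made-up latencies:
prog = [
    ('ld_a', 'MEM', 6, []),
    ('ld_b', 'MEM', 6, []),
    ('add',  'FP',  2, ['ld_a', 'ld_b']),
    ('st',   'MEM', 6, ['add']),
]
sched = schedule(prog)
# ld_a and ld_b serialize on the MEM pipe; 'add' waits out the load
# latency. The empty cycles in between are exactly where a tightly
# integrated bundler would decide, cycle by cycle, whether to place
# nops or an IFETCH.
```

In the integrated design on this slide, the bundling decision (nop vs. IFETCH) is made inside this per-cycle loop rather than in a later pass over a finalized schedule.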