Vectorization for SIMD Architectures with Alignment Constraints

Alexandre Eichenberger, Peng Wu, Kevin O’Brien
IBM Research
IBM T.J. Watson Research Center, Yorktown Heights, NY
© 2002 IBM Corporation
Overview
• Background on simdization
    SIMD units
    prior compilation techniques
• General approach
    3-step approach
    data reorganization graph
    code generation
• Performance evaluation
• Summary
Traditional Execution of a Loop
• Sequential execution of
    for (i=0; i<100; i++) a[i+3] = b[i+1] + c[i+2];
[Figure: memory streams b, c, and a. The first iteration (shown in grey) performs load b[1], load c[2], add, store a[3].]
Can Be Speeded Up using SIMD
• Single Instruction Multiple Data (SIMD) units are popular
    typical SIMD vectors are 16 bytes
    compute 2 double, 4 float/int, 8 short, or 16 char results per operation
    available on most platforms:
      VMX/AltiVec (IBM PowerPC, Apple G5)
      MMX/SSE (Intel x86), ...
• Theoretical speedups are high
    e.g. factors of 2 to 16, depending on the data type
• Limited support for memory alignment
    some have limited hardware support (SSE2 has slower misaligned load ops)
    some have no hardware misalignment support (VMX/AltiVec, VIS)
• SIMD is hard to program
    optimized code is very dependent on data alignment
    difficult for both programmers and compilers
SIMD May Suffer from Alignment Problem
• The SIMD memory unit only loads/stores 16-byte chunks of 16-byte-aligned data
[Figure: array b in memory with 16-byte boundaries at 0x1000, 0x1010, 0x1020. With &b[1] = 0x1004, vload b[1] returns the aligned chunk b0 b1 b2 b3, so b[1] sits at byte offset 4 in the register.]
• We refer to vectorization for SIMD units as SIMDIZATION
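The alignment rule above can be sketched in scalar C. This is only an emulation under stated assumptions: `byte_offset` and `vload_emul` are hypothetical helper names, and `mem_base` is an assumed 16-byte-aligned address for the first byte of the `mem` buffer. It shows that a load from 0x1004 returns the whole aligned chunk starting at 0x1000, which leaves the requested element at byte offset 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { VBYTES = 16 };  /* vector width in bytes, as on the slides */

/* Byte offset of an address within its 16-byte chunk. */
unsigned byte_offset(uintptr_t addr) {
    return (unsigned)(addr % VBYTES);
}

/* Emulates a SIMD load without misalignment support: it returns the
   aligned 16-byte chunk that contains addr, not the 16 bytes at addr.
   mem stands in for memory whose first byte has address mem_base
   (hypothetical; assumed to be 16-byte aligned). */
void vload_emul(const unsigned char *mem, uintptr_t mem_base,
                uintptr_t addr, unsigned char out[VBYTES]) {
    uintptr_t aligned = addr - addr % VBYTES;  /* drop the low 4 bits */
    memcpy(out, mem + (aligned - mem_base), VBYTES);
}
```

With 4-byte elements, `byte_offset(0x1004)` is 4, matching the "byte offset 4 in register" annotation on the slide.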
Alignment Problem (cont.)
• Alignment matters in registers too, e.g. in our “b[i+1]+c[i+2]” example
[Figure: vload b[1] yields b0 b1 b2 b3 (b[1] at offset 4) and vload c[2] yields c0 c1 c2 c3 (c[2] at offset 8). A vadd of these registers computes b0+c0, b1+c1, b2+c2, b3+c3: this is not b[1]+c[2], ...]
How to Align Data in Registers
• If you need a vector of 4 values starting at misaligned address b[1]
[Figure: vload b[1] yields b0 b1 b2 b3 (offset 4) and vload b[5] yields b4 b5 b6 b7. A vpermute of the two registers produces b1 b2 b3 b4 at offset 0.]
    vpermute is a generic permutation op
    it is supported by most ISAs, e.g. vec_perm (byte permute) on VMX/AltiVec
    aligned data: the 1st element is at offset 0
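A minimal scalar sketch of the two-loads-plus-permute idea follows. It is an emulation, not AltiVec code: `vload`, `vpermute`, and `load_misaligned` are illustrative names, vectors are modelled as arrays of 4 floats, and the array `b` is assumed to start on a 16-byte boundary and to be readable through the second chunk.

```c
#include <assert.h>

enum { V = 4 };  /* 4 floats per 16-byte vector */

/* Aligned load emulation: returns the 4-element chunk containing b[idx].
   Assumes b itself starts on a 16-byte boundary and idx >= 0. */
void vload(const float *b, int idx, float out[V]) {
    int base = idx - idx % V;
    for (int j = 0; j < V; j++) out[j] = b[base + j];
}

/* Selects V consecutive elements, starting at element `shift`, from the
   concatenation of lo and hi. A byte-level permute can express this. */
void vpermute(const float lo[V], const float hi[V], int shift, float out[V]) {
    for (int j = 0; j < V; j++)
        out[j] = (shift + j < V) ? lo[shift + j] : hi[shift + j - V];
}

/* Builds the vector b[idx..idx+3] at offset 0 from two aligned loads. */
void load_misaligned(const float *b, int idx, float out[V]) {
    float lo[V], hi[V];
    vload(b, idx, lo);       /* chunk holding b[idx]     */
    vload(b, idx + V, hi);   /* the following chunk      */
    vpermute(lo, hi, idx % V, out);
}
```

For `idx = 1` this combines the chunks b0..b3 and b4..b7 into b1 b2 b3 b4, as in the figure.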
How to Align Data in Registers (cont.)
• When you need a stream of vectors starting at address b[1]
[Figure: the aligned loads vload b[1], vload b[5], vload b[9], ... yield b0..b3, b4..b7, b8..b11, ... (offset 4). Each vpermute combines two consecutive loaded registers, producing b1..b4, b5..b8, b9..b12, ... at offset 0.]
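The stream case can be sketched the same way. The point of the sketch is that each aligned chunk is loaded once and then feeds two consecutive permutes, so a stream of n vectors needs n+1 loads rather than 2n. The function name `realign_stream` and its return value (the load count) are assumptions made for illustration; `b` is assumed to be 16-byte aligned and readable through the last chunk touched.

```c
#include <assert.h>

enum { V = 4 };  /* 4 floats per 16-byte vector */

/* Writes nvec realigned vectors of the stream b[start], b[start+1], ...
   into out (nvec * V floats) and returns the number of aligned loads.
   Assumes start >= 0, b is 16-byte aligned, and b is readable up to the
   end of the chunk containing b[start + nvec * V]. */
int realign_stream(const float *b, int start, int nvec, float *out) {
    int shift = start % V;
    int base = start - shift;
    int loads = 0;
    float prev[V], next[V];

    for (int j = 0; j < V; j++) prev[j] = b[base + j];  /* first load */
    loads++;

    for (int k = 0; k < nvec; k++) {
        base += V;
        for (int j = 0; j < V; j++) next[j] = b[base + j];  /* one new load */
        loads++;
        for (int j = 0; j < V; j++)                          /* one permute */
            out[k * V + j] = (shift + j < V) ? prev[shift + j]
                                             : next[shift + j - V];
        for (int j = 0; j < V; j++) prev[j] = next[j];  /* reuse next time */
    }
    return loads;
}
```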
How to Align Data in Registers (cont.)
• Back to our “b[i+1]+c[i+2]” example
[Figure: vload b[1], b[5] followed by vpermute yields b1 b2 b3 b4 at offset 0; vload c[2], c[6] followed by vpermute yields c2 c3 c4 c5 at offset 0. The vadd now computes b1+c2, b2+c3, b3+c4, b4+c5: this is b[1]+c[2], ...]
Prior Work Approach Evaluation
• Initial approaches
    variations on “peel until all memory streams are aligned”
• Vast Compiler [Crescent Bay Software]
    one permute op per misaligned memory access
    inefficient in the number of permute ops,
    which are often a critical resource, e.g. Mac G5:
      2 SIMD memory pipes
      1 SIMD permute pipe
• Superword Level Parallelism [Larsen et al., PLDI 2000]
    for loops: “unroll and pack into SIMD”
    exhibits the same tradeoffs as loop unrolling vs. modulo scheduling
    the backedge is an optimization barrier
• Our focus is on loop-dominated multimedia/gaming applications
Paper’s Approach Overview
• Loop based, with optimized data reorganization
    1. Build data reorganization graph
    2. Place “shift streams” using optimizing policies
    3. Generate SIMD code
Data Reorganization Graph:
    vload b[i+1] (offset 4) → vshiftstream (offset 4→12)
    vload c[i+2] (offset 8) → vshiftstream (offset 8→12)
    vadd (offset 12)
    vstore a[i+3] (offset 12)
Code:
    <simdized prologue>
    for(i=0; i<100; i+=4)
        <simdized loop body>
    <simdized epilogue>
Data Reorganization Graph
• Streams: values addressed over the lifetime of a loop
    memory streams, e.g. b[i+1] (i=0..99 here)
      b0 b1 b2 b3 | b4 b5 b6 b7 | ... | b96 b97 b98 b99 | b100 b101 b102 b103
      (| marks 16-byte boundaries; offset = 4)
    register streams, output of stream operations, e.g. vload(b[i+1])
      [b0 b1 b2 b3] [b4 b5 b6 b7] ... [b96 b97 b98 b99] [b100 b101 b102 b103]
      (offset = 4)
• The alignment of a stream can be uniquely described by the offset of its 1st value
Data Reorganization Graph (cont.)
• Additional stream operation: vshiftstream(Oin, Oout)
    shifts the input stream from offset Oin to offset Oout, e.g.
      [b0 b1 b2 b3] [b4 b5 b6 b7] ... [b96 b97 b98 b99] [b100 b101 b102 b103]
      (offset = 4)
      vshiftstream(4,12)
      [b-2 b-1 b0 b1] [b2 b3 b4 b5] ... [b94 b95 b96 b97] [b98 b99 b100 b101]
      (offset = 12)
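The effect of vshiftstream on a register stream can be emulated on a flat array of slots. This is a sketch under assumptions: elements are 4 bytes, both offsets are multiples of 4, and slots with no source value receive a caller-supplied `fill` (on real hardware they would hold neighbouring data such as b-2 and b-1). The function signature is illustrative and not taken from the slides.

```c
#include <assert.h>

enum { V = 4, ELEM = 4 };  /* 4 floats of 4 bytes per 16-byte vector */

/* Emulates vshiftstream(oin, oout) on a register stream stored as
   nvec * V consecutive slots. Every value moves (oout - oin) / ELEM
   slots to the right (to the left if negative); slots with no source
   value are set to fill. Assumes oin and oout are multiples of ELEM. */
void vshiftstream(const float *in, int nvec, int oin, int oout,
                  float *out, float fill) {
    int d = (oout - oin) / ELEM;
    for (int s = 0; s < nvec * V; s++) {
        int src = s - d;
        out[s] = (src >= 0 && src < nvec * V) ? in[src] : fill;
    }
}
```

With the input registers [b0 b1 b2 b3] [b4 b5 b6 b7] and `vshiftstream(in, 2, 4, 12, out, fill)`, the output registers are [fill fill b0 b1] [b2 b3 b4 b5], so the first stream value b1 moves from slot 1 (offset 4) to slot 3 (offset 12).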
Data Reorganization Graph (cont.)
• vload/vstore(addr(i))
    offset from memory alignment
• vadd(in1,in2)
    all offsets must be identical
[Graph: vload b[i+1] (offset 4) and vload c[i+2] (offset 8) feed vadd (offset ⊥), which feeds vstore a[i+3] (offset 12).]
Valid Data Reorganization Graph
• vload/vstore(addr(i))
    offset from memory alignment
• vadd(in1,in2)
    all offsets must be identical
• Valid: no violation of offset definitions
[Graph: vload b[i+1] (offset 4) and vload c[i+2] (offset 8) feed vadd, which feeds vstore a[i+3] (offset 12). The vadd sees offsets 4 and 8 on its inputs and 12 on its output, so its offset is ⊥: a violation of the vadd definition.]
Valid Data Reorganization Graph (cont.)
• vload/vstore(addr(i))
    offset from memory alignment
• vadd(in1,in2)
    all offsets must be identical
• vshiftstream(Oin, Oout)
    shifts a stream to offset Oout
• Valid: no violation of offset definitions
• Make valid by suitably inserting vshiftstream ops
[Graph: vload b[i+1] (offset 4) → vshiftstream(4,12) (offset 12); vload c[i+2] (offset 8) → vshiftstream(8,12) (offset 12); both feed vadd (offset 12), which feeds vstore a[i+3] (offset 12).]
Where to insert shift streams? This is the first focus of our work.
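The validity rules listed above can be checked mechanically. The sketch below is one possible encoding, not the compiler's data structure: the `Node` type, the `UNDEF` constant standing in for ⊥, and the function names are all assumptions. A load takes its offset from memory, a vadd has a defined offset only if both inputs agree, and a vshiftstream(Oin, Oout) maps a stream at Oin to Oout.

```c
#include <assert.h>
#include <stddef.h>

enum { UNDEF = -1 };  /* stands for the undefined offset (⊥) */

typedef enum { VLOAD, VADD, VSHIFT } Kind;

typedef struct Node {
    Kind kind;
    int offset;                    /* VLOAD: memory offset; VSHIFT: Oout */
    int oin;                       /* VSHIFT only: expected input offset */
    const struct Node *in1, *in2;  /* inputs; in2 is used by VADD only   */
} Node;

/* Offset of the register stream produced by n, or UNDEF on a violation. */
int stream_offset(const Node *n) {
    switch (n->kind) {
    case VLOAD:
        return n->offset;
    case VSHIFT:
        return stream_offset(n->in1) == n->oin ? n->offset : UNDEF;
    case VADD: {
        int a = stream_offset(n->in1);
        int b = stream_offset(n->in2);
        return (a == b) ? a : UNDEF;  /* all offsets must be identical */
    }
    }
    return UNDEF;
}

/* A stored value is valid if its offset is defined and matches the store. */
int is_valid(const Node *value, int store_offset) {
    int o = stream_offset(value);
    return o != UNDEF && o == store_offset;
}
```

Under this encoding the graph without shifts (offsets 4 and 8 feeding vadd) is rejected, while the graph with vshiftstream(4,12) and vshiftstream(8,12) is accepted for a store at offset 12.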
Shift Stream Placement: Zero Policy
• Shifts all misaligned streams to/from offset zero
    least optimized, used for runtime alignment
[Graph: vload b[i+1] (offset 4) → vshiftstream(4,0); vload c[i+2] (offset 8) → vshiftstream(8,0); both feed vadd (offset 0) → vshiftstream(0,12) → vstore a[i+3] (offset 12).]
Shift Stream Placement: Eager Policy
• Eagerly shifts to store offset
    offset 12 is the store alignment here
[Graph: vload b[i+1] (offset 4) → vshiftstream(4,12); vload c[i+2] (offset 8) → vshiftstream(8,12); both feed vadd (offset 12) → vstore a[i+3] (offset 12).]
    number of shifts: 3 → 2 compared to Zero-Shift
Shift Stream Placement: Lazy Policy
• Lazily shifts to store offset
    b[i+1] and c[i+1] have the same alignment => delay shifting past the add
[Graph: vload b[i+1] (offset 4) and vload c[i+1] (offset 4) feed vadd (offset 4) → vshiftstream(4,12) → vstore a[i+3] (offset 12).]
    number of shifts: 3 → 1 compared to Zero-Shift
Shift Stream Placement: Dominant Policy
• Lazily shifts to dominant offset
    offset 4 is dominant → align to it instead of the store offset
[Graph: vload c[i+2] (offset 8) → vshiftstream(8,4); this and vload b[i+1] (offset 4) feed vadd (offset 4); that result and vload d[i+1] (offset 4) feed a second vadd (offset 4) → vshiftstream(4,12) → vstore a[i+3] (offset 12).]
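To compare the four placement policies, the number of vshiftstream ops each one inserts can be counted on a small expression tree. The sketch below is a simplified model, not the compiler's algorithm: the `Node` type and all function names are assumptions, the dominant offset is taken to be the most frequent offset among the loads and the store (ties favour the store offset), and a lazy walk shifts only the operands that disagree with the target offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical expression tree: a leaf is a vload with a known byte
   offset (0..15); an inner node is a vadd of two subtrees. */
typedef struct Node {
    const struct Node *lhs, *rhs;  /* both NULL for a load */
    int offset;                    /* used by loads only   */
} Node;

static int is_load(const Node *n) { return n->lhs == NULL; }

/* Shifts needed when every load is shifted to target immediately. */
static int shifts_at_loads(const Node *n, int target) {
    if (is_load(n)) return n->offset != target;
    return shifts_at_loads(n->lhs, target) + shifts_at_loads(n->rhs, target);
}

/* Lazy placement: operands stay unshifted while they agree; on a
   mismatch, the operands that differ from target are shifted to it.
   Returns the offset of the subtree result and adds to *count. */
static int lazy_walk(const Node *n, int target, int *count) {
    if (is_load(n)) return n->offset;
    int l = lazy_walk(n->lhs, target, count);
    int r = lazy_walk(n->rhs, target, count);
    if (l == r) return l;
    *count += (l != target) + (r != target);
    return target;
}

static void tally(const Node *n, int hist[16]) {
    if (is_load(n)) { hist[n->offset]++; return; }
    tally(n->lhs, hist);
    tally(n->rhs, hist);
}

int zero_policy(const Node *root, int store) {
    return shifts_at_loads(root, 0) + (store != 0);
}

int eager_policy(const Node *root, int store) {
    return shifts_at_loads(root, store);
}

int lazy_policy(const Node *root, int store) {
    int count = 0;
    int out = lazy_walk(root, store, &count);
    return count + (out != store);  /* final shift to the store offset */
}

int dominant_policy(const Node *root, int store) {
    int hist[16] = {0};
    int dom = store;
    hist[store]++;
    tally(root, hist);
    for (int o = 0; o < 16; o++)
        if (hist[o] > hist[dom]) dom = o;
    int count = 0;
    int out = lazy_walk(root, dom, &count);
    return count + (out != store);
}
```

On the slide examples this model gives 3 shifts for Zero and 2 for Eager on b[i+1]+c[i+2], 1 shift for Lazy on b[i+1]+c[i+1], and 2 shifts for Dominant on (c[i+2]+b[i+1])+d[i+1], all with the store at offset 12.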
Code Generation
• SIMD code is generated from a valid data reorganization graph
    simdized loop (steady state)
    simdized prologue & epilogue with partial vector stores
    no redundant load/computation is introduced
• Handles
    statements with stride-one accesses, loop invariants (more is being added)
    multiple statements with arbitrary misalignments
• With alignment known at
    compile time: arbitrary data reorganization graphs
    runtime: Zero-shift policy only
      (limitation due to code generation issues)
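The role of the partial vector stores in the prologue and epilogue can be illustrated with a scalar emulation of a[i+3] = b[i+1] + c[i+2]. This is a sketch only: `vstore_partial` and `simdized_add` are hypothetical names, the array `a` is assumed to start on a 16-byte boundary, and the data reorganization (shifts) is folded into the index arithmetic so that only the store side is shown. Lanes outside the loop's range are neither computed nor written here.

```c
#include <assert.h>

enum { V = 4 };  /* 4 floats per 16-byte vector */

/* Partial vector store: writes only the lanes whose element index
   lies in [lo, hi), leaving the rest of the chunk untouched. */
static void vstore_partial(float *a, int base, const float v[V],
                           int lo, int hi) {
    for (int j = 0; j < V; j++)
        if (base + j >= lo && base + j < hi) a[base + j] = v[j];
}

/* Emulates a simdized a[i+3] = b[i+1] + c[i+2] for i in [0, n), n >= 1.
   The store stream is processed one aligned chunk of a at a time. */
void simdized_add(float *a, const float *b, const float *c, int n) {
    int lo = 3, hi = n + 3;              /* the loop writes a[lo..hi)      */
    int first = lo - lo % V;             /* chunk holding the first store  */
    int last = (hi - 1) - (hi - 1) % V;  /* chunk holding the last store   */

    for (int base = first; base <= last; base += V) {
        float v[V] = {0};
        for (int j = 0; j < V; j++) {
            int e = base + j;            /* element of a, i.e. i = e - 3   */
            if (e >= lo && e < hi) v[j] = b[e - 2] + c[e - 1];
        }
        if (base == first || base == last)
            vstore_partial(a, base, v, lo, hi);  /* prologue / epilogue    */
        else
            for (int j = 0; j < V; j++) a[base + j] = v[j];  /* steady state */
    }
}
```

For n = 10 the first chunk writes only a[3], the chunks starting at a[4] and a[8] are stored whole, and the last chunk writes only a[12].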
Code Generation with Multiple Statements
• SIMD execution of:
    for (i=0; i<100; i++) { a[i] = ... ; b[i+1] = ...; c[i+3] = ... }
[Figure: the store streams a[i] (a0..a99), b[i+1] (b1..b100), and c[i+3] (c3..c102) drawn against 16-byte boundaries, each split into a simdized loop prologue, a simdized loop steady state, and a simdized loop epilogue.]
    implicit loop skewing
Performance Study
• Implementation
    algorithm implemented in IBM’s production XL compiler
• Compares different shift placement policies
    ZERO:  shifts to/from offset zero [Vast compiler]
    EAGER: eagerly shifts to store offset
    LAZY:  lazily shifts to store offset
    DOM:   lazily shifts to/from dominant offset
    + SEQ: no simdization
• Code gen. redundancy elimination
    sp:   using “software pipelining” technique within simdization
    cse+: using separate CSE phase, CSE among consecutive loop iterations
    + ∅:  no redundancy elimination
Performance Study (cont.)
• Initial experiments
    16-byte SIMD units with 4 floats per vector register
    loops with 6 loads, 1 store, offsets randomly selected
    loaded values are simply added together
    → 7 mem + 5 add = 12 ops
• Performance numbers
    operations per datum
    harmonic mean over 50 loops with different random offsets
• Reports
    provable lower bounds
    operation count from a cycle-accurate simulator, including
      actual computations
      address computations
      branching overhead, spills, ...
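The metric can be made concrete with a little arithmetic. The sketch below assumes that "operations per datum" means the operations in one loop iteration divided by the number of data elements that iteration produces; this reading, and the function names, are assumptions rather than definitions from the slides. Under that reading the sequential loop costs 12 ops per datum, and a simdized loop with 4 floats per register and no shift ops would cost 12 / 4 = 3.

```c
#include <assert.h>

/* Operations per datum for one loop iteration that produces `lanes`
   results (lanes = 1 for sequential code, 4 for 4 floats per register). */
double ops_per_datum(int mem_ops, int add_ops, int shift_ops, int lanes) {
    return (double)(mem_ops + add_ops + shift_ops) / lanes;
}

/* Harmonic mean of n positive values, as used to average over loops. */
double harmonic_mean(const double *x, int n) {
    double inv_sum = 0.0;
    for (int i = 0; i < n; i++) inv_sum += 1.0 / x[i];
    return n / inv_sum;
}
```

Each vshiftstream in the steady state adds one op per vector, so for example 2 shifts would raise the simdized cost from 3.0 to 3.5 ops per datum in this model.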
Selecting Best Code Gen. Scheme (7 Mem, 5 Add Loops)
[Chart: operations per datum (y-axis, 3.0 to 7.0) for LAZY, DOM, EAGER, and ZERO, each with sp, with cse+, and without redundancy elimination, plus SEQ. Each bar is split into lower bound, actual shift overhead, and compiler overhead. Off-scale value labels: 10.18 and 12.00. 50 loops with randomly generated offsets, harmonic mean of operations per datum.]
Comparing Shift Policies
(7 Mem, 5 Add Loops)
[Chart: operations per datum (y-axis, 3.0 to 6.0) for LAZY cse+, DOM sp, EAGER cse+, ZERO cse+, and SEQ. Each bar is split into lower bound, actual shift overhead, and compiler overhead. Value labels as extracted: 4.96, 4.13, 4.23, 4.02, and 12.00 (off scale). 50 loops with randomly generated offsets, harmonic mean of operations per datum.]
Comparing Shift Policies
(Using Offset Reassociation)
[Chart: operations per datum (y-axis, 3 to 6) for LAZY cse+, DOM sp, EAGER cse+, ZERO cse+, and SEQ. Each bar is split into lower bound, actual shift overhead, and compiler overhead. 50 loops with randomly generated offsets, harmonic mean of operations per datum.]
Summary
• General framework to optimize data reorganization
    stream concept to view values through the lifetime of a loop
    data reorganization graph to minimize reorganization due to misalignment
    extendable
• Efficient code generation
    support for arbitrarily optimized graphs (for alignment known at compile time)
    simdization without redundant loads/computations
    simdized prologue/epilogue (important for short trip counts)
• Good performance in the presence of misaligned data
    with 75% or more of the data misaligned:
      speedup factor of 3.71 with 4 floats per register
      speedup factor of 6.06 with 8 shorts per register