IBM Research

Vectorization for SIMD Architectures with Alignment Constraints
Alexandre Eichenberger, Peng Wu, Kevin O'Brien
IBM T.J. Watson Research Center, Yorktown Heights, NY
© 2002 IBM Corporation

Overview
• Background on simdization
    - SIMD units
    - prior compilation techniques
• General approach
    - 3-step approach
    - data reorganization graph
    - code generation
• Performance evaluation
• Summary

Traditional Execution of a Loop
• Sequential execution of
      for (i=0; i<100; i++) a[i+3] = b[i+1] + c[i+2];
• Memory streams (the 1st iteration performs load b[1], load c[2], add, store a[3]):
      b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 b10
      c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
      a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 a10

Can Be Sped Up Using SIMD
• Single Instruction Multiple Data (SIMD) units are popular
    - typical SIMD vectors are 16 bytes: compute 2 double, 4 float/int, 8 short, or 16 char results
    - available on most platforms: VMX/AltiVec (IBM PowerPC, Apple G5), MMX/SSE (Intel x86), ...
• Theoretical speedups are high
    - e.g. factors of 2 → 16 depending on data types
• Limited support for memory alignment
    - some have limited support (SSE2 has slower misaligned load ops)
    - some have no hardware misalignment support (VMX/AltiVec, VIS)
• SIMD is hard to program
    - optimized code is very dependent on data alignment
    - difficult both for programmers and compilers

SIMD May Suffer from Alignment Problem
• The SIMD memory unit only loads/stores 16-byte chunks of 16-byte-aligned data
      memory:      0x1000: b0 b1 b2 b3 | 0x1010: b4 b5 b6 b7 | 0x1020: b8 b9 b10 ...   ("|" marks 16-byte boundaries)
      vload b[1]   → b0 b1 b2 b3       (&b[1] = 0x1004, so b[1] sits at byte offset 4 in the register)
• We refer to vectorization for SIMD units as SIMDIZATION

Alignment Problem (cont.)
• Alignment matters in registers too, e.g. in our "b[i+1]+c[i+2]" example
      vload b[1]   → b0 b1 b2 b3                 (b[1] at offset 4)
      vload c[2]   → c0 c1 c2 c3                 (c[2] at offset 8)
      vadd         → b0+c0 b1+c1 b2+c2 b3+c3
  this is not b[1]+c[2], ...

How to Align Data in Registers
• If you need a vector of 4 values starting at misaligned address b[1]
      memory:      b0 b1 b2 b3 | b4 b5 b6 b7 | b8 b9 b10 b11 | b12 ...   ("|" marks 16-byte boundaries)
      vload b[1]   → b0 b1 b2 b3       (offset 4)
      vload b[5]   → b4 b5 b6 b7
      vpermute     → b1 b2 b3 b4       (offset 0)
    - vpermute is a generic permutation op
    - it is supported by most ISAs, e.g. the byte-wise permute on VMX/AltiVec (vec_perm)
    - aligned data: 1st element is at offset 0

How to Align Data in Registers (cont.)
• When you need a stream of vectors starting at address b[1]
      memory:      b0 b1 b2 b3 | b4 b5 b6 b7 | b8 b9 b10 b11 | b12 ...   ("|" marks 16-byte boundaries)
      vload b[1], b[5], b[9], ...   → b0 b1 b2 b3 | b4 b5 b6 b7 | b8 b9 b10 b11 | ...     (offset 4)
      one vpermute per vector       → b1 b2 b3 b4 | b5 b6 b7 b8 | b9 b10 b11 b12 | ...    (offset 0)

How to Align Data in Registers (cont.)
• Back to our "b[i+1]+c[i+2]" example
      vload b[1], b[5] → vpermute   → b1 b2 b3 b4                 (offset 0)
      vload c[2], c[6] → vpermute   → c2 c3 c4 c5                 (offset 0)
      vadd                          → b1+c2 b2+c3 b3+c4 b4+c5     (offset 0)
  this is b[1]+c[2], ...

Prior Work
• Initial approaches
    - variations on "peel until all memory streams are aligned"
• Vast compiler [Crescent Bay Software]
    - one permute op per misaligned memory access
    - inefficient in the number of permute ops, which is often a critical resource,
      e.g. Mac G5: 2 SIMD memory pipes, 1 SIMD permute pipe
• Superword Level Parallelism [Larsen et al., PLDI 2000]
    - for loops: "unroll and pack into SIMD"
    - exhibits the same tradeoffs as loop unrolling vs. modulo scheduling
    - the backedge is an optimization barrier
• Our focus is on loop-dominated multimedia/gaming applications

Paper's Approach Overview
• Loop based, with optimized data reorganization
    1. Build data reorganization graph
    2. Place "shift streams" using optimizing policies
    3. Generate SIMD code
• Data reorganization graph (operation, offset of its output stream):
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vshiftstream          offset 4→12   (b stream)
      vshiftstream          offset 8→12   (c stream)
      vadd                  offset 12
      vstore a[i+3]         offset 12
• Code:
      <simdized prologue>
      for (i=0; i<100; i+=4) <simdized loop body>
      <simdized epilogue>

Data Reorganization Graph
• Streams: values addressed over the lifetime of a loop
    - memory streams, e.g. b[i+1] (i=0..99 here)
          b0 b1 b2 b3 | b4 b5 b6 b7 | ... | b96 b97 b98 b99 | b100 b101 b102 b103     offset = 4
    - register streams, the output of stream operations, e.g. vload(b[i+1])
          b0 b1 b2 b3 | b4 b5 b6 b7 | ... | b96 b97 b98 b99 | b100 b101 b102 b103     offset = 4
• The alignment of a stream can be uniquely described by the offset of its 1st value

Data Reorganization Graph (cont.)
• Additional stream operation: vshiftstream(Oin, Oout)
    - shifts the input stream to offset Oout, e.g.
          input:               b0 b1 b2 b3 | b4 b5 b6 b7 | ... | b96 b97 b98 b99 | b100 b101 b102 b103     offset = 4
          vshiftstream(4,12):  b-2 b-1 b0 b1 | b2 b3 b4 b5 | ... | b94 b95 b96 b97 | b98 b99 b100 b101     offset = 12

Data Reorganization Graph (cont.)
• vload/vstore(addr(i)): offset is given by the memory alignment
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vadd                  offset ⊥
      vstore a[i+3]         offset 12

Data Reorganization Graph (cont.)
• vload/vstore(addr(i)): offset is given by the memory alignment
• vadd(in1,in2): all offsets must be identical
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vadd                  offset ⊥
      vstore a[i+3]         offset 12

Valid Data Reorganization Graph
• Valid: no violation of offset definitions
• The graph above is not valid: vadd receives its inputs at offsets 4 and 8 while the store requires offset 12 (vadd definition violation)

Valid Data Reorganization Graph (cont.)
• Make the graph valid by suitably inserting vshiftstream(Oin, Oout), which shifts a stream to offset Oout
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vshiftstream(4,12)    offset 12   (b stream)
      vshiftstream(8,12)    offset 12   (c stream)
      vadd                  offset 12
      vstore a[i+3]         offset 12
• Valid: no violation of offset definitions
• Where to insert shift streams? First focus of our work.
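To make the stream offset and the effect of vshiftstream(Oin, Oout) concrete, here is a minimal scalar sketch in C. It is an illustration only, not the compiler's generated code: the helper names `stream_offset` and `vshiftstream_emul` are hypothetical, elements are assumed to be 4-byte floats in 16-byte vectors, and lanes that would come from outside the stream are filled with 0.0f where real hardware would simply hold don't-care values.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_BYTES  16u   /* SIMD vector width in bytes */
#define ELEM_BYTES 4u    /* assumed element size: 4-byte float */
#define LANES      4     /* elements per vector */

/* Byte offset of an address within its 16-byte chunk, e.g. 0x1004 -> 4. */
unsigned stream_offset(size_t byte_addr)
{
    return (unsigned)(byte_addr % VEC_BYTES);
}

/*
 * Scalar emulation (hypothetical helper) of vshiftstream(o_in, o_out) on a
 * register stream stored as nvec consecutive 4-float vectors.
 * Lanes with no source value inside the stream are set to 0.0f.
 */
void vshiftstream_emul(const float *in, float *out, size_t nvec,
                       unsigned o_in, unsigned o_out)
{
    assert(o_in < VEC_BYTES && o_out < VEC_BYTES);
    assert(o_in % ELEM_BYTES == 0 && o_out % ELEM_BYTES == 0);

    /* Number of lanes to move right; a negative value moves left. */
    long shift = ((long)o_out - (long)o_in) / (long)ELEM_BYTES;
    long total = (long)nvec * LANES;

    for (long p = 0; p < total; p++) {
        /* |shift| < 4, so each output vector draws on at most two
         * neighbouring input vectors, like one permute of two registers. */
        long src = p - shift;
        out[p] = (src >= 0 && src < total) ? in[src] : 0.0f;
    }
}
```

With an input stream holding b0..b7, `vshiftstream_emul(in, out, 2, 4, 12)` places b0 in lane 2 of the first output vector and b2 in lane 0 of the second, which is the layout shown for vshiftstream(4,12); shifting with `o_out = 0` instead yields the aligned stream b1 b2 b3 b4 | ... .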
Shift Stream Placement: Zero Policy
• Shifts all misaligned streams to/from offset zero
    - least optimized, used for runtime alignment
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vshiftstream(4,0)     offset 0    (b stream)
      vshiftstream(8,0)     offset 0    (c stream)
      vadd                  offset 0
      vshiftstream(0,12)    offset 12
      vstore a[i+3]         offset 12

Shift Stream Placement: Eager Policy
• Eagerly shifts to the store offset
    - offset 12 is the store alignment here
      vload b[i+1]          offset 4
      vload c[i+2]          offset 8
      vshiftstream(4,12)    offset 12   (b stream)
      vshiftstream(8,12)    offset 12   (c stream)
      vadd                  offset 12
      vstore a[i+3]         offset 12
    - shifts: 3 → 2 compared to Zero-Shift

Shift Stream Placement: Lazy Policy
• Lazily shifts to the store offset
    - b[i+1] and c[i+1] have the same alignment => delay shifting past the add
      vload b[i+1]          offset 4
      vload c[i+1]          offset 4
      vadd                  offset 4
      vshiftstream(4,12)    offset 12
      vstore a[i+3]         offset 12
    - shifts: 3 → 1 compared to Zero-Shift

Shift Stream Placement: Dominant Policy
• Lazily shifts to the dominant offset
    - offset 4 is dominant: align to it instead of the store offset
      vload c[i+2]          offset 8
      vshiftstream(8,4)     offset 4
      vload b[i+1]          offset 4
      vadd                  offset 4
      vload d[i+1]          offset 4
      vadd                  offset 4
      vshiftstream(4,12)    offset 12
      vstore a[i+3]         offset 12

Code Generation
• SIMD code is generated from a valid data reorganization graph
    - simdized loop (steady state)
    - simdized prologue & epilogue with partial vector stores
    - no redundant load/computation is introduced
• Handles
    - statements with stride-one accesses, loop invariants (more is being added)
    - multiple statements with arbitrary misalignments
• With alignment known at
    - compile time: arbitrary data reorganization graphs
    - runtime: Zero-shift policy only (limitation due to code generation issues)

Code Generation with Multiple Statements
• SIMD execution of:
      for (i=0; i<100; i++) { a[i] = ...; b[i+1] = ...; c[i+3] = ...; }
      a[i]   =  a0 a1 a2 a3 | a4 a5 a6 a7 | ... | a96 a97 a98 a99 | a100 a101 a102 a103
      b[i+1] =  b0 b1 b2 b3 | b4 b5 b6 b7 | ... | b96 b97 b98 b99 | b100 b101 b102 b103
      c[i+3] =  c0 c1 c2 c3 | c4 c5 c6 c7 | ... | c96 c97 c98 c99 | c100 c101 c102 c103
    - each stream is covered by a simdized loop prologue, a simdized steady state, and a simdized loop epilogue
    - implicit loop skewing

Performance Study
• Implementation
    - algorithm implemented in IBM's production XL compiler
• Compares different shift placement policies
    - ZERO: shifts to/from offset zero [Vast compiler]
    - EAGER: eagerly shifts to store offset
    - LAZY: lazily shifts to store offset
    - DOM: lazily shifts to/from dominant offset
    - SEQ: no simdization
• Code gen. redundancy elimination
    - sp: using a "software pipelining" technique within simdization
    - cse+: using a separate CSE phase, CSE among consecutive loop iterations
    - ∅: no redundancy elimination

Performance Study (cont.)
• Initial experiments
    - 16-byte SIMD units with 4 floats per vector register
    - loops with 6 loads, 1 store, offsets randomly selected
    - loaded values are simply added together → 7 mem + 5 add = 12 ops
• Performance numbers
    - operations per datum
    - harmonic mean over 50 loops with different random offsets
• Reports
    - provable lower bounds
    - operation count from a cycle-accurate simulator, including actual computations, address computations, branching overhead, spills, ...
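The shift counts quoted for the placement policies can be reproduced with a small C sketch for the loop shape used in the experiments (several loads summed into one store). This is a simplified model, not the compiler's algorithm: the function names are hypothetical, ties for the dominant offset are broken toward the lowest offset, and `ops_per_datum` assumes that each vshiftstream costs exactly one extra operation per simdized iteration, ignoring prologue, epilogue and compiler overhead.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_BYTES 16u   /* SIMD vector width in bytes */

/* ZERO: shift every misaligned load to offset 0, then shift the result
 * from offset 0 to the store offset if the store is misaligned. */
int shifts_zero(const unsigned *load_off, size_t nloads, unsigned store_off)
{
    int n = (store_off != 0) ? 1 : 0;
    for (size_t i = 0; i < nloads; i++)
        if (load_off[i] != 0)
            n++;
    return n;
}

/* EAGER: shift each load directly to the store offset when they differ. */
int shifts_eager(const unsigned *load_off, size_t nloads, unsigned store_off)
{
    int n = 0;
    for (size_t i = 0; i < nloads; i++)
        if (load_off[i] != store_off)
            n++;
    return n;
}

/* DOM (simplified): shift loads to the most frequent load offset, then
 * shift the final result to the store offset if it differs. */
int shifts_dominant(const unsigned *load_off, size_t nloads, unsigned store_off)
{
    int count[VEC_BYTES] = {0};
    for (size_t i = 0; i < nloads; i++) {
        assert(load_off[i] < VEC_BYTES);
        count[load_off[i]]++;
    }
    unsigned dom = 0;
    for (unsigned o = 1; o < VEC_BYTES; o++)
        if (count[o] > count[dom])
            dom = o;   /* strict '>' keeps the lowest offset on ties */
    int n = (int)nloads - count[dom];
    if (dom != store_off)
        n++;
    return n;
}

/* Assumed cost model: base operations plus one operation per shift,
 * all amortized over the data elements produced per simdized iteration. */
double ops_per_datum(int base_ops, int shifts, int lanes)
{
    return (double)(base_ops + shifts) / lanes;
}
```

For load offsets {4, 8} and store offset 12 this gives 3 shifts under ZERO and 2 under EAGER, and for load offsets {8, 4, 4} with store offset 12 the simplified DOM count is 2, matching the graphs on the policy slides. Under the assumed cost model, a 12-operation loop with no shifts and 4 floats per vector comes to 3.0 operations per datum.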
Selecting Best Code Gen. Scheme (7 Mem, 5 Add Loops)
[Bar chart. Y axis: operations / datum, 3.0 to 7.0, with two off-scale bars labeled 10.18 and 12.00. X axis: LAZY, DOM, EAGER and ZERO, each in several redundancy-elimination variants (including sp and cse+), plus SEQ. Stacked series: lower bound; lower bound + actual shift overhead; lower bound + actual shift + compiler overhead.]
50 loops with randomly generated offsets, harmonic mean operations per datum

Comparing Shift Policies (7 Mem, 5 Add Loops)
[Bar chart. Y axis: operations / datum, 3.0 to 6.0. X axis: LAZY cse+, DOM sp, EAGER cse+, ZERO cse+, SEQ. Stacked series: lower bound; lower bound + actual shift overhead; lower bound + actual shift + compiler overhead. Data labels: 12.00 (off scale), 4.96, 4.13, 4.23, 4.02.]
50 loops with randomly generated offsets, harmonic mean operations per datum

Comparing Shift Policies (Using Offset Reassociation)
[Bar chart. Y axis: operations / datum, 3 to 6. X axis: LAZY cse+, DOM sp, EAGER cse+, ZERO cse+, SEQ. Stacked series: lower bound; lower bound + actual shift overhead; lower bound + actual shift + compiler overhead.]
50 loops with randomly generated offsets, harmonic mean operations per datum

Summary
• General framework to optimize data reorganization
    - stream concept to view values through the lifetime of a loop
    - data reorganization graph to minimize reorganization due to misalignment
    - extendable
• Efficient code generation
    - support for arbitrarily optimized graphs (for compile-time alignment)
    - simdization without redundant loads/computations
    - simdized prologue/epilogue (important for short trip counts)
• Good performance in the presence of misaligned data
    - with 75% or more of the data misaligned:
          speedup factor of 3.71 with 4 floats per register
          speedup factor of 6.06 with 8 shorts per register