IBM Software Group

IBM XL - Compiling for CELL
Mark Mendell
March 24, 2008
ECE540
© 2008 IBM Corporation

Topics
- Quick IBM XL static compiler overview
- Cell Broadband Engine
- Generating Single Instruction Multiple Data (SIMD) code
- Cell "single source" compiler

The XL Compiler Architecture
- Compile-step optimization: the FORTRAN, C++, and C front ends each emit Wcode (the common intermediate language), which TPO optimizes and TOBEY lowers to optimized objects.
- Link-step optimization: the system linker drives IPA over the Wcode partitions carried in objects, libraries, and DLLs; instrumented runs feed PDF info back in; separate SPU and PPU paths combine with other objects to produce the final EXE.

TOBEY (Toronto Optimizing Back-End with Yorktown)
Low-level (machine-level) optimizer.
- Traditional optimizations: value numbering, constant propagation, commoning, unrolling, inlining, strength reduction (reassociation), dead code and dead store elimination, and many others
- Scheduling: global and local (before and after register allocation), superblock, swing modulo scheduling
- Register allocation
- Prologue/epilogue generation
- Assembler or object output; object listing

TPO (Toronto Portable Optimizer)
High-level optimizer; works on Wcode as both input and output.
- IPA (Interprocedural Analysis): optimizes across an entire application rather than one file at a time
- PDF (Profile-Directed Feedback): gathers information from sample runs and retunes optimization accordingly
- SMP (Symmetric Multiprocessing) optimization, including automatically parallelizing single-threaded applications
- HOT (High-Order Transformations): loop optimizations to improve cache utilization
- SIMD (Single Instruction, Multiple Data): replaces scalar code with SIMD code

More TPO Optimizations
- Traditional data-flow optimizations
- Loop analysis and transformation
- Parallelization (OpenMP)
- SIMD exploitation
- Whole-program optimization
- Data reorganization: data shape and affinity analysis; splitting/grouping/compression/interleaving
- Profile-directed optimization
- Value Range Propagation (VRP): keeps track of relative values of expressions in a program, for example "x<y+1" or "x!=0"
- Automatic parallelization

Cell Broadband Engine
- Multiprocessor on a chip
- Power Processor Element (PPE): general purpose, running full-fledged OSs
- Synergistic Processor Element (SPE): optimized for compute density
- Performance is achieved by parallelizing across all the heterogeneous processing elements

Cell Broadband Engine (chip overview)
- Heterogeneous multi-core engine: one multi-threaded Power processor (PPE, with L1 and L2) and up to 8 compute-intensive-ISA engines (SPEs)
- Local memories: fast access to the 256KB local store of each SPE; globally coherent DMA to transfer data
- Pervasive SIMD: the PPE has VMX; the SPEs are SIMD-only engines
- High bandwidth: fast internal Element Interconnect Bus (96 bytes/cycle, about 200GB/s); dual XDR controller (25.6GB/s); two configurable interfaces to external memory and I/O (76.8GB/s); figure link widths of 8 bytes per direction, 16 bytes one direction, and 128 bytes one direction; numbers based on a 3.2GHz clock rate

Outline
Two dimensions of the talk, starting from multiple-ISA hand-tuned programs:
- SIMD programs: explicit SIMD coding -> SIMD/alignment directives -> automatic simdization
- Parallelization: explicit parallelization with local memories -> shared memory, single-program abstraction -> automatic parallelization
Part 1: Automatic SPE tuning. Part 2: Automatic simdization. Part 3: Shared memory & single-program abstraction.
SPE Features Optimized for by the Compiler
- SIMD-only functional units: 16-byte register/memory accesses; operands must be parallel and properly aligned
- Dual-issue of instructions: full dependence check in hardware
- Simplified branch architecture: no hardware branch predictor; compiler-managed hints and predication
- Single-ported local store (256KB): aligned accesses only; contentions alleviated by the compiler
Figure: even pipe (floating/fixed point) and odd pipe (branch, memory, permute) fed by dual-issue instruction logic; instruction buffer (3.5 x 32 instructions); register file (128 x 16-byte registers); local store (256KB, single-ported); globally coherent DMA. Latencies: branch 1-2, branch hint 1-2, instruction fetch 2, DMA request 3. Widths: 8 bytes per direction, 16 bytes one direction, 128 bytes one direction.

Feature #1: SPE's Functional Units are SIMD Only
- All transfers are 16 bytes wide, including the register file and memory.
- How do we handle scalar code?

Single Instruction Multiple Data (SIMD)
SIMD is meant to process multiple "b[i]+c[i]" data elements per operation: R1 = (b0, b1, b2, b3) and R2 = (c0, c1, c2, c3) are loaded from memory streams on 16-byte boundaries, and a single VADD produces R3 = (b0+c0, b1+c1, b2+c2, b3+c3).

Scalar Code on Scalar Functional Units
Example: a[2] = b[1] + c[3]
- LOAD b[1] -> r1 = b1
- LOAD c[3] -> r2 = c3
- ADD -> r3 = b1 + c3
- STORE a[2]

Scalar Code on SIMD Functional Units
Same example, a[2] = b[1] + c[3], but on 16-byte-wide units:
- LOAD b[1] -> r1 = (b0, b1, b2, b3); LOAD c[3] -> r2 = (c0, c1, c2, c3)
- ADD -> r3 = (b0+c0, b1+c1, b2+c2, b3+c3)
- Problem #1: memory alignment defines the data's location in the register.
- Problem #2: adding the values as loaded yields the wrong result.
- Problem #3: STORE a[2] writes the whole vector and clobbers the neighboring values a0, a1, a3.

Scalar Load Handling
Use a read-rotate sequence:
- LOAD b[1] -> r1 = (b0, b1, b2, b3)
- ROTATE by &b[1] -> r1' = (b1, b2, b3, b0)
Overhead (1 op): one quadword byte rotate.
Outcome: the desired scalar value is always in the first slot of the register; this addresses Problems 1 & 2.

Scalar Store Handling
Use a read-modify-write sequence:
- LOAD a[2] -> the 16-byte line containing a[2]
- CWD &a[2] -> generate the proper insertion mask for &a[2]
- SHUFFLE -> insert the computed value r3 = b1+c3 into the line
- STORE a[2] -> write back (a0, a1, b1+c3, a3); the neighbors survive
Overhead (1 to 3 ops): one shuffle, one mask formation (may be reused), one load (may be reused).
Outcome: the SIMD store does not clobber memory; this addresses Problem 3.
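The read-rotate load and read-modify-write store above can be modeled on any host in plain C. This is an illustrative sketch, not SPU intrinsics: `qword`, `scalar_load`, and `scalar_store` are hypothetical names, and the rotate/shuffle steps are expressed as plain array indexing over an aligned 16-byte line.

```c
#include <assert.h>
#include <string.h>

/* Illustrative model of a 16-byte SPE register holding 4 floats. */
typedef struct { float slot[4]; } qword;

/* Scalar load as load-quadword + rotate: fetch the aligned 16-byte line
 * containing base[i], then "rotate" so base[i] lands in the first slot. */
static float scalar_load(const float *base, int i) {
    qword r;
    memcpy(&r, base + (i & ~3), sizeof r);   /* aligned quadword load */
    return r.slot[i & 3];                    /* value after rotate-to-slot-0 */
}

/* Scalar store as read-modify-write: load the whole line, insert the value
 * (the CWD mask + SHUFFLE on real hardware), then store the whole line
 * back, leaving the three neighboring values intact. */
static void scalar_store(float *base, int i, float v) {
    qword r;
    memcpy(&r, base + (i & ~3), sizeof r);   /* read                  */
    r.slot[i & 3] = v;                       /* modify (masked insert) */
    memcpy(base + (i & ~3), &r, sizeof r);   /* write back 16 bytes   */
}
```

For the slide's example, `scalar_store(a, 2, scalar_load(b, 1) + scalar_load(c, 3))` computes a[2] = b[1] + c[3] without clobbering a[0], a[1], or a[3].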
Optimizations for Scalar on SIMD
The significant overhead of scalar loads and stores can be lowered:
- For vectorizable code: generate SIMD code directly (done by expert programmers or by the compiler) to fully utilize the SIMD units.
- For scalar variables: allocate each scalar in the first slot of its own 16-byte line. This eliminates the rotate when loading (the data is guaranteed to be in the first slot: Problems 1 & 2) and the read-modify-write when storing (the other data in the 16-byte line is garbage: Problem 3), but it wastes space.

Feature #2: Software-Assisted Branch Architecture
- No hardware branch predictor, but: compare/select ops for predication, and a software-managed branch hint (one hint active at a time).
- Overhead is lowered by predicating small if-then-else blocks and by hinting predictably taken branches.

Feature #3: Software-Assisted Instruction Issue
- The SPE can dual-issue parallel instructions (with a full dependence check in hardware), but code layout constrains dual issuing.
- The constraints are alleviated by making the scheduler aware of the code-layout issue.

Alleviating Issue Restrictions
- Scheduling: find the best possible schedule; the dependence graph is modified to account for the latency of false dependences.
- Bundling: enforce the code-layout restrictions; keep track of even/odd code layout at all times; swap parallel ops when needed; insert (even or odd) nops when needed.
- Engineering issues: each function must start at a known even/odd code-layout boundary, and no instructions can be added after the last scheduling phase, since that would change the code layout and thus the dual-issuing constraints.

Feature #4: Single-Ported Local Memory
- The local store is single-ported: denser hardware, with an asymmetric port (16 bytes for load/store ops, 128 bytes for instruction fetch and DMA) and a static priority of DMA > MEM > IFETCH.
- If we are not careful, we may starve for instructions.

Hinting Branches & Instruction Starvation Prevention
- The SPE provides a HINT operation that fetches the branch target into a HINT buffer, so a correctly predicted branch incurs no penalty. The hint fetches ops from the target and needs a minimum of 15 cycles and 8 intervening ops before the branch; the compiler inserts hints when beneficial.
- Impact on instruction starvation: after a correctly hinted branch, the IFETCH window is smaller.
Figure: instruction buffers of FP/MEM issue-slot pairs, the HINT buffer, and the IFETCH window refill latency.

SPE Optimization Results (Kernels)
Relative reductions in execution time for single-SPE, optimized, simdized kernel code (Huffman, FFT, IDEA, LU, VLD, Linpack, Convolution, MatMult, Saxpy, among others), comparing Original, +Bundle, +Branch Hint, and +Ifetch: the average execution time drops from 1.00 to 0.78.

Outline - Part 2: Automatic Simdization

Successful Simdization
Simdization must both extract parallelism and satisfy SIMD constraints.
- Parallelism can be extracted at several levels, e.g. the loop level: for (i=0; i<256; i++)
a[i] = ... (one statement across consecutive iterations); at the basic-block level: a[i+0] = ..., a[i+1] = ..., a[i+2] = ..., a[i+3] = ...; and for entire short loops: for (i=0; i<8; i++) a[i] = ....
- Constraints to satisfy: alignment (loads work on 16-byte boundaries, so forming b1 b2 b3 b4 takes vload b[1], vload b[5], and a vpermute); data size conversion (a short-to-int loop must unpack each vector load into two int vectors, doubling the adds and stores); multiple targets (generic, VMX, SPE).

Example of SIMD-Parallelism Extraction
Loop-level SIMD applies a single statement across consecutive iterations. It is successful at:
- efficiently handling misaligned data
- pattern recognition (reduction, linear recursion)
- leveraging the loop transformations present in most compilers
It can also be applied at the basic-block level (a[i+0] = ..., ..., a[i+3] = ...) and to entire short loops (for (i=0; i<8; i++) a[i] = ...).
[Bik et al., IJPP 2002] [VAST compiler, 2004] [Eichenberger et al., PLDI 2004] [Wu et al., CGO 2005] [Naishlos, GCC Developer's Summit 2004]

Example of SIMD Constraints
Alignment in SIMD units matters. Consider "b[i+1] + c[i+0]" at i=0: vload b[1] actually loads the aligned quadword (b0, b1, b2, b3), and the load of c yields (c0, c1, c2, c3); adding them computes (b0+c0, b1+c1, b2+c2, b3+c3), which is not b[1] + c[0].

Example of SIMD Constraints (cont.)
When the alignments of the inputs do not match, the data must be realigned: vload b[1] and vload b[5] yield (b0...b3) and (b4...b7); a vpermute (shuffle) extracts R1 = (b1, b2, b3, b4), and the add then correctly computes (b1+c0, b2+c1, b3+c2, b4+c3).

Automatic Simdization for Cell: an Integrated Approach
- Extract parallelism at multiple levels, satisfy all SIMD constraints, and use a "virtual SIMD vector" as the glue between them.
- Minimize alignment overhead: lazily insert data reorganization; handle both compile-time and runtime alignment; simdize the prologue/epilogue for SPEs (memory accesses are always safe on the SPE).
- Target multiple ISAs: generic, VMX, SPU, BG/L.
- Provide full-throughput computation even in the presence of data conversions, and handle manually unrolled loops...
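The vload/vpermute realignment described above can be sketched in plain C. This is illustrative only: `vec4`, `vload`, `vpermute`, and `add_shifted` are hypothetical names, and the permute is modeled by extracting four consecutive elements from the concatenation of two aligned loads.

```c
#include <assert.h>
#include <string.h>

typedef struct { float s[4]; } vec4;   /* model of a 4-float SIMD register */

static vec4 vload(const float *p) {    /* aligned quadword load */
    vec4 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Model of vpermute: concatenate lo|hi and take 4 consecutive floats
 * starting at 'shift' (0..4), realigning a misaligned stream. */
static vec4 vpermute(vec4 lo, vec4 hi, int shift) {
    float both[8];
    vec4 r;
    memcpy(both, lo.s, sizeof lo.s);
    memcpy(both + 4, hi.s, sizeof hi.s);
    memcpy(r.s, both + shift, sizeof r.s);
    return r;
}

/* a[0..3] = b[1..4] + c[0..3]: b is misaligned by one element relative to
 * a and c, so two aligned loads of b plus one permute realign it. */
static void add_shifted(float *a, const float *b, const float *c) {
    vec4 bb = vpermute(vload(b), vload(b + 4), 1);   /* b1 b2 b3 b4 */
    vec4 cc = vload(c);
    for (int i = 0; i < 4; i++)
        a[i] = bb.s[i] + cc.s[i];
}
```

The lazy-insertion strategy above amounts to placing such permutes only where two streams with different alignments actually meet, rather than after every load.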
A Unified Simdization Framework
- Global information gathering: pointer analysis, alignment analysis, constant propagation, ...
- Architecture-independent transformations for SIMD: dependence elimination, data layout optimization, idiom recognition, with diagnostic output.
- Architecture-specific simdization: straight-line-code simdization and loop-level simdization, feeding a SIMD intrinsic generator for BG, VMX, and CELL.

SPE Simdization Results (Kernels)
Speedup factors of single-SPE, optimized, automatically simdized code vs. scalar code, per the bar chart: 2.4, 2.5, 2.9, 2.9, 7.5, 8.1, 9.9, 11.4, 25.3, and 26.2 across the kernels Dot Product, Checksum, Alpha Blending, Saxpy, Mat Mult, Linpack, Swim-l2, FIR, Autcor, and their average.

Example Program - SIMD (noopt)

  float a[1000], b[1000], c[1000];
  int main() {
    int i;
    for (i = 0; i < 1000; i++)
      a[i] = b[i] + c[i];
  }

Compile: spuxlc -S t.c. The loop body (opcode and operand columns rejoined):

  .LC__3:
          ila     $2,b
          lqd     $3,32($1)
          shli    $4,$3,2
          lqx     $2,$2,$4
          rotqby  $2,$2,$4
          ila     $3,c
          lqx     $3,$3,$4
          rotqby  $3,$3,$4
          fa      $2,$2,$3
          ila     $3,a
          lqx     $5,$3,$4
          cwx     $6,$4,$3
          shufb   $2,$2,$5,$6
          stqx    $2,$3,$4
          lqd     $2,32($1)
          ai      $3,$2,1
          lqd     $2,32($1)
          cwd     $4,0($1)
          shufb   $2,$3,$2,$4
          stqd    $2,32($1)
          il      $2,1000
          cgt     $2,$2,$3
          brnz    $2,.LC__3

Note the read-rotate scalar loads (lqx + rotqby) and read-modify-write scalar stores (lqx/cwx/shufb/stqx), including for the loop counter i, which is kept on the stack.

Example Program - SIMD (O2)
Same source, compiled with spuxlc -S -O2 t.c:

  .LC__3:
          ai      $5,$5,-1
          lqx     $8,$2,$7
          lqx     $9,$4,$7
          lqx     $10,$6,$7
          cwx     $11,$6,$7
          rotqby  $8,$8,$7
          rotqby  $9,$9,$7
          fa      $8,$8,$9
          shufb   $8,$8,$10,$11
          stqx    $8,$6,$7
          ai      $7,$7,4
          brnz    $5,.LC__3

The counter now lives in a register, but the body is still scalar: rotates for the loads, a shuffle for the store.

Example Program - SIMD (O3 -qhot=SIMD)
Same source, compiled with spuxlc -S -O3 -qhot t.c (unrolling & modulo scheduling disabled here):

          il      $5,250
          hbrr    .LC__20,.LC__3
          ila     $2,a
          ila     $4,b
          ila     $6,c
          il      $9,0
          lnop
  .LC__3:
          ai      $5,$5,-1
          lqx     $7,$4,$9
          lqx     $8,$6,$9
          fa      $7,$7,$8
          nop     $1
          stqx    $7,$2,$9
          ai      $9,$9,16
          brnz    $5,.LC__3
  .LC__20:

The loop is now fully simdized: 250 iterations of 16 bytes (4 floats) each, no rotates or shuffles, and the backward branch is hinted (hbrr).

SIMD Report (-qreport)

  Examine loop <1> on line 4 in file "t.c"
  Peeling scheme: peel simd statements for single align with the following characteristics:
    Prologue: 0 blocked loops with max trip count of 0
    Main loop: orig ub is 1000u, new ub is 1000u
    Epilogue: 0 blocked loops with max trip count of 0
  (simdizable) []

SIMD Report (cont'd)

  float a[1000], b[1001], c[1000];
  int main() {
    int i;
    for (i = 0; i < 1000; i++)
      a[i] = b[i+1] + c[i];
  }
  ...
  (simdizable) [misalign() shift(1 compile-time)]

SIMD Report (cont'd)

  float a[1001], c[1000];
  int main() {
    int i;
    for (i = 0; i < 1000; i++)
      a[i] = a[i-1] + c[i];
  }
  ...
  recurrence on self: a[]0{6}:(flow):(1 )
  5 | a[]0[$.CIV0] = a[]0[$.CIV0 - 1] + c[]0[$.CIV0];
  (non_simdizable)

Single Source Compiler

Outline - Part 3: Shared Memory & Single-Program Abstraction

Cell Memory & DMA Architecture
- Local stores are mapped in the global address space; an SPU can access/DMA that memory, and access rights can be set (MMU, TLBs, MFC registers).
- An SPE can initiate DMAs to any global address, including the local stores of other SPEs.
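A host-side model of this globally addressed transfer scheme can be written in plain C. Illustrative only: the names `dma_get`/`dma_put` mirror the slides, not the Cell SDK's MFC API, and system memory plus one 256KB local store are modeled as arrays.

```c
#include <assert.h>
#include <string.h>

enum { LS_SIZE = 256 * 1024 };          /* one SPE local store */
static unsigned char local_store[LS_SIZE];
static unsigned char sys_mem[1 << 20];  /* stand-in for the global space */

/* Transfer n bytes from the global effective address ea into the local
 * store at offset ls_off (system memory -> local store). */
static void dma_get(unsigned ls_off, unsigned ea, unsigned n) {
    assert(ls_off + n <= LS_SIZE && ea + n <= sizeof sys_mem);
    memcpy(local_store + ls_off, sys_mem + ea, n);
}

/* Transfer n bytes from the local store back out to the global address
 * space (local store -> system memory). */
static void dma_put(unsigned ea, unsigned ls_off, unsigned n) {
    assert(ls_off + n <= LS_SIZE && ea + n <= sizeof sys_mem);
    memcpy(sys_mem + ea, local_store + ls_off, n);
}
```

Because the local stores themselves are mapped into the global address space, a real SPE can target another SPE's local store with the same kind of get/put.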
- Address translation is done by the MMU; memory requests come from the PPE (through L1/L2), from I/O devices, and from external DMA, all through the mapped global space. Note that all elements may be masters: there are no designated slaves.

Dual-Source Compilation of a Cell Program (manual compiling & binding)
- SPE sources are compiled by the SPE compiler into SPE objects, linked (with the SPE libraries) by the SPE linker into an SPE executable, which the SPE embedder wraps into a PPE object.
- PPE sources are compiled by the PPE compiler into PPE objects; the PPE linker combines the PPE objects, the embedded SPE code, and the PPE libraries into a single executable whose memory image contains the PPE code, data, and SPE code.

Anatomy of a Cell Program
For the loop for (i=0; i<10K; i++) A[i] = B[i] + C[i]:
- PPE program: A1, invoke the thread library to start threads; A2, load the SPE code "loop" and initiate it; A3, wait for the SPEs to finish.
- SPE code loop(lb, ub): B1, dma_get B,C[lb:ub]; B2, for (i=lb; i<ub; i++) A[i] = B[i] + C[i]; B3, dma_put A[lb:ub].

"Single Source" Compiler
- The user prepares an application as a collection of one or more source files containing OpenMP pragmas.
- The compiler uses the pragmas to partition code between the PPE and the SPEs.
- The compiler handles data transfers: it identifies accesses in SPE functions that refer to data in system-memory locations and uses static buffers or a software cache to transfer this data to/from the SPE local stores.
- The compiler handles code size: automatic code partitioning based on call relationships and size.

"Single Source" Compilation of a Cell Program (automatic compiling & binding)
From a single source such as

  #pragma omp parallel for
  for (i=0; i<10000; i++)
    A[i] = B[i] + C[i];

the architecture-independent compiler generates the PPE and SPE sources itself; they then flow through the same PPE/SPE compile, link, and embed steps as in the dual-source picture to produce one executable.

Compiling a Single Source File for the Cell
Single source:

  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

On the PPE: foo1(); runtime distribution of work (invoke foo3_SPU for i=[0,N)); runtime barrier; foo2(). The outlined functions are

  foo3(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

  foo3_SPU(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

In the SPE code, A, B, and x are shared.

Compiling a Single Source File for the Cell (cont.)
On the SPE, the outlined function is blocked over local buffers A'[M] and B'[M]:

  foo3_SPU(LB,UB):
    for (k=LB; k < UB; k+=M) {
      DMA M elements of B into B'
      for (j=0; j<M; j++)
        A'[j] = cache_lookup(x) * B'[j];
      DMA M elements of A out of A'
    }
    runtime barrier

Using OpenMP to Partition/Parallelize across Cell
- A single-source program contains C, C++, or Fortran with OpenMP user directives or pragmas.
- The compiler "outlines" all code within the pragmas into separate functions compiled for the SPE.
- It replaces the outlined code with a call to the parallel runtime, and compiles this code for the PPE.
- The master thread executes on the PPE.
- The PPE runtime places outlined functions on a work queue containing information about the number of iterations to execute, or the "chunk" size for each SPE.
- It creates up to 16 SPE threads that pull work items (outlined parallel functions) from the queue and execute them on the SPEs.
- It may wait for SPE completion, or proceed with other PPE statement execution.

Why OpenMP Directives?
- Reasonable acceptance in the industry, growing with the increasing ubiquity of multi-core System-on-a-Chip (SoC) designs.
- Allows us to sidestep the issues of auto-parallelization detection, for now.
- Simplifies memory-consistency issues: we adhere to the OpenMP shared-memory, relaxed-consistency model.
- May be extensible to address future accelerator-specific features.
- Provides a path to a fully automatic approach based on the underlying compiler support.

PPE Runtime
- The first OMP construct initializes the runtime system: create the SPE threads and load the SPE runtime; create the work queue and get the DMA queue addresses; send the address of the work queue to each SPE; set global options.
- It sends a "setup_done" to the SPEs after partitioning/scheduling the work items.
- Parallel regions all run on the SPEs (PPE worker threads are optional). The master thread executes

    ...
    omp_rte_init();
    omp_rte_do_par(ol$1);
    ...

  where the outlined function is

    void ol$1_PPE(LB, UB)
      for (i=LB; i<UB; i++)
        A[i] = B[i] + D[ C[i] ];

SPE Runtime
- An infinite loop waiting for signals from the PPE runtime: it loops continuously, looking for work.
- It DMA-fetches work items from the work queue in system memory.
- Depending on the work type, it translates the address of the SPE outlined procedure from the PPE outlined procedure and invokes the SPE outlined procedure.

Runtime Interaction
On the PPE, the master thread again runs

  ...
  omp_rte_init();
  omp_rte_do_par(ol$1);
  ...
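The outline-and-dispatch flow can be sketched in plain C. This is a serial, host-side approximation under assumed names (`ol_1`, `do_par`): the compiler-outlined region becomes a function over [LB,UB), and the runtime hands chunk-sized work items to workers; on Cell, up to 16 SPE threads would pull these items from the work queue instead.

```c
#include <assert.h>

enum { N = 100, CHUNK = 16 };
static int A[N], B[N], C[N], D[N];

/* Outlined parallel region: the body of the omp parallel-for. */
static void ol_1(int lb, int ub) {
    for (int i = lb; i < ub; i++)
        A[i] = B[i] + D[C[i]];
}

/* Runtime dispatch: split [0,n) into chunk-sized work items and run each.
 * Here the "workers" run serially; the Cell runtime instead enqueues the
 * items and SPE threads pull and execute them. */
static void do_par(void (*work)(int, int), int n, int chunk) {
    for (int lb = 0; lb < n; lb += chunk) {
        int ub = lb + chunk < n ? lb + chunk : n;
        work(lb, ub);
    }
    /* implicit barrier: all items are done before returning */
}
```

A call such as `do_par(ol_1, N, CHUNK)` plays the role of `omp_rte_do_par(ol$1)` in the slides.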
The master thread communicates through system memory (and a software cache) with the SPE runtime: the PPE runtime handles partitioning, scheduling, synchronization, and communication, while the SPE runtime performs the work items and communicates. The SPE version of the outlined function stages data through local buffers:

  void ol$1_SPE(LB, UB)
    for (k=LB; k<UB; k+=100) {
      DMA 100 B,C elements into B',C'
      for (i=0; i<100; i++)
        A'[i] = B'[i] + cache_lookup(D[ C'[i] ]);
      DMA 100 A elements out of A'
    }

Competing for the SPE Local Store
The local store is fast but needs support when full; the compiler provides it:
- Code too large: the compiler partitions the code, and a partition manager pulls in code as needed.
- Data with regular accesses too large: the compiler stages data in and out using static buffering, and can hide latencies by using double buffering.
- Data with irregular accesses present (e.g., indirection, runtime pointers): use a software-cache approach to pull the data in and out (a last-resort solution).

Hiding Communication using Double Buffering
Original code:

  for (i=0; i<100000; i++)
    A[i] = B[i] + C[i];

Single buffering: communication is blocked, 100 elements (400 bytes) at a time:

  for (i=0; i<100000; i+=100) {
    dma_get(B', B[i], 400);
    dma_get(C', C[i], 400);
    for (ii=0; ii<100; ii++)
      A'[ii] = B'[ii] + C'[ii];
    dma_put(A[i], A', 400);
  }

Double buffering: computation and communication overlap, as their phases are software-pipelined across two buffer pairs:

  dma_get(B', B[0], 400);
  dma_get(C', C[0], 400);
  for (i=0; i<99800; i+=200) {
    dma_get(B'', B[i+100], 400);
    dma_get(C'', C[i+100], 400);
    for (ii=0; ii<100; ii++)
      A'[ii] = B'[ii] + C'[ii];
    dma_put(A[i], A', 400);
    dma_get(B', B[i+200], 400);
    dma_get(C', C[i+200], 400);
    for (ii=100; ii<200; ii++)
      A''[ii] = B''[ii] + C''[ii];
    dma_put(A[i+100], A'', 400);
  }
  for (ii=0; ii<100; ii++)
    A'[ii] = B'[ii] + C'[ii];
  dma_put(A[i+99900], A', 400);

Handling Irregular Accesses using a Software Cache
Original code:

  for (i=0; i<100000; i++)
    ... = ... D[ C[i] ];

Code with explicit cache lookup:
  for (i=0; i<100000; i++) {
    t = cache_lookup( D[ C[i] ] );
    ... = ... t;
  }

Lookup sequence:

  inline vector cache_lookup (addr)
    if (cache_directory[addr & key_mask] != (addr & tag_mask))
      miss_handler(addr);
    return cache_data[addr & key_mask][addr & offset_mask];

The miss handler DMAs in the required data, plus some suitable quantity of surrounding data. Higher degrees of associativity can be supported for little extra cost on a SIMD processor.

XL C/C++ Single Source Compiler - Primary Usage Scenario
- Support existing OpenMP programs on Cell with little or no source change.
- Allow performance tuning in parallel regions by calling out to SPU routines. Such SPU routines need to be aware of OpenMP and of PPU/SPU addresses: users need to DMA to/from PPU memory if passed a PPU address, and need to be aware of the software cache (flushes).
- Users can use __ea pointers to access the software cache from SPU code; __ea is supported only in C.

Single Source Compiler Results
Results for Swim, Mgrid, and some of their kernels (swim, calc1, calc2, calc3, mgrid, resid, psinv, rprj3): speedups with 8 SPEs, shown for both the software-cache and optimized versions, reach roughly 10-12x against a baseline of execution on a single PPE.

Questions?

Special Notices -- Trademarks
This document was developed for IBM offerings in the United States as of the date of publication; the information is subject to change without notice, and offerings may differ in other countries.
Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquires, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA. All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only. The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied. All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions. IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal without notice. IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies. All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary. IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply. Many of the features described in this document are operating system dependent and may not be available on Linux. 
For more information, please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generallyavailable systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment. Revised January 19, 2006 58 ECE540 © 2008 IBM Corporation IBM Software Group Special Notices (Cont.) -- Trademarks The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter, Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo), IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced MicroPartitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger, Virtualization Engine. A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml. Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries, or both. Rambus is a registered trademark of Rambus, Inc. XDR and FlexIO are trademarks of Rambus, Inc. UNIX is a registered trademark in the United States, other countries or both. 
Linux is a trademark of Linus Torvalds in the United States, other countries or both. Fedora is a trademark of Red Hat, Inc. Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both. Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries. TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC). SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC). AltiVec is a trademark of Freescale Semiconductor, Inc. PCI-X and PCI Express are registered trademarks of PCI SIG. InfiniBand™ is a trademark of the InfiniBand® Trade Association. Other company, product and service names may be trademarks or service marks of others. Revised July 23, 2006 59 ECE540 © 2008 IBM Corporation IBM Software Group Special Notices - Copyrights (c) Copyright International Business Machines Corporation 2007. All Rights Reserved. Printed in the United States January 2007. The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both. IBM IBM Logo Power Architecture Other company, product and service names may be trademarks or service marks of others. All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. 
The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made. THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document. IBM Microelectronics Division 1580 Route 52, Bldg. 504 Hopewell Junction, NY 12533-6351 60 ECE540 The IBM home page is http://www.ibm.com The IBM Microelectronics Division home page is http://www.chips.ibm.com © 2008 IBM Corporation IBM Software Group Backup 61 ECE540 © 2008 IBM Corporation IBM Software Group History of IBM XL Compiler Joint effort between IBM Toronto Lab and the Yorktown Heights Research Lab First available in 1990 on AIX (Power1-chip-based systems) Released on Linux for pSeries and iSeries in Feb. 
2003 Supported under the OS/400 PASE environment on iSeries CELL support started at IBM Watson in 2001 alphaWorks CELL compiler delivered in November 2005 62 ECE540 © 2008 IBM Corporation IBM Software Group Multiple-Platform C/C++/Fortran Multiple platforms including AIX, Mac OS/X, OS/400, z/OS, z/VM, Linux for iSeries and pSeries, PASE (AS400) Modular structure with common backend optimizers Compliant with ISO C 1989, ISO C 1999, ISO C++ 1998, Fortran 77/90/95, OpenMP industry standard (V2.0) High degree of option compatibility across PowerPC platforms Widely accepted by scientific and technical communities gcc compatibility, e.g. supports almost all gcc language extensions 63 ECE540 © 2008 IBM Corporation IBM Software Group Common Optimization Options -O0 (-qnoopt) Some trivial optimizations done to improve compile time (!) -O/-O2 Most common TOBEY optimizations Seems to be the most commonly used optimization level -O3 Turn on all TOBEY optimizations (some take more time) Implies –qhot=level=0 (basic loop optimizations) Implies –qnostrict (FP operations may be reordered, etc.) -qhot High Order Transformations (Loop, SIMD) -O4 (PPU only) Implies –qarch=auto –qtune=auto –qcache=auto –qipa=level=1 –qhot -O5 (PPU only) Implies above and –qipa=level=2 Whole program analysis Specify optimization options at link as well as compile time for whole program (-O4/O5) 64 ECE540 © 2008 IBM Corporation IBM Software Group More Options -qpdf1 (PPU only) Generate Profile-Directed Feedback collection code Afterwards, run the program with representative data -qpdf2 (PPU only) Recompile using PDF data to do more optimizations Can generate slower code if the training run was not similar to final runs -Q/-qinline Control inlining -qarch/-qtune Generate code for a particular machine or family (such as ppc64) Tune code for a machine or family -qcompact Try to minimize code growth during optimizations Limits some inlining, unrolling, etc. 
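The option combinations above can be sketched as a build recipe. This is illustrative only: the compiler driver name `ppuxlc` and the file names are placeholders, assuming an XL C compiler targeting the PPU.

```shell
# Whole-program optimization: pass -O5 at compile time AND link time (illustrative)
ppuxlc -O5 -qarch=auto -qtune=auto -c main.c
ppuxlc -O5 -o app main.o

# Profile-directed feedback (PPU only), as a hypothetical three-step workflow:
ppuxlc -O3 -qpdf1 -o app main.c     # 1. build instrumented binary
./app < representative_input        # 2. training run collects PDF data
ppuxlc -O3 -qpdf2 -o app main.c     # 3. recompile using the gathered profile
```

Note that if the training input in step 2 is not representative of real workloads, -qpdf2 can produce slower code, as the slide warns.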
65 ECE540 © 2008 IBM Corporation IBM Software Group Instruction Starvation Situation There are 2 instruction buffers, holding up to 64 ops along the fall-through path. When the first buffer is half empty, the hardware can initiate a refill. The refill uses the MEM port, so when the MEM port is continuously used, the refill never starts and starvation occurs (no ops left in the buffers). [Figure: dual-issue instruction logic issuing FP/MEM op pairs drawn from the instruction buffers.] 66 ECE540 © 2008 IBM Corporation IBM Software Group Instruction Starvation Prevention The SPE has an explicit IFETCH op, which initiates an instruction fetch before it is too late to hide the refill latency. The scheduler monitors the starvation situation: when the MEM port is continuously used, it inserts an IFETCH op within the red window, the last stretch of buffered ops in which a refill can still complete before the buffer drains. Compiler design: the scheduler must keep track of code layout. Hardware design: the IFETCH op is not needed if the memory port is idle for one or more cycles within the red window. [Figure: instruction buffer draining toward the red window, with the IFETCH latency overlapping the remaining buffered ops.] 67 ECE540 © 2008 IBM Corporation IBM Software Group Engineering Issues for Dual-Issue & Starvation Prevention Initially, the scheduling and bundling phases were separate: Sched finds the best schedule using latencies, issue, and resource constraints; Bundle then satisfies the dual-issue and instruction-starvation constraints by adding nops. Code (not sched) → Sched → Code (not bundled) → Bundle → Code (sched & bundled). Problem: the bundler adds an IFETCH to prevent starvation. A better schedule could be found if the scheduler had known that, but the schedule is already "finalized". 
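The prevention scheme above can be modeled in a few lines. This is a simplified sketch, not the actual XL scheduler: the buffer size and refill latency are illustrative, each op is reduced to the port it issues on, and the `insert_ifetch` function and its behavior are this sketch's own invention.

```python
def insert_ifetch(ops, buffer_ops=64, latency=15):
    """Insert explicit IFETCH ops into a scheduled stream of 'FP'/'MEM' ops.

    Model: every issued op drains one slot of the instruction buffer.
    An FP-only cycle leaves the MEM port idle, so the hardware can start
    a refill for free (the slide's hardware-design note). If the MEM port
    stays busy until only `latency` slots remain -- the "red window" --
    the scheduler must issue an IFETCH so the refill completes before
    the buffer runs dry."""
    out, remaining = [], buffer_ops
    for op in ops:
        if op == "FP":
            remaining = buffer_ops          # idle MEM cycle: free refill
        elif remaining <= latency + 1:      # red window reached on a MEM op
            out.append("IFETCH")            # explicit fetch occupies the MEM port
            remaining = buffer_ops          # refill now in flight
        out.append(op)
        remaining -= 1                      # every issued op drains the buffer
    return out

# A long run of MEM ops forces explicit fetches; interleaved FP ops do not.
print(insert_ifetch(["MEM"] * 100).count("IFETCH"))      # -> 2
print(insert_ifetch(["FP", "MEM"] * 50).count("IFETCH"))  # -> 0
```

The second call illustrates why the scheduler must track code layout cycle by cycle: a single idle MEM cycle inside the red window makes the explicit IFETCH unnecessary.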
68 ECE540 © 2008 IBM Corporation IBM Software Group Engineering Issues for Dual-Issue & Starvation Prevention We integrate scheduling and bundling tightly, on a cycle-by-cycle basis: the combined phase finds the best schedule using latencies, issue, and resource constraints while simultaneously satisfying the dual-issue and instruction-starvation constraints by adding nops. Code (not sched) → Sched + Bundle → Code (sched & bundled). 69 ECE540 © 2008 IBM Corporation