CMPUT429/CMPE382 Winter 2001
Topic 7: Instruction Level Parallelism (Static Scheduling)
(Adapted from David A. Patterson's CS252, Spring 2001 Lecture Slides)

Recall from Pipelining Review
• Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
– Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Ideas to Reduce Stalls

  Chapter 3 (dynamic techniques)
  Technique                                  Reduces
  Dynamic scheduling                         Data hazard stalls
  Dynamic branch prediction                  Control stalls
  Issuing multiple instructions per cycle    Ideal CPI
  Speculation                                Data and control stalls
  Dynamic memory disambiguation              Data hazard stalls involving memory

  Chapter 4 (static techniques)
  Technique                                  Reduces
  Loop unrolling                             Control hazard stalls
  Basic compiler pipeline scheduling         Data hazard stalls
  Compiler dependence analysis               Ideal CPI and data hazard stalls
  Software pipelining and trace scheduling   Ideal CPI and data hazard stalls
  Compiler speculation                       Ideal CPI, data and control stalls

Instruction-Level Parallelism (ILP)
• The ILP available within a Basic Block (BB) is quite small
– BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
– If one instruction of the basic block is executed, then all the instructions in the basic block must be executed
– Average dynamic branch frequency of 15% to 25% => 4 to 7 instructions execute between a pair of branches
– Plus, instructions in a BB are likely to depend on each other
• To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
• Simplest: loop-level parallelism, which exploits parallelism among the iterations of a loop
– Vector processing is one way
– If not vector, then either dynamic execution via branch prediction or static scheduling via loop unrolling by the compiler

Data Dependence and Hazards
• InstrJ is data dependent on InstrI if InstrJ tries to read an operand before InstrI writes it:
    I: add r1,r2,r3
    J: sub r4,r1,r3
• or InstrJ is data dependent on InstrK, which is dependent on InstrI
• Caused by a "True Dependence" (compiler term)
• If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

Data Dependence and Hazards
• Dependences are a property of programs
• The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
• Importance of data dependences:
1) indicates the possibility of a hazard
2) determines the order in which results must be calculated
3) sets an upper bound on how much parallelism can possibly be exploited
• Today we look at schemes to avoid these hazards

Name Dependence #1: Anti-dependence
• Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; there are 2 versions of name dependence
• InstrJ writes an operand before InstrI reads it:
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
• If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
Name Dependence #2: Output Dependence
• InstrJ writes an operand before InstrI writes it:
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
• Called an "output dependence" by compiler writers. This also results from the reuse of the name "r1"
• If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard

ILP and Data Hazards
• HW/SW must preserve program order: the order in which instructions would execute if executed sequentially one at a time, as determined by the original source program
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
• Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so that the instructions do not conflict
– Register renaming resolves name dependences for registers
– Done either by the compiler or by HW

Control Dependencies
• Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
    if p1 {
      S1;
    };
    if p2 {
      S2;
    }
• S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1

Control Dependence Ignored
• Control dependence need not be preserved
– we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
• Instead, the 2 properties critical to program correctness are exception behavior and data flow

Exception Behavior
• Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
• Example:
    DADDU R2,R3,R4
    BEQZ  R2,L1
    LW    R1,0(R2)
    L1:
• Problem with moving LW before BEQZ?
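To see the problem at the source level, here is a minimal C sketch of the example above (the function and variable names are invented for illustration): R2 holds an address that is tested against zero before being dereferenced, so hoisting the load above the test can raise an exception the original program never raises.

    /* Hypothetical C rendering of the DADDU/BEQZ/LW sequence. */
    long guarded_load(long r3, long r4) {
        long *r2 = (long *)(r3 + r4);   /* DADDU R2,R3,R4 */
        if (r2 == 0)                    /* BEQZ  R2,L1    */
            return 0;                   /* L1:            */
        return *r2;                     /* LW    R1,0(R2) */
    }
    /* Moving the load *r2 above the test would dereference address 0
       whenever r3 + r4 == 0, a new exception the original program
       could never raise. */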
Static Branch Prediction
• Simplest: predict taken
– average misprediction rate = the untaken branch frequency, which for the SPEC programs is 34%
– Unfortunately, the misprediction rate ranges from not very accurate (59%) to highly accurate (9%)
• Predict on the basis of branch direction?
– backward-going branches predicted taken (loops)
– forward-going branches predicted not taken (ifs)
– In the SPEC programs, however, most forward-going branches are taken => predict taken is better
• Predict branches on the basis of profile information collected from earlier runs
– Misprediction varies from 5% to 22%

Running Example
• This code adds a scalar to a vector:
    for (i=1000; i>0; i=i-1)
      x[i] = x[i] + s;
• Assume the following latencies for all examples:

  Instruction producing result  Instruction using result  Execution in cycles  Latency in cycles
  FP ALU op                     Another FP ALU op         4                    3
  FP ALU op                     Store double              3                    2
  Load double                   FP ALU op                 1                    1
  Load double                   Store double              1                    0
  Integer op                    Integer op                1                    0

FP Loop: Where are the Hazards?
• First translate into MIPS code (to simplify, assume 8 is the lowest address):

    Loop: L.D    F0,0(R1)   ;F0=vector element
          ADD.D  F4,F0,F2   ;add scalar in F2
          S.D    0(R1),F4   ;store result
          DSUBUI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ   R1,Loop    ;branch R1!=zero
          NOP               ;delayed branch slot

FP Loop Showing Stalls

    1 Loop: L.D    F0,0(R1)   ;F0=vector element
    2       stall
    3       ADD.D  F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       S.D    0(R1),F4   ;store result
    7       DSUBUI R1,R1,8    ;decrement pointer 8B (DW)
    8       BNEZ   R1,Loop    ;branch R1!=zero
    9       stall             ;delayed branch slot

  Instruction producing result  Instruction using result  Latency in clock cycles
  FP ALU op                     Another FP ALU op         3
  FP ALU op                     Store double              2
  Load double                   FP ALU op                 1

• 9 clock cycles: can we rewrite the code to minimize stalls?

Revised FP Loop Minimizing Stalls

    1 Loop: L.D    F0,0(R1)
    2       stall
    3       ADD.D  F4,F0,F2
    4       DSUBUI R1,R1,8
    5       BNEZ   R1,Loop    ;delayed branch
    6       S.D    8(R1),F4   ;altered when moved past DSUBUI

• Swap BNEZ and S.D by changing the address of S.D
• 6 clock cycles, but just 3 are for execution and 3 are loop overhead; how do we make it faster?

Unroll Loop Four Times (straightforward way)

    1  Loop: L.D    F0,0(R1)
    3         ADD.D  F4,F0,F2
    6         S.D    0(R1),F4     ;drop DSUBUI & BNEZ
    7         L.D    F6,-8(R1)
    9         ADD.D  F8,F6,F2
    12        S.D    -8(R1),F8    ;drop DSUBUI & BNEZ
    13        L.D    F10,-16(R1)
    15        ADD.D  F12,F10,F2
    18        S.D    -16(R1),F12  ;drop DSUBUI & BNEZ
    19        L.D    F14,-24(R1)
    21        ADD.D  F16,F14,F2
    24        S.D    -24(R1),F16
    25        DSUBUI R1,R1,#32    ;alter to 4*8
    26        BNEZ   R1,LOOP
    27        NOP

  (1 cycle of stall after each L.D, 2 cycles of stall after each ADD.D)

• 27 clock cycles, or 6.8 per iteration; can we rewrite the loop to minimize stalls?
• Assumes the number of loop iterations is a multiple of 4

Unrolled Loop Detail
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops (see the C sketch below):
– the 1st executes (n mod k) times and has a body that is the original loop
– the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
– For large values of n, most of the execution time will be spent in the unrolled loop
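A minimal C sketch of this pair of loops for the running example, with unroll factor k = 4 (the function name is invented; the slides show only the MIPS version):

    /* x[i] = x[i] + s with a cleanup loop followed by an unrolled loop. */
    void add_scalar(double *x, long n, double s) {
        long i;
        /* 1st loop: executes (n mod 4) times; body is the original loop */
        for (i = 0; i < n % 4; i++)
            x[i] = x[i] + s;
        /* 2nd loop: the unrolled body, iterating n/4 times */
        for (; i < n; i += 4) {
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
    }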
Unrolled Loop That Minimizes Stalls

    1  Loop: L.D    F0,0(R1)
    2         L.D    F6,-8(R1)
    3         L.D    F10,-16(R1)
    4         L.D    F14,-24(R1)
    5         ADD.D  F4,F0,F2
    6         ADD.D  F8,F6,F2
    7         ADD.D  F12,F10,F2
    8         ADD.D  F16,F14,F2
    9         S.D    0(R1),F4
    10        S.D    -8(R1),F8
    11        S.D    -16(R1),F12
    12        DSUBUI R1,R1,#32
    13        BNEZ   R1,LOOP
    14        S.D    8(R1),F16   ;8-32 = -24

• 14 clock cycles, or 3.5 per iteration
• What assumptions were made when we moved the code?
– OK to move the store past DSUBUI even though DSUBUI changes the register the store uses
– OK to move loads before stores: do we get the right data?
– When is it safe for the compiler to make such changes?

Compiler Perspectives on Code Movement
• The compiler is concerned about dependencies in the program; whether a hardware hazard actually exists depends on the pipeline
• Try to schedule code to avoid hazards that cause performance losses
• (True) data dependencies (RAW):
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i
• If dependent, the instructions can't execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory (the "memory disambiguation" problem):
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Where are the name dependencies?

    1  Loop: L.D    F0,0(R1)
    3         ADD.D  F4,F0,F2
    6         S.D    0(R1),F4
    7         L.D    F0,-8(R1)
    9         ADD.D  F4,F0,F2
    12        S.D    -8(R1),F4
    13        L.D    F0,-16(R1)
    15        ADD.D  F4,F0,F2
    18        S.D    -16(R1),F4
    19        L.D    F0,-24(R1)
    21        ADD.D  F4,F0,F2
    24        S.D    -24(R1),F4
    25        DSUBUI R1,R1,#32
    26        BNEZ   R1,LOOP
    27        NOP

• Each unrolled copy reuses F0 and F4, creating anti- and output dependences between the copies. How can we remove them?

Where are the name dependencies? (after renaming)

    1  Loop: L.D    F0,0(R1)
    3         ADD.D  F4,F0,F2
    6         S.D    0(R1),F4
    7         L.D    F6,-8(R1)
    9         ADD.D  F8,F6,F2
    12        S.D    -8(R1),F8
    13        L.D    F10,-16(R1)
    15        ADD.D  F12,F10,F2
    18        S.D    -16(R1),F12
    19        L.D    F14,-24(R1)
    21        ADD.D  F16,F14,F2
    24        S.D    -24(R1),F16
    25        DSUBUI R1,R1,#32
    26        BNEZ   R1,LOOP
    27        NOP

• The original "register renaming": giving each copy fresh registers removes the name dependencies

Compiler Perspectives on Code Movement
• Name dependencies are hard to discover for memory accesses
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?
• Our example required the compiler to know that if R1 doesn't change, then
    0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
  so there were no dependencies between some loads and stores, and they could be moved past each other
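At the source level, the memory disambiguation question is a pointer aliasing question. A small hypothetical C sketch (not from the slides):

    /* May the store and the load touch the same location? Unless the
       compiler can prove p + 25 and q + 5 never alias (the analogue of
       asking whether 100(R4) = 20(R6)), it must not move the load
       above the store. */
    double reorder_example(double *p, double *q) {
        p[25] = 1.0;     /* store, like a store to 100(R4) */
        return q[5];     /* load, like a load from 20(R6)  */
    }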
Steps Compiler Performs to Unroll
• Check if it is OK to move the S.D after DSUBUI and BNEZ, and find the amount by which to adjust the S.D offset
• Determine that unrolling the loop is useful by finding that the loop iterations are independent
• Rename registers to avoid name dependencies
• Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
• Determine that the loads and stores in the unrolled loop can be interchanged, because loads and stores from different iterations are independent
– requires analyzing memory addresses and finding that they do not refer to the same address
• Schedule the code, preserving any dependencies needed to yield the same result as the original code

When Is a Loop Parallel?
• Example: where are the data dependencies? (Assume that A, B, C are distinct and non-overlapping)
    for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
    }
1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2: it uses the value B[i] computed by S2 in the previous iteration. This is a "loop-carried dependence".
• In the dependence graph, S1 and S2 each have a self-edge with distance 1, and there is a distance-0 edge from S1 to S2. Typically, when the loop dependence graph has a cycle with dependence distance 1, the loop is not parallel.

How to Find Dependences?
• One way to find dependences is to unroll the loop and find the RAW, WAR, and WAW dependences in the unrolled loop.
• Example: where are the data dependencies? (Assume that A, B, C are distinct and non-overlapping)
    for (i=0; i<100; i=i+1) {
      A[i+1] = A[i] + C[i];    /* S1 */
      B[i+1] = B[i] + A[i+1];  /* S2 */
    }
  Unrolled:
    A[1] = A[0] + C[0];  /* S1, iteration i=0 */
    B[1] = B[0] + A[1];  /* S2, iteration i=0 */
    A[2] = A[1] + C[1];  /* S1, iteration i=1 */
    B[2] = B[1] + A[2];  /* S2, iteration i=1 */
    A[3] = A[2] + C[2];  /* S1, iteration i=2 */
    B[3] = B[2] + A[3];  /* S2, iteration i=2 */

When Is a Loop Parallel?
• Example: how about this loop? Is it parallel? (Assume A, B, C, and D are distinct and non-overlapping)
    for (i=1; i<100; i=i+1) {
      A[i]   = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
    }

(Four build slides then step through the array picture for the earlier A/B/C loop, highlighting S1's reads of A[i] and C[i] and write of A[i+1], and S2's reads of B[i] and A[i+1] and write of B[i+1].)

When Is a Loop Parallel? (cont.)
• Back to the A/B/C/D loop:
    for (i=1; i<=100; i=i+1) {
      A[i]   = A[i] + B[i];    /* S1 */
      B[i+1] = C[i] + D[i];    /* S2 */
    }
• The only cycle in the dependence graph has a dependence distance of zero. Thus we should be able to parallelize the loop.
• But the loop has a loop-carried dependence (S2 writes B[i+1], which S1 reads in iteration i+1), so it seems that iteration i+1 cannot execute until iteration i finishes.

Loop Parallelization
• A good compiler will perform the following code transformation:
    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
      B[i+1] = C[i] + D[i];
      A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
• Now there are no more loop-carried dependences, and all iterations can execute in parallel if the processor has enough functional units.

Hardware Support for Exposing More Parallelism at Compile Time
• Conditional or predicated instructions
• Conditional instruction execution on a 2-issue machine:

    First Instruction Slot    Second Instruction Slot
    LW   R1, 40(R2)           ADD R3, R4, R5
                              ADD R6, R3, R7
    BEQZ R10, L
    LW   R8, 0(R10)
    LW   R9, 0(R8)

• A slot is wasted, since the 3rd LW is dependent on the result of the 2nd LW

Hardware Support for Exposing More Parallelism at Compile Time
• Use a predicated version of load word (LWC)?
– the load occurs unless the third operand is 0

    First Instruction Slot    Second Instruction Slot
    LW   R1, 40(R2)           ADD R3, R4, R5
    LWC  R8, 0(R10), R10      ADD R6, R3, R7
    BEQZ R10, L
    LW   R9, 0(R8)

• If the sequence following the branch were short, the entire block of code might be converted to predicated execution and the branch eliminated (see the C sketch below)

Exception Behavior Support
• Several mechanisms ensure that speculation by the compiler does not violate exception behavior
– For example, a predicated instruction that is squashed cannot raise an exception
– Prefetch does not cause exceptions
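At the source level, predication (if-conversion) replaces a control dependence with a data dependence. A hypothetical C sketch of what the LWC example achieves (function names invented; C has no way to write a squashed load directly, so the conditional expression is the closest source-level analogue):

    /* Branching version: the load is control dependent on the test and
       cannot easily be moved up to fill the empty issue slot. */
    long branchy(long *r10) {
        long r8 = 0;
        if (r10 != 0)          /* BEQZ R10, L      */
            r8 = *r10;         /* LW   R8, 0(R10)  */
        return r8;
    }

    /* If-converted version: a predicating compiler can emit the select
       as a guarded load (LWC) that is squashed, with no effect and no
       exception, when r10 == 0, so it can move above the branch. */
    long predicated(long *r10) {
        return (r10 != 0) ? *r10 : 0;
    }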
What if We Could Change the Instruction Set?
• Superscalar processors decide on the fly how many instructions to issue
– the HW complexity of determining the number of instructions to issue is O(n^2)
• Why not allow the compiler to schedule instruction-level parallelism explicitly?
• Format the instructions in a potential issue packet so that the HW need not check explicitly for dependences

VLIW: Very Long Instruction Word
• Each "instruction" has explicit coding for multiple operations
– In IA-64, the grouping is called a "bundle"
– In Transmeta, the grouping is called a "molecule" (with "atoms" as operations)
• Tradeoff: instruction space for simple decoding
– The long instruction word has room for many operations
– By definition, all the operations the compiler puts in the long instruction word are independent => they can execute in parallel
– E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  » 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
• Needs a compiling technique that schedules across several branches

Example of a VLIW Architecture: IA-64
Suggested reading: Intel IA-64 Architecture Software Developer's Manual, Chapters 8, 9

IA-64 Instruction Group
An instruction group is a set of instructions that have no read after write (RAW) or write after write (WAW) register dependencies. Consecutive instruction groups are separated by stops (represented by a double semicolon in the assembly code).

    ld8 r1=[r5]       // First group
    sub r6=r8, r9     // First group
    add r3=r2,r4 ;;   // First group
    st8 [r6]=r12      // Second group

Instruction Bundles
Instructions are organized in bundles of three instructions, with the following 128-bit format:

    127               87 86               46 45               5 4          0
    | instruction slot 2 | instruction slot 1 | instruction slot 0 | template |
          41 bits              41 bits              41 bits          5 bits

    Instruction Type   Description        Execution Unit Type
    A                  Integer ALU        I-unit or M-unit
    I                  Non-ALU integer    I-unit
    M                  Memory             M-unit
    F                  Floating-point     F-unit
    B                  Branch             B-unit
    L+X                Extended           I-unit/B-unit

Bundles
In assembly, each 128-bit bundle is enclosed in curly braces and contains a template specification:

    { .mii
      ld4 r28=[r8]     // Load a 4-byte value
      add r9=2,r1      // 2+r1 and put in r9
      add r30=1,r1     // 1+r1 and put in r30
    }

An instruction group can extend over an arbitrary number of bundles.

Templates
There are restrictions on the types of instructions that can be bundled together. The IA-64 has five slot types (M, I, F, B, and L), six instruction types (M, I, A, F, B, L), and twelve basic template types (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, and MFB). The underscore in the template acronym indicates a stop. Every basic template type has two versions: one with a stop at the end of the bundle and one without.
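A small C sketch of how the 128-bit bundle layout described above decomposes (the struct and helper functions are invented for illustration and are not an Intel API; the bundle is viewed as two little-endian 64-bit words):

    #include <stdint.h>

    /* Bits 0-4: template; 5-45: slot 0; 46-86: slot 1; 87-127: slot 2. */
    typedef struct { uint64_t lo, hi; } bundle128;

    #define SLOT_MASK ((1ULL << 41) - 1)

    static uint64_t template_of(bundle128 b) { return b.lo & 0x1F; }
    static uint64_t slot0(bundle128 b) { return (b.lo >> 5) & SLOT_MASK; }
    static uint64_t slot1(bundle128 b) {
        /* slot 1 straddles the word boundary: 18 bits from lo, 23 from hi */
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    }
    static uint64_t slot2(bundle128 b) { return b.hi >> 23; }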
Control Dependency Preventing Code Motion
In the code below, the ld4 is control dependent on the branch (the branch ends one basic block and the load starts another), and thus the load cannot safely be moved up in conventional processor architectures.

    add     r7=r6,1               // cycle 0
    add     r13=r25, r27
    cmp.eq  p1, p2=r12, r23
    (p1) br.cond some_label ;;
    ld4     r2=[r3] ;;            // cycle 1
    sub     r4=r2, r11            // cycle 3

Control Speculation
In the following code, suppose a load latency of two cycles:

    (p1) br.cond.dptk L1          // cycle 0
    ld8  r3=[r5] ;;               // cycle 1
    shr  r7=r3,r87                // cycle 3

However, if we execute the load before we know whether we actually have to do it (control speculation), we get:

    ld8.s r3=[r5]                 // earlier cycle
    // other, unrelated instructions
    (p1) br.cond.dptk L1 ;;       // cycle 0
    chk.s r3, recovery            // cycle 1
    shr   r7=r3,r87               // cycle 1

The ld8.s instruction is a speculative load, and the chk.s instruction is a check instruction that verifies whether the value loaded is still good.

Ambiguous Memory Dependencies
An ambiguous memory dependency is a dependence between a load and a store, or between two stores, where it cannot be determined whether the instructions involved access overlapping memory locations. Two or more memory references are independent if it is known that they access non-overlapping memory locations.

Data Speculation
An advanced load allows a load to be moved above a store even if it is not known whether the load and the store may reference overlapping memory locations.

    st8 [r55]=r45                 // cycle 0
    ld8 r3=[r5] ;;                // cycle 0
    shr r7=r3,r87                 // cycle 2

becomes:

    ld8.a r3=[r5] ;;              // advanced load, earlier cycle
    // other, unrelated instructions
    st8   [r55]=r45               // cycle 0
    ld8.c r3=[r5] ;;              // cycle 0 - check
    shr   r7=r3,r87               // cycle 0

Moving Up Loads + Uses: Recovery Code

Original code:

    st8 [r4] = r12                // cycle 0: ambiguous store
    ld8 r6 = [r8] ;;              // cycle 0: load to advance
    add r5 = r6,r7                // cycle 2
    st8 [r18] = r5                // cycle 3

Speculative code:

    ld8.a r6 = [r8] ;;            // cycle -3
    // other, unrelated instructions
    add   r5 = r6,r7              // cycle -1; add that uses r6
    // other, unrelated instructions
    st8   [r4]=r12                // cycle 0
    chk.a r6, recover             // cycle 0: check
    back:                         // return point from jump to recover
    st8   [r18] = r5              // cycle 0

    recover:
    ld8 r6 = [r8] ;;              // reload r6 from [r8]
    add r5 = r6,r7                // re-execute the add
    br  back                      // jump back to main code

ld.c, chk.a and the ALAT
The execution of an advanced load, ld.a, creates an entry in a hardware structure, the Advanced Load Address Table (ALAT). This table is indexed by the register number. Each entry records the load address, the load type, and the size of the load. When a check is executed, the entry for the register is checked to verify that a valid entry with the specified type is there.

Entries are removed from the ALAT when:
(1) a store overlaps with the memory locations specified in the ALAT entry;
(2) another advanced load to the same register is executed;
(3) there is a context switch caused by the operating system (or hardware);
(4) capacity limitations of the ALAT implementation require reuse of the entry.
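A rough C model of what the advanced-load/check pair buys (entirely illustrative; the ALAT is a hardware structure with no C API, and this sketch covers only the simple case of same-size, exactly overlapping accesses):

    /* Hoist a load above a store that may alias, then verify and redo
       the load only when the store actually conflicted. */
    long data_speculate(long *load_addr, long *store_addr, long value) {
        long r3 = *load_addr;         /* ld8.a: advanced (early) load    */
        /* ... other, unrelated work can be scheduled here ...           */
        *store_addr = value;          /* st8: possibly conflicting store */
        if (store_addr == load_addr)  /* ld8.c: the ALAT does this check */
            r3 = *load_addr;          /* conflict: redo the load         */
        return r3;
    }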
Not a Thing (NaT)
The IA-64 has 128 general-purpose registers, each with 64+1 bits, and 128 floating-point registers, each with 82 bits. The extra bit in the GPRs is the NaT bit, which indicates that the content of the register is not valid. NaT=1 indicates that an instruction that generated an exception wrote to the register. It is a way to defer exceptions caused by speculative loads. Any operation that uses NaT as an operand results in NaT.

If-conversion
If-conversion uses predicates to transform conditional code into a single control stream.

    if(r4) {                        cmp.ne   p1, p0=r4, 0 ;;   // set predicate reg
      add r1= r2, r3        =>      (p1) add r1=r2, r3
      ld8 r6=[r5]                   (p1) ld8 r6=[r5]
    }

    if(r1)                          cmp.ne   p1, p2=r1, 0 ;;   // set predicate regs
      r2 = r3 + r4          =>      (p1) add r2 = r3, r4
    else                            (p2) sub r7 = r6, r5
      r7 = r6 - r5

Trace Scheduling
• Two steps:
– Trace selection
  » Find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted sequence of straight-line code
– Trace compaction
  » Squeeze the trace into a few VLIW instructions
  » Need bookkeeping code in case the prediction is wrong (see the sketch below)
• This is a form of compiler-generated speculation
– The compiler must generate recovery code to handle the cases in which execution does not go according to the speculation
– Needs extra registers: undo bad guesses by discarding
• Subtle compiler bugs may result in wrong answers: there are no hardware interlocks
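A minimal C-level sketch of trace compaction with bookkeeping (the example is invented): an operation from below a join point is hoisted above an unlikely branch so it can pack into an earlier long instruction word, and the off-trace path gets compensation code.

    /* Before: the multiply sits below the join of both paths. */
    long before(long b, int rare, long alt) {
        if (rare)           /* off-trace path, per profile */
            b = alt;
        return b * 2;       /* c = b*2 */
    }

    /* After: the multiply is hoisted onto the trace, above the branch.
       The rare path must recompute it with the updated b; that
       recomputation is the bookkeeping (compensation) code. */
    long after(long b, int rare, long alt) {
        long c = b * 2;     /* hoisted from below the join */
        if (rare) {
            b = alt;
            c = b * 2;      /* compensation: redo the multiply */
        }
        return c;
    }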
Superscalar v. VLIW
• Superscalar:
– Smaller code size
– Binary compatibility across generations of hardware
• VLIW:
– Simplified hardware for decoding and issuing instructions
– No interlock hardware (compiler checks?)
– More registers, but simplified hardware for register ports (multiple independent register files?)

Problems with First Generation VLIW
• Increase in code size
– generating enough operations in a straight-line code fragment requires ambitiously unrolling loops
– whenever VLIW instructions are not full, the unused functional units translate into wasted bits in the instruction encoding
• Operated in lock-step; no hazard detection HW
– a stall in any functional unit pipeline caused the entire processor to stall, since all functional units must be kept synchronized
– The compiler might predict the functional units, but cache behavior is hard to predict
• Binary code compatibility
– Pure VLIW => different numbers of functional units and different unit latencies require different versions of the code

Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• IA-64: the instruction set architecture; EPIC is the style
– EPIC = 2nd generation VLIW?
• Itanium™ is the name of the first implementation (2001)
– Highly parallel and deeply pipelined hardware at 800 MHz
– 6-wide, 10-stage pipeline at 800 MHz in a 0.18 µ process
• 128 64-bit integer registers + 128 82-bit floating-point registers
– Not separate register files per functional unit as in old VLIW
• Hardware checks dependencies (interlocks => binary compatibility over time)
• Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer mispredictions?

IA-64 Registers
• The integer registers are configured to help accelerate procedure calls using a register stack
– a mechanism similar to that developed in the Berkeley RISC-I processor and used in the SPARC architecture
– Registers 0-31 are always accessible and are addressed as 0-31
– Registers 32-127 are used as a register stack, and each procedure is allocated a set of registers (from 0 to 96 of them)
– A new register stack frame is created for a called procedure by renaming the registers in hardware;
– a special register called the current frame pointer (CFM) points to the set of registers to be used by a given procedure
• 8 64-bit branch registers are used to hold branch destination addresses for indirect branches
• 64 1-bit predicate registers

IA-64 Registers
• Both the integer and floating-point registers support register rotation for registers 32-127.
• Register rotation is designed to ease the task of allocating registers in software-pipelined loops
• When combined with predication, it makes it possible to avoid the need for unrolling and for separate prologue and epilogue code for a software-pipelined loop (the sketch below shows the structure being avoided)
– this makes SW pipelining usable for loops with smaller numbers of iterations, where the overheads would traditionally negate many of the advantages
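For reference, here is the prologue/kernel/epilogue structure of a software-pipelined loop that rotation and predication let the compiler avoid spelling out. A hypothetical C sketch of a 2-stage pipeline for the running example, with the load scheduled one iteration ahead of the add and store:

    void add_scalar_swp(double *x, long n, double s) {
        if (n <= 0) return;
        double cur = x[0];                  /* prologue: fill the pipe     */
        for (long i = 0; i < n - 1; i++) {  /* kernel: steady state        */
            double next = x[i + 1];         /* load for iteration i+1      */
            x[i] = cur + s;                 /* add + store for iteration i */
            cur = next;
        }
        x[n - 1] = cur + s;                 /* epilogue: drain the pipe    */
    }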
Intel/HP IA-64 "Explicitly Parallel Instruction Computer (EPIC)"
• Instruction group: a sequence of consecutive instructions with no register data dependences
– All the instructions in a group could be executed in parallel if sufficient hardware resources existed and if any dependences through memory were preserved
– An instruction group can be arbitrarily long, but the compiler must explicitly indicate the boundary between one instruction group and another by placing a stop between 2 instructions that belong to different groups
• IA-64 instructions are encoded in bundles, which are 128 bits wide
– Each bundle consists of a 5-bit template field and 3 instructions, each 41 bits in length
• 3 instructions in 128-bit "groups"; the template field determines whether the instructions are dependent or independent
– Smaller code size than old VLIW, larger than x86/RISC
– Groups can be linked to show independence across more than 3 instructions

5 Types of Execution in a Bundle

    Execution Unit  Instruction Slot Type  Description      Example Instructions
    I-unit          A                      Integer ALU      add, subtract, and, or, cmp
                    I                      Non-ALU integer  shifts, bit tests, moves
    M-unit          A                      Integer ALU      add, subtract, and, or, cmp
                    M                      Memory access    loads, stores for int/FP regs
    F-unit          F                      Floating point   floating-point instructions
    B-unit          B                      Branches         conditional branches, calls
    L+X             L+X                    Extended         extended immediates, stops

• The 5-bit template field within each bundle describes both the presence of any stops associated with the bundle and the execution unit type required by each instruction within the bundle (see Fig 4.12, page 271)

Itanium™ Processor Silicon (Copyright: Intel at Hotchips '00)
(Die photo: the core processor die contains the IA-32 control, FPU, IA-64 control, integer units, instruction fetch and decode, caches, TLB, and bus logic, next to 4 x 1MB of L3 cache)

Itanium™ Machine Characteristics (Copyright: Intel at Hotchips '00)

    Frequency                800 MHz
    Transistor count         25.4M CPU; 295M L3
    Process                  0.18u CMOS, 6 metal layers
    Package                  Organic Land Grid Array
    Machine width            6 insts/clock (4 ALU/MM, 2 Ld/St, 2 FP, 3 Br)
    Registers                14-ported 128 GR & 128 FR; 64 predicates
    Speculation              32-entry ALAT, exception deferral
    Branch prediction        Multilevel 4-stage prediction hierarchy
    FP compute bandwidth     3.2 GFlops (DP/EP); 6.4 GFlops (SP)
    Memory -> FP bandwidth   4 DP (8 SP) operands/clock
    Virtual memory support   64-entry ITLB, 32/96 2-level DTLB, VHPT
    L2/L1 cache              Dual-ported 96K unified & 16KD; 16KI
    L2/L1 latency            6 / 2 clocks
    L3 cache                 4MB, 4-way set associative, BW of 12.8 GB/sec
    System bus               2.1 GB/sec; 4-way glueless MP; scalable to large (512+ proc) systems

Itanium™ EPIC Design Maximizes SW-HW Synergy (Copyright: Intel at Hotchips '00)
(Diagram: architecture features programmed by the compiler, namely branch hints, explicit parallelism, the register stack and rotation, data and control speculation, predication, and memory hints, map onto micro-architecture features in hardware: fetch via the instruction cache and branch predictors; a fast, simple 6-issue pipeline; register handling via 128 GR and 128 FR with register remap and a stack engine; parallel resources of 4 integer + 4 MMX units, 2 FMACs (4 for SSE), and 2 LD/ST units with bypasses and dependency control; speculation deferral management with a 32-entry ALAT; and a memory subsystem with three levels of cache: L1, L2, and L3)

10-Stage In-Order Core Pipeline (Copyright: Intel at Hotchips '00)
(Pipeline diagram, grouped into four parts:
    Front end: IPG (instruction pointer generation), FET (fetch), ROT (rotate); pre-fetch/fetch of up to 6 instructions/cycle, a hierarchy of branch predictors, and a decoupling buffer
    Instruction delivery: EXP (expand), REN (rename); dispersal of up to 6 instructions onto 9 ports, register remapping, and the register stack engine
    Operand delivery: WL.D (word-line decode), REG (register read); register read plus bypasses, a register scoreboard, and predicated dependencies
    Execution: EXE (execute), DET (exception detect), WRB (write-back); 4 single-cycle ALUs, 2 ld/st units, advanced load control, predicate delivery and branch, and NaT/exception/retirement)

Itanium Processor 10-Stage Pipeline
• Front end (stages IPG, Fetch, and Rotate): prefetches up to 32 bytes per clock (2 bundles) into a prefetch buffer, which can hold up to 8 bundles (24 instructions)
– Branch prediction is done using a multilevel adaptive predictor like that of the P6 microarchitecture
• Instruction delivery (stages EXP and REN): distributes up to 6 instructions to the 9 functional units
– Implements register renaming for both rotation and register stacking
Itanium Processor 10-Stage Pipeline (cont.)
• Operand delivery (WLD and REG): accesses the register file, performs register bypassing, accesses and updates a register scoreboard, and checks predicate dependences
– The scoreboard is used to detect when individual instructions can proceed, so that a stall of 1 instruction in a bundle need not cause the entire bundle to stall
• Execution (EXE, DET, and WRB): executes instructions through the ALUs and load/store units, detects exceptions and posts NaTs, retires instructions, and performs write-back
– Deferred exception handling for speculative instructions is supported by providing the equivalent of poison bits, called NaTs (Not a Thing) for the GPRs (which makes the GPRs effectively 65 bits wide) and NaTVal (Not a Thing Value) for the FPRs (already 82 bits wide)

Comments on Itanium
• Remarkably, the Itanium has many of the features more commonly associated with dynamically-scheduled pipelines
– a strong emphasis on branch prediction, register renaming, scoreboarding, a deep pipeline with many stages before execution (to handle instruction alignment, renaming, etc.), and several stages following execution to handle exception detection
• It is surprising that an approach whose goal is to rely on compiler technology and simpler HW ends up at least as complex as dynamically scheduled processors!

Performance of IA-64 Itanium (Source: Microprocessor Report, Jan 2002)

    Processor                      SPECint2000(base)  SPECfp2000(base)
    Itanium (800 MHz)              358                703
    POWER4 (1.3 GHz)               790                1,098
    Sun UltraSPARC III (1.05 GHz)  537                701

Summary #1: Hardware versus Software Speculation Mechanisms
• To speculate extensively, we must be able to disambiguate memory references
– Much easier in HW than in SW for code with pointers
• HW-based speculation works better when control flow is unpredictable and when HW-based branch prediction is superior to SW-based branch prediction done at compile time
– Mispredictions mean wasted speculation
• HW-based speculation maintains a precise exception model even for speculated instructions
• HW-based speculation does not require compensation or bookkeeping code

Summary #2: Hardware versus Software Speculation Mechanisms (cont'd)
• Compiler-based approaches may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling
• HW-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture
– this may be the most important consideration in the long run?

Summary #3: Software Scheduling
• Instruction-level parallelism (ILP) is found either by the compiler or by the hardware
• Loop-level parallelism is the easiest to see
– SW dependencies and compiler sophistication determine whether the compiler can unroll loops
– Memory dependencies are the hardest to determine => memory disambiguation
– Very sophisticated transformations are available
• Trace scheduling to parallelize if statements
• Superscalar and VLIW: CPI < 1 (IPC > 1)
– Dynamic issue vs. static issue
– More instructions issuing at the same time => larger hazard penalty
– The limitation is often the number of instructions that can be successfully fetched and decoded per cycle