Lecture 5: Introduction to Advanced Pipelining
L.N. Bhuyan
CS 162
DAP.F96

Arithmetic Pipeline
• Floating-point operations cannot complete in one cycle during the EX stage. Allowing them more time would either lengthen the pipeline cycle time or force subsequent instructions to stall
• The solution is to break the FP EX stage into several stages whose delay matches the cycle time of the instruction pipeline
• Such an FP (arithmetic) pipeline does not reduce latency, but it can be decoupled from the integer unit and increases throughput for a sequence of FP instructions
• What is a vector instruction and/or a vector computer?

MIPS R4000 Floating Point
• FP Adder, FP Multiplier, FP Divider
• Last step of FP Multiplier/Divider uses FP Adder HW
• 8 kinds of stages in FP units:

Stage   Functional unit   Description
A       FP adder          Mantissa ADD stage
D       FP divider        Divide pipeline stage
E       FP multiplier     Exception test stage
M       FP multiplier     First stage of multiplier
N       FP multiplier     Second stage of multiplier
R       FP adder          Rounding stage
S       FP adder          Operand shift stage
U                         Unpack FP numbers

MIPS FP Pipe Stages

FP Instr         Pipe stages in clock cycles 1, 2, 3, …
Add, Subtract    U, S+A, A+R, R+S
Multiply         U, E+M, M, M, M, N, N+A, R
Divide           U, A, R, D^28, …, D+A, D+R, D+R, D+A, D+R, A, R
Square root      U, E, (A+R)^108, …, A, R
Negate           U, S
Absolute value   U, S
FP compare       U, A, R

(X^n means stage X is repeated n times.)
Stages: A = Mantissa ADD stage, D = Divide pipeline stage, E = Exception test stage, M = First stage of multiplier, N = Second stage of multiplier, R = Rounding stage, S = Operand shift stage, U = Unpack FP numbers

Pipeline with Floating point operations
• Example of FP pipeline integrated with the instruction pipeline: Fig.
A.31, A.32 and A.33 distributed in the class
• The FP pipeline consists of one integer unit with 1 stage, one FP/integer multiply unit with 7 stages, one FP adder with 4 stages, and an FP/integer divider with 24 stages
• A.31 shows the pipeline, A.32 shows execution of independent instructions, and A.33 shows the effect of data dependency
• The impact of data dependency is severe. The possibility of out-of-order execution creates different hazards, to be considered later

R4000 Performance
• Not ideal CPI of 1:
– Load stalls (1 or 2 clock cycles)
– Branch stalls (2 cycles + unfilled slots)
– FP result stalls: RAW data hazard (latency)
– FP structural stalls: Not enough FP hardware (parallelism)
[Figure: stacked bar chart, vertical axis 0 to 4.5, one bar each for tomcatv, su2cor, spice2g6, ora, nasa7, doduc, li, gcc, espresso, eqntott; bar components: Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls]

FP Loop: Where are the Hazards?

Loop:  LD    F0,0(R1)    ;F0=vector element
       ADDD  F4,F0,F2    ;add scalar from F2
       SD    0(R1),F4    ;store result
       SUBI  R1,R1,8     ;decrement pointer 8B (DW)
       BNEZ  R1,Loop     ;branch R1!=zero
       NOP               ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

• Where are the stalls?

FP Loop Showing Stalls

1 Loop:  LD    F0,0(R1)    ;F0=vector element
2        stall
3        ADDD  F4,F0,F2    ;add scalar in F2
4        stall
5        stall
6        SD    0(R1),F4    ;store result
7        SUBI  R1,R1,8     ;decrement pointer 8B (DW)
8        BNEZ  R1,Loop     ;branch R1!=zero
9        stall             ;delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

• 9 clocks: Rewrite code to minimize stalls?
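
The stall count above can be reproduced mechanically. The following is a minimal sketch, not part of the lecture: it assumes single-issue, in-order execution, uses the latency table from the slide, and assumes that any producer/consumer pair missing from the table has latency 0. The names `Instr`, `LATENCY` and `issue_cycles` are illustrative.

```python
from typing import NamedTuple, Optional, Tuple

# Stall cycles between a producer and a dependent consumer, copied from the
# latency table on the slide. Pairs not listed are ASSUMED to have latency 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


class Instr(NamedTuple):
    text: str               # assembly text, used only for printing
    kind: str               # "load", "fp_alu", "store" or "int"
    dest: Optional[str]     # register written, or None
    srcs: Tuple[str, ...]   # registers read


def issue_cycles(instrs):
    """Return the clock cycle in which each instruction issues.

    Instructions issue in program order, at most one per cycle, and wait
    until every source register's producer latency has elapsed.
    """
    producer = {}  # register -> (issue cycle, kind) of its most recent writer
    cycles = []
    clock = 0
    for ins in instrs:
        earliest = clock + 1
        for reg in ins.srcs:
            if reg in producer:
                when, kind = producer[reg]
                gap = LATENCY.get((kind, ins.kind), 0)
                earliest = max(earliest, when + 1 + gap)
        clock = earliest
        cycles.append(clock)
        if ins.dest is not None:
            producer[ins.dest] = (clock, ins.kind)
    return cycles


# The loop from the slide, with the NOP filling the delayed branch slot.
LOOP = [
    Instr("LD   F0,0(R1)", "load", "F0", ("R1",)),
    Instr("ADDD F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("SD   0(R1),F4", "store", None, ("R1", "F4")),
    Instr("SUBI R1,R1,8", "int", "R1", ("R1",)),
    Instr("BNEZ R1,Loop", "int", None, ("R1",)),
    Instr("NOP", "int", None, ()),
]

for cycle, ins in zip(issue_cycles(LOOP), LOOP):
    print(cycle, ins.text)  # the NOP issues in cycle 9, matching the slide
```

Gaps between consecutive printed cycle numbers are the stall cycles. Feeding a reordered instruction list to `issue_cycles` gives a quick way to compare candidate schedules under the same assumptions.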

Minimizing Stalls Technique 1: Compiler Optimization

1 Loop:  LD    F0,0(R1)
2        stall
3        ADDD  F4,F0,F2
4        SUBI  R1,R1,8
5        BNEZ  R1,Loop     ;delayed branch
6        SD    8(R1),F4    ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

6 clocks

Technique 2: Loop Unrolling (Software Pipelining)

1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2      ;1 cycle delay *
3        SD    0(R1),F4      ;drop SUBI & BNEZ; 2 cycles delay *
4        LD    F6,-8(R1)
5        ADDD  F8,F6,F2      ;1 cycle delay
6        SD    -8(R1),F8     ;drop SUBI & BNEZ; 2 cycles delay
7        LD    F10,-16(R1)
8        ADDD  F12,F10,F2    ;1 cycle delay
9        SD    -16(R1),F12   ;drop SUBI & BNEZ; 2 cycles delay
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F2    ;1 cycle delay
12       SD    -24(R1),F16   ;2 cycles delay
13       SUBI  R1,R1,#32     ;alter to 4*8
14       BNEZ  R1,Loop
15       NOP

* 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP operation

Minimize Stall + Loop Unrolling

1 Loop:  LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F2
6        ADDD  F8,F6,F2
7        ADDD  F12,F10,F2
8        ADDD  F16,F14,F2
9        SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,Loop       ;delayed branch
14       SD    8(R1),F16     ;8-32 = -24

14 clock cycles, or 3.5 per iteration

• What assumptions were made when the code was moved?
– OK to move the store past SUBI even though SUBI changes the register
– OK to move loads before stores: do they get the right data?
– When is it safe for the compiler to make such changes?

When is it safe to move instructions?

Compiler Perspectives on Code Movement
• Definitions: the compiler is concerned about dependencies in the program; whether a dependence becomes a HW hazard depends on the given pipeline
• Try to schedule to avoid hazards
• (True) Data dependencies (RAW if a hazard for HW)
– Instruction i produces a result used by instruction j, or
– Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
• If dependent, can’t execute in parallel
• Easy to determine for registers (fixed names)
• Hard for memory:
– Does 100(R4) = 20(R6)?
– From different loop iterations, does 20(R6) = 20(R6)?

Compiler Perspectives on Code Movement
• Another kind of dependence is called a name dependence: two instructions use the same name (register or memory location) but don’t exchange data
• Antidependence (WAR if a hazard for HW)
– Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first
• Output dependence (WAW if a hazard for HW)
– Instruction i and instruction j write the same register or memory location; ordering between the instructions must be preserved.

EXAMPLE
I1. Load R1, A     /R1 ← Memory(A)/
I2. Add  R2, R1    /R2 ← (R2)+(R1)/
I3. Add  R3, R4    /R3 ← (R3)+(R4)/
I4. Mul  R4, R5    /R4 ← (R4)*(R5)/
I5. Comp R6        /R6 ← Not(R6)/
I6. Mul  R6, R7    /R6 ← (R6)*(R7)/

Dependence graph (edges follow program order):
I1 → I2   RAW           Flow dependence
I3 → I4   WAR           Antidependence
I5 → I6   WAW and RAW   Output dependence, also flow dependence

Due to superscalar processing, it is possible that I4 completes before I3 starts. Similarly, the value of R6 depends on when I5 and I6 begin and end. Unpredictable result!

A sample program and its dependence graph, where I2 and I3 share the adder and I4 and I6 share the same multiplier. The two resource dependences (shared adder, shared multiplier) can be removed by duplicating the resources or by pipelining the adders and multipliers.

Register Renaming
Rewrite the previous program as:
• I1. R1b ← Memory(A)
• I2. R2b ← (R2a) + (R1b)
• I3. R3b ← (R3a) + (R4a)
• I4. R4b ← (R4a) * (R5a)
• I5. R6b ← -(R6a)
• I6. R6c ← (R6b) * (R7a)
Allocate more registers and rename the registers that do not really carry a flow dependency. The WAR hazard between I3 and I4 and the WAW hazard between I5 and I6 have been removed. These two hazards are also called name dependencies.

Where are the name dependencies?

1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SD    0(R1),F4      ;drop SUBI & BNEZ
4        LD    F0,-8(R1)
5        ADDD  F4,F0,F2
6        SD    -8(R1),F4     ;drop SUBI & BNEZ
7        LD    F0,-16(R1)
8        ADDD  F4,F0,F2
9        SD    -16(R1),F4    ;drop SUBI & BNEZ
10       LD    F0,-24(R1)
11       ADDD  F4,F0,F2
12       SD    -24(R1),F4
13       SUBI  R1,R1,#32     ;alter to 4*8
14       BNEZ  R1,Loop
15       NOP

How can we remove them?

Detailed Scoreboard Pipeline Control

Instruction status: Issue
  Wait until:  Not Busy(FU) and not Result(D)
  Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← `D’; Fj(FU) ← `S1’; Fk(FU) ← `S2’; Qj ← Result(`S1’); Qk ← Result(`S2’); Rj ← not Qj; Rk ← not Qk; Result(`D’) ← FU

Instruction status: Read operands
  Wait until:  Rj and Rk
  Bookkeeping: Rj ← No; Rk ← No

Instruction status: Execution complete
  Wait until:  Functional unit done

Instruction status: Write result
  Wait until:  ∀f((Fj(f) ≠ Fi(FU) or Rj(f) = No) & (Fk(f) ≠ Fi(FU) or Rk(f) = No))
  Bookkeeping: ∀f(if Qj(f) = FU then Rj(f) ← Yes); ∀f(if Qk(f) = FU then Rk(f) ← Yes); Result(Fi(FU)) ← 0; Busy(FU) ← No
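
The scoreboard conditions above can be read as guard/update pairs. The following is a minimal sketch of that bookkeeping, not the lecture's implementation: the class and method names (`FUStatus`, `Scoreboard`, `can_issue`, and so on) are hypothetical, execution time is not modeled, and clearing Qj/Qk at write-back is a detail added here so that a stale producer name is never matched twice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # first source register
    fk: Optional[str] = None  # second source register
    qj: Optional[str] = None  # FU producing Fj (None: no pending producer)
    qk: Optional[str] = None  # FU producing Fk
    rj: bool = False          # Yes: Fj is ready and has not been read yet
    rk: bool = False          # Yes: Fk is ready and has not been read yet


class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FUStatus() for name in fu_names}
        self.result = {}  # register -> FU that will write it

    def can_issue(self, fu, d):
        # Wait until: Not Busy(FU) and not Result(D)  (structural and WAW check)
        return not self.fu[fu].busy and self.result.get(d) is None

    def issue(self, fu, op, d, s1, s2):
        assert self.can_issue(fu, d)
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(busy=True, op=op, fi=d, fj=s1, fk=s2,
                               qj=qj, qk=qk, rj=qj is None, rk=qk is None)
        self.result[d] = fu

    def can_read_operands(self, fu):
        # Wait until: Rj and Rk  (RAW check)
        return self.fu[fu].rj and self.fu[fu].rk

    def read_operands(self, fu):
        assert self.can_read_operands(fu)
        self.fu[fu].rj = self.fu[fu].rk = False

    def can_write_result(self, fu):
        # Wait until no unit still has to read the old value of Fi(FU)  (WAR check)
        fi = self.fu[fu].fi
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, fu):
        assert self.can_write_result(fu)
        for f in self.fu.values():
            if f.qj == fu:
                f.rj, f.qj = True, None  # clearing Qj is this sketch's addition
            if f.qk == fu:
                f.rk, f.qk = True, None
        self.result[self.fu[fu].fi] = None  # Result(Fi(FU)) <- 0
        self.fu[fu] = FUStatus()            # Busy(FU) <- No


sb = Scoreboard(["Mult", "Add", "Div"])
sb.issue("Mult", "MULTD", "F0", "F2", "F4")
sb.issue("Div", "DIVD", "F10", "F0", "F6")  # RAW on F0: must wait for Mult
sb.issue("Add", "ADDD", "F6", "F8", "F2")   # WAR on F6 against DIVD
sb.read_operands("Add")
print(sb.can_read_operands("Div"))  # False: F0 not yet written
print(sb.can_write_result("Add"))   # False: DIVD has not read F6 yet
```

In this usage example the three instructions and register names are made up for illustration. Once `Mult` reads its operands and writes F0, `Div` can read its operands, and only then may `Add` write F6.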