Pipelining - II
Adapted from CS 152C (UC Berkeley) lecture notes of Spring 2002

Revisiting Pipelining Lessons
[Figure: tasks A, B, C, D in task order against time (6 PM to 9); stage times 30, 40, 40, 40, 40, 20]
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• Stall for dependences

Revisiting Pipelining Hazards
• Structural Hazards – Hardware design
• Control Hazard – Decision based on results
• Data Hazard – Data Dependency

Control Signals for existing Datapath
IF: Instruction Fetch
ID: Instruction Decode/register file read
EX: Execute/address calculation
MEM: Memory Access
WB: Write back
[Figure: datapath with PC, Instruction Memory, Registers, Sign Extend (16 to 32), Shift left 2, adders, Data Memory and multiplexers]
• The right-to-left control can lead to hazards

Place registers between each step
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB between the stages]

Example
10   lw   r1, r2(35)
14   addI r2, r2, 3
20   sub  r3, r4, r5
24   beq  r6, r7, 100
30   ori  r8, r9, 17
34   add  r10, r11, r12
100  and  r13, r14, 15

Start: Fetch 10
[Figure: pipelined datapath; PC = 10, the instruction at 10 is in IF, all later stages are empty (n)]
Fetch 14, Decode 10
[Figure: PC = 14; lw r1, r2(35) is in Decode (ID), addI is in IF, later stages are empty (n)]

Fetch 20, Decode 14, Exec 10
[Figure: PC = 20; lw is in Exec with operands r2 and 35, addI r2, r2, 3 is in Decode]

Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: PC = 24; lw is in Mem with address r2+35, addI is in Exec with operands r2 and 3, sub r3, r4, r5 is in Decode]

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: PC = 30; lw is in WB with M[r2+35], addI is in Mem with r2+3, sub is in Exec with operands r4 and r5, beq r6, r7, 100 is in Decode]
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: PC = 100; r1 = M[r2+35] has been written to the register file, addI is in WB with r2+3, sub is in Mem with r4-r5, beq is in Exec with r6, r7 and 100, ori r8, r9, 17 is in Decode]

Pipelining Load Instruction
           Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw     Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw              Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                       Ifetch   Reg/Dec  Exec     Mem      Wr
• The five independent functional units in the pipeline datapath are:
– Instruction Memory for the Ifetch stage
– Register File’s Read ports (bus A and bus B) for the Reg/Dec stage
– ALU for the Exec stage
– Data Memory for the Mem stage
– Register File’s Write port (bus W) for the Wr stage

Pipelining the R Instruction
           Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type     Ifetch   Reg/Dec  Exec     Wr
• Ifetch: Instruction Fetch
– Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec:
– ALU operates on the two register operands
– Update PC
• Wr: Write the ALU output back to the register file

Pipelining Both L and R type
           Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type     Ifetch   Reg/Dec  Exec     Wr
R-type              Ifetch   Reg/Dec  Exec     Wr
Load                         Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                Ifetch   Reg/Dec  Exec     Wr
R-type                                         Ifetch   Reg/Dec  Exec     Wr
Oops! We have a problem! The Load and the R-type that follows it both reach Wr in Cycle 7.
• We have a pipeline conflict or structural hazard:
– Two instructions try to write to the register file at the same time!
– Only one write port

Important Observations
• Each functional unit can only be used once per instruction
• Each functional unit must be used at the same stage for all instructions:
– Load uses Register File’s Write Port during its 5th stage
           1        2        3        4        5
Load       Ifetch   Reg/Dec  Exec     Mem      Wr
– R-type uses Register File’s Write Port during its 4th stage
           1        2        3        4
R-type     Ifetch   Reg/Dec  Exec     Wr

Solution
• Delay R-type’s register write by one cycle:
– Now R-type instructions also use Reg File’s write port at Stage 5
– Mem stage is a NOOP stage: nothing is being done.
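
The write-port conflict and its fix described above can be checked with a small Python sketch. It assumes one instruction enters Ifetch per cycle and nothing stalls; the names STAGES_UNPADDED, STAGES_PADDED and write_port_conflicts are illustrative and not part of the lecture.

```python
# Stage sequences per instruction type (stage names follow the slides).
STAGES_UNPADDED = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}
# Same pipeline, but R-type passes through a do-nothing Mem stage.
STAGES_PADDED = {
    "R-type": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
    "Load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(program, stages):
    """Return {cycle: [instruction indices]} for every cycle in which
    more than one instruction occupies the Wr stage."""
    writers = {}
    for index, kind in enumerate(program):
        issue_cycle = index + 1  # assumption: one instruction issued per cycle
        wr_cycle = issue_cycle + stages[kind].index("Wr")
        writers.setdefault(wr_cycle, []).append(index)
    return {cycle: ids for cycle, ids in writers.items() if len(ids) > 1}


# The instruction mix from the "Pipelining Both L and R type" slide.
program = ["R-type", "R-type", "Load", "R-type", "R-type"]

print(write_port_conflicts(program, STAGES_UNPADDED))  # {7: [2, 3]}
print(write_port_conflicts(program, STAGES_PADDED))    # {}
```

With the unpadded stages, the Load (index 2) and the next R-type (index 3) both need the write port in cycle 7. Once every instruction writes in its 5th stage, no two instructions can share a Wr cycle.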
           1        2        3        4        5
R-type     Ifetch   Reg/Dec  Exec     Mem      Wr

           Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type     Ifetch   Reg/Dec  Exec     Mem      Wr
R-type              Ifetch   Reg/Dec  Exec     Mem      Wr
Load                         Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                         Ifetch   Reg/Dec  Exec     Mem      Wr

Datapath (Without Pipeline)
All instructions:  IR <- Mem[PC]; PC <- PC+4;  then  A <- R[rs]; B <- R[rt]
R-type:  S <- A + B;     R[rd] <- S
ori:     S <- A or ZX;   R[rt] <- S
load:    S <- A + SX;    M <- Mem[S];   R[rd] <- M
store:   S <- A + SX;    Mem[S] <- B
branch:  if Cond, PC <- PC+SX
[Figure: datapath with Next PC, PC, Inst. Mem, IR, Reg File, Exec, Mem Access, Data Mem and Reg. File, holding the values A, B, S, M, D]

Datapath (With Pipeline)
All instructions:  IR <- Mem[PC]; PC <- PC+4;  then  A <- R[rs]; B <- R[rt]
R-type:  S <- A + B;     M <- S;        R[rd] <- M
ori:     S <- A or ZX;   M <- S;        R[rt] <- M
load:    S <- A + SX;    M <- Mem[S];   R[rd] <- M
store:   S <- A + SX;    Mem[S] <- B
branch:  if Cond, PC <- PC+SX
[Figure: the same datapath with the values A, B, S, M, D held between stages]

Structural Hazard and Solution
[Figure: pipeline diagram of Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order against time (clock cycles); each instruction passes through Mem, Reg, ALU, Mem, Reg]

Control Hazard - #1 Stall
[Figure: pipeline diagram of Add, Beq, Load in instruction order against time (clock cycles); the cycles after Beq are marked “Lost potential”]
• Stall: wait until decision is clear
• Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow

Control Hazard - #2 Predict
[Figure: pipeline diagram of Add, Beq, Load in instruction order against time (clock cycles)]
• Predict: guess one direction, then back up if wrong
• Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right 50% of time)
• More dynamic scheme: history of 1 branch

Control Hazard - #3 Delayed Branch
[Figure: pipeline diagram of Add, Beq, Misc, Load in instruction order against time (clock cycles)]
• Delayed Branch: redefine branch behavior (the branch takes place after the next instruction)
• Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the “slot” (roughly 50% of the time)

Data Hazards (RAW)
• Dependencies backwards in time are hazards
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram of the five instructions above in instruction order against time (clock cycles), through stages IF, ID/RF, EX, MEM, WB]

Data Hazards [contd…]
• “Forward” result from one stage to another
[Figure: the same five instructions and stages, with results forwarded between stages]

Data Hazards [contd…]
• Dependencies backwards in time are hazards
lw  r1,0(r2)
sub r4,r1,r3
[Figure: pipeline diagram of the two instructions above against time (clock cycles), with a Stall before sub]
• Can’t solve with forwarding
• Must delay/stall instruction dependent on loads

Hazard Detection
[Figure: pipeline timing sketches, one per hazard type: Structural Hazard, RAW (read after write) Data Hazard, Control Hazard, WAW (write after write) Data Hazard, WAR (write after read) Data Hazard]

Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
CPSC614 Lec 2.28

Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5

Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes operand before InstrI writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
• Can’t happen in MIPS 5 stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes

Hazard Detection
• Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
[Figure: instruction movement; a new instruction (Inst I) enters behind Inst J inside the window on execution]
• Only pending instructions can cause hazards
• A RAW hazard exists on register $\rho$ if $\rho \in \mathrm{Rregs}(i) \cap \mathrm{Wregs}(j)$
• A WAW hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Wregs}(j)$
• A WAR hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Rregs}(j)$

Computing CPI
• Start with Base CPI
• Add stalls
$$CPI = CPI_{base} + CPI_{stall}$$
$$CPI_{stall} = STALL_{type1} \times freq_{type1} + STALL_{type2} \times freq_{type2}$$
• Suppose:
– $CPI_{base} = 1$
– $freq_{branch} = 20\%$, $freq_{load} = 30\%$
– Branches always cause a 1 cycle stall
– Loads cause a 2 cycle stall
• Then:
$$CPI = 1 + (1 \times 0.20) + (2 \times 0.30) = 1.8$$

Summary
• Control signals need to be propagated
• Insert registers between every stage to “remember” and “propagate” values
• Solutions to Control Hazards are Stall, Predict and Delayed Branch
• The solution to Data Hazards is “Forwarding”; an instruction that depends on a load must still stall
• Effective CPI: $CPI = CPI_{ideal} + CPI_{stall}$
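
The hazard-detection conditions and the CPI formula above can be sketched in a few lines of Python. This is only an illustration: the function names classify_hazards and effective_cpi, and the way register sets are passed in, are assumptions and not part of the lecture.

```python
def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Registers involved in each hazard between instruction i (about to
    be issued) and a predecessor instruction j still in the pipeline."""
    return {
        "RAW": set(i_reads) & set(j_writes),   # Rregs(i) ∩ Wregs(j)
        "WAW": set(i_writes) & set(j_writes),  # Wregs(i) ∩ Wregs(j)
        "WAR": set(i_writes) & set(j_reads),   # Wregs(i) ∩ Rregs(j)
    }


def effective_cpi(cpi_base, stalls):
    """CPI = CPI_base + sum(stall_cycles * frequency) over stall types.
    `stalls` is an iterable of (stall_cycles, frequency) pairs."""
    return cpi_base + sum(cycles * freq for cycles, freq in stalls)


# j: add r1,r2,r3 (pending)    i: sub r4,r1,r3 (about to be issued)
hazards = classify_hazards(
    i_reads={"r1", "r3"}, i_writes={"r4"},
    j_reads={"r2", "r3"}, j_writes={"r1"},
)
print(hazards["RAW"])  # {'r1'}

# Branches: 1-cycle stall, 20% of instructions; loads: 2-cycle stall, 30%.
cpi = effective_cpi(1.0, [(1, 0.20), (2, 0.30)])
print(round(cpi, 2))  # 1.8
```

The RAW set is non-empty only because sub reads r1 while add has not yet written it; the WAW and WAR sets come out empty for this pair.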