MIT OpenCourseWare http://ocw.mit.edu 6.004 Computation Structures Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. Pipeline Issues This pipeline stuff makes my head hurt! Recalling Data Hazards Maybe it’s that dumb hat PROBLEM: Subsequent instructions can reference the contents of a register well before the pipeline stage where the register is written. i IF ADD RF i+1 i+2 i+3 CMP MUL SUB ADD CMP MUL SUB ADD CMP MUL SUB ADD CMP MUL ALU WB Home Stretch: Lab #8 due Thursday 5/7; Quiz 5 FRIDAY 5/8! 5/5/09 6.004 – Spring 2009 modified 5/4/09 10:06 L23 – Pipeline Issues 1 RF IR RF RA1 SUB(r0,r4,r5) RD1 ALU IR Register File A ALU RA2 RD2 Bypass muxes B A i+4 i+5 i+6 SUB SOLUTION #1: Deal with it in SOFTWARE; expose the pipeline for all to see. SOLUTION #2: Add special hardware to maintain the sequential execution semantics of the ISA. 5/5/09 6.004 – Spring 2009 Bypass Paths IF ADD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r3, 100, r4) SUB(r0, r4, r5) L23 – Pipeline Issues 2 Load Hazards Add special data paths, called BYPASSES, that route the results of the ALU and WB stages to the RF stage, thus substituting the register’s old contents with a value that will be written to that register at some point in the future. Consider LOADS: Can we fix all these problems using bypass paths? LD B LD(r1, 0, r4) ADD(r4, r1, r5) XOR(r3, r4, r6) ADD XOR LD ADD XOR LD ADD XOR LD ADD ALU MULC(r3,100,r4) Y Detection of these cases has to be incorporated into the decoding logic of the RF stage, which basically looks at the instructions in the ALU and WB stage to see if their destination register matches a source register reference. WB WB Y IR WB CMPLEC(r3,100,r0) WA Register File WD WE XOR The hazard between the XOR and the LD can be addressed by our established bypass paths… But there are some problems that BYPASSING CAN’T FIX! 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 3 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 4 ILL XAdr OP JT PCSEL 4 3 2 1 Load Hazard (easy) 0 IF 00 PC A D +4 Structural Data Hazard Instruction Memory Instruction Fetch RF RF IR PC Ra <20:16> + Rb: <15:11> The XOR hazard is pretty easy, but… Rc <25:21> RA2SEL C: <15:0> << 2 sign-extended RA1 WA RD1 XOR(r3, r4, r6) Z Register RA2 File RD2 Register File JT C: <15:0> sign-extended <PC>+C ASEL 1 ALU ALU IR PC 1 0 A BSEL 0 ALU D B A B ALU ALUFN Y MEM MEM MEM IR PC ALU MEM Y D Adr WD R/W Data Memory RD LD(r1, 0, r4) The XOR operand r4 can simply be bypassed from the output of the memory in the WB stage to the RF stage… by our normal bypass path. Rc <25:21> WASEL 0 1 0 WA 1 2 Write Back WDSEL RegisterWD File WE XOR LD ADD ??? LD XOR ADD XOR LD ADD XOR WERF 5/5/09 6.004 – Spring 2009 ADD How do we fix this one? In a 4-stage pipeline, for a LD instruction fetched during clock i, the data from memory isn’t returned from memory until late into cycle i+3. Bypassing can fix the XOR but not ADD! XP (NB: SAME RF AS ABOVE!) LD LD(r1, 0, r4) ADD(r1, r4, r5) XOR(r3, r4, r6) L23 – Pipeline Issues 5 5/5/09 6.004 – Spring 2009 L23 – Pipeline Issues 6 ILL XAdr OP JT PCSEL 4 3 2 1 Load Hazard (hard) 0 IF 00 PC A D +4 Instruction Fetch RF RF IR PC Ra <20:16> + Rb: <15:11> RA1 WA RD1 Z Register RA2 File RD2 C: <15:0> sign-extended ASEL ALU Register File JT <PC>+C PC Rc <25:21> RA2SEL C: <15:0> << 2 sign-extended ADD(r4, r1, r5) ALU IR 1 1 0 A 0 BSEL ALU D B A ALUFN LD(r1, 0, r4) B ALU Y MEM PC MEM ALU MEM IR D Adr WD R/W RD (NB: SAME RF AS ABOVE!) The r4 operand to the ADD instruction hasn’t yet been fetched from memory. 0 1 WA 0 1 2 RegisterWD File WE LD It exists NOWHERE on our data paths – we can’t solve this problem by bypassing! WDSEL LD(r1, 0, r4) ADD(r1, r4, r5) XOR(r3, r4, r6) ADD XOR XOR LD ADD ADD XOR LD NOP ADD XOR LD NOP ADD XOR If the compiler knows about a machine’s load delay, it can often rearrange code sequences to eliminate such hazards. Many compilers provide machine-specific instruction scheduling. XP Rc <25:21> WASEL Bypassing can’t fix the problem with ADD since the data simply isn’t available! We have to add some pipeline interlock hardware to stall ADD’s execution. MEM Y Data Memory 6.004 – Spring 2009 Load Delay Instruction Memory Write Back WERF 5/5/09 L23 – Pipeline Issues 7 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 8 ILL XAdr OP JT PCSEL 4 2 3 LE 1 Stall Logic 0 IF 00 PC STALL D +4 LE RF PC Memory Timing & Pipelining Instruction Memory A Instruction Fetch But, but, what about FASTER processors? RF IR LE Ra <20:16> Rb: <15:11> Rc <25:21> FACT: Processors have become very fast relative to memories! And this gap continues to grow… RA2SEL + C: <15:0> << 2 sign-extended RA1 WA RD1 ADD(r4, r1, r5) Z Register RA2 File RD2 Register File JT C: <15:0> sign-extended <PC>+C NOP AnnulRF 1 0 ASEL 1 ALU ALU IR PC 1 0 A A ALU D B ALU ALUFN LD(r1, 0, r4) NOP BSEL 0 B Y MEM MEM MEM IR PC ALU MEM Y D Adr WD (1) Freeze IF, RF stages (2) Introduce NOP into ALU stage (3) Wait until operand is available R/W Data Memory LD(r1, 0, r4) RD Rc <25:21> WASEL 0 1 0 WA 1 2 Write Back WDSEL RegisterWD File WE WERF 5/5/09 6.004 – Spring 2009 ALTERNATIVE: Longer pipelines. 1. Add “MEMORY WAIT” stages between START of read operation & return of data. 2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once. These steps add load delay slots; hence 3. Stall pipeline on unbypassable load delays. A 4-Stage pipeline requires READ access in less than one clock. A 5-Stage pipeline would allow nearly two clocks... XP (NB: SAME RF AS ABOVE!) Do we just increase the clock period to accommodate this bottleneck component? L23 – Pipeline Issues 9 ILL XAdr OP JT PCSEL 4 3 5/5/09 6.004 – Spring 2009 L23 – Pipeline Issues 10 ILL XAdr OP JT 2 1 0 IF • Omits some detail • NO bypass or interlock logic 00 PC A Instruction Memory D +4 Instruction Fetch IRRF RF PC Ra <20:16> + Rb: <15:11> C: <15:0> << 2 sign-extended RA1 WA RD1 Z ALU IR 1 0 1 A • Omits some detail • NO bypass or interlock logic 00 A Instruction Memory D +4 Instruction Fetch NOP IRRF RF PC Register File 0 RA1 WA RD1 Z NOP ALU ALU D PC IR ASEL ALU IR 0 1 A MEM MEM PC R/W ALU D B MEM ALU MEM IR MEM Y D Memory Adr WB WB PC WB 0 1 WA 6.004 – Spring 2009 0 1 2 Data Memory RD Register WD File WE Write Back WDSEL WERF 5/5/09 R/W XP Data needed right before rising clock edge at end of Write Back pipe stage L23 – Pipeline Issues 11 Rc <25:21> WASEL 0 1 WA 6.004 – Spring 2009 • added IRIF mux to annul branch-slot instructions • added A/B bypass muxes to get data before it’s written to regfile Y almost 2 clock cycles RD XP Rc <25:21> WASEL WD WB IR We wanted a simple, clean pipeline but… BSEL ALU Address available right after instruction enters Memory pipe stage Y Data Memory 0 Y D WD Register File B A ALUFN Memory WB 1 ALU Y IR Register File RD2 JT B MEM Rc <25:21> RA2 C: <15:0> sign-extended ALU MEM Rb: <15:11> 5-stage pipeline RA2SEL C: <15:0> << 2 sign-extended <PCRF>+4+C*4 BSEL B Adr WB 0 Ra <20:16> Y PC 1 IF PC + Register File RD2 A PC 2 RA2 C: <15:0> sign-extended ALUFN MEM 3 Rc <25:21> JT ASEL ALU 4 RA2SEL <PCRF>+4+C*4 PC 5-stage Pipeline PCSEL 0 1 2 WDSEL Register WD File WE WERF Write Back 5/5/09 • added LE/muxes to freeze IF/RF stage so we can wait for LD to reach WB stage L23 – Pipeline Issues 12 RF-stage Bypass Details 1, 2, 3, 4… Hum, it looks like we have a few extra inputs on that mux Note: can use “distributed” mux built from tristate drivers. Register File RD1 RD2 from ALU/MEM/WB Bypass Implementation from ALU/MEM/WB A Bypass B Bypass Part 2 of the WDSEL mux Part 1 of the WDSEL mux from mem YMEM Bypass From WB to regs • from regs • • • Bypass From MEM • • • • YALU Bypass From ALU • • • • A B alu PCALU • • to alu JT PC+4+4*SXT(C) Z 1 0 SXT(C) 1 ASEL A 0 Select “0” BSEL B We’ve been a little sloppy about this detail To ALU D To ALU 5/5/09 6.004 – Spring 2009 To Mem L23 – Pipeline Issues 13 Bypass Logic To reduce the amount of bypass logic, the WDSEL mux has been split: choice between ALU and PC+4 is made in ALU stage, choice between ALU/PC and MEMDATA is made is WB stage. 5/5/09 6.004 – Spring 2009 Linkage Register Write Timing | The code: Assume Reg[LP] = 100... ADD(r31, r31, LP) BR(f, LP) | Reg[LP] = .+4 x: SUB(LP, 1, LP) ... f: XOR(LP, r31, r0) OR(r31, LP, r1) ADD(r31, LP, r2) Beta Bypass logic (need two copies for A/B data): Ra or Rb/Rc Regfile (no bypass) Rc (WB)* WB bypass Rc (MEM)* MEM bypass IF ALU bypass Rc (ALU)* RF Select “0” “31” ALU MEM 5-bit compare WB * If instruction is a ST (doesn’t write into regfile), set RC for ALU/MEM/WB to R31 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 14 L23 – Pipeline Issues 15 i i+1 ADD BR ADD i+2 i+4 SUB XOR OR BR NOP XOR ADD BR NOP ADD BR ADD BR Decision Time 6.004 – Spring 2009 i+3 ADD writes 5/5/09 i+5 ADD OR XOR NOP BR i+6 ADD OR XOR NOP Can we make XOR’s regfile access work by bypassing? BR writes L23 – Pipeline Issues 16 ILL XAdr OP PCSEL 4 3 JT 2 1 0 STALL PCSEL 00 A Instruction Memory +4 D NOP Instruction Fetch AnnulIF 0 1 STALL STALL IRRF PCRF Ra:<20:16> PCRF+4+4*SXT(C) + RA1 RD1 XOR(r28…) Rb:<15:11> Rc:<25:21> 0 1 Register File RA2 RA2SEL RD2 BYPASSES NOP BYPASSES Register File JT SXT(C) PCRF+4+4*SXT(C) For BR/JMP, Rc value is taken from PC, not ALU. Z AnnulRF 1 0 PCALU 1 IRALU 0 ASEL 1 A A,B Bypass DALU B A ALU ALUFN NOP BSEL 0 B Y ALU A, B BYPASS PCMEM Memory A,B Bypass Adr BR(…,r28) WD R/W A, B BYPASS PCWB IRWB Data Memory YWB RD 0 1 2 Register WD File WE WERF 6.004 – Spring 2009 ILL XAdr OP 4 3 2 1 PCSEL 5/5/09 L23 – Pipeline Issues 17 | Operating System: UUO Handler IllOp: ST(r0,...) | Save a reg, LD(xp,-4,r0) | Fetch bad instr ... LD(...,r0) | Restore regs, JMP(xp) | Return to pgm. 5/5/09 6.004 – Spring 2009 Illegal Opcode Traps 0 00 A +4 L23 – Pipeline Issues 18 Instruction Memory Taking Exception… D NOP 0 1 STALL Instruction Fetch AnnulIF STALL IRRF PCRF Ra:<20:16> PCRF+4+4*SXT(C) BAD BNE(…,XP) + RA1 RD1 Rb:<15:11> Rc:<25:21> 0 1 Register File RA2 RA2SEL RD2 BYPASSES NOP BYPASSES Register File Z AnnulRF 2 1 0 1 IRALU 0 ASEL A ALUFN 1 0 • PC address of IllOp handler BSEL DALU B B A A, B BYPASS Y ALU • Annul instruction in IF DMEM YMEM IRMEM In general, we’d like to annul ALL instructions following one that causes a trap or fault: FREEZE state at time of exception, for inspection by handler code. ILLEGAL INSTRUCTIONS are recognizable in RF stage of pipe; are ALL faults & traps? CONSIDER: ALU A, B BYPASS PCMEM Bad opcode decoded in RF stage: JT SXT(C) PCRF+4+4*SXT(C) BNE(R31,0,XP) PCALU ARITHMETIC EXCEPTIONS: divide by zero, etc. Memory A, B BYPASS Adr WD R/W A, B BYPASS PCWB IRWB Data Memory YWB • Force BNE(R31,0,XP) in RF stage will save PC+4 in XP when it reaches WB stage • Caught by ALU subsystem, during processing of data in ALU stage MEMORY FAULTS: Program reference to illegal memory location... • Caught by MEMORY subsystem, during processing of address input in MEM stage RD 0 1 2 Rc:<25:21> WDSEL Write Back A, B BYPASS WA 6.004 – Spring 2009 | Illegal instr. JT STALL ??? NOP IMPLEMENTATION: On Bad Opcode (discovered in RF Stage): • Select IllOp adr as next PC • Annul instruction in IF stage • Substitute BNE(r31,0,XP) for bad instruction - will (eventually) store PC+4 into XP ... need bypass paths to make XP usable immediately by code at IllOp! | User program: ... BAD(...) r: ... Write Back A, B BYPASS PCSEL PCWB is already taken care of if we bypass WB stage from output of WDSEL mux. IDEA: TRAP illegal instructions to a special routine in the Operating System, which can • Interpret them in software; or • Print humane error report. WDSEL Rc:<25:21> WA So we have to add bypass paths for PCALU and PCMEM. DMEM YMEM IRMEM Unused Opcode Traps BR/JMP PC bypass Register WD File WE WERF 5/5/09 L23 – Pipeline Issues 19 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 20 ILL XAdr OP PCSEL 4 3 JT 2 1 Annulment Logic 0 STALL PCSEL 00 A Instruction Memory +4 Asynchronous I/O Interrupts D NOP Instruction Fetch AnnulIF 0 1 STALL STALL IRRF PCRF Ra:<20:16> PCRF+4+4*SXT(C) + Rb:<15:11> RA1 RD1 Rc:<25:21> 0 1 Register File RA2 RD2 BYPASSES NOP BYPASSES Register File • PC address of fault handler ALU • Force BNE(R31,0,XP) in ??? stage will save PC+4 in XP when it reaches WB stage JT SXT(C) PCRF+4+4*SXT(C) BNE(R31,0,XP) Z AnnulRF PCALU 2 1 0 1 IRALU A, B BYPASS 0 ASEL 1 0 A B A B BSEL DALU ALU ALUFN Y NOP BNE(R31,0,XP) AnnulALU PCMEM 2 1 0 A, B BYPASS DMEM YMEM IRMEM Fault in ??? stage: Memory A, B BYPASS MEM FAULT Adr WD Take, for example, | The interrupted code: ... Interrupt ADD(...) Taken SUB(...) HERE MUL(...) XOR(...) ... R/W NOP BNE(R31,0,XP) AnnulMEM PCWB 2 1 0 • Annul all following instructions (those earlier in the pipeline): called “flushing the pipe” A, B BYPASS IRWB Data Memory YWB RD 0 1 2 Rc:<25:21> WDSEL Register WD File WE 6.004 – Spring 2009 WERF 5/5/09 L23 – Pipeline Issues 21 Interrupt Occurs PC = Xadr?? i ADD RF ALU MEM WB i+1 OR ADD ... ADD(...) SUB(...) MUL(...) XOR(...) ... OR(...) ... JMP(xp) Interrupt Taken HERE 5/5/09 6.004 – Spring 2009 Asynchronous Interrupt Timing xh: ... OR ADD i+3 i+4 Alternative: When taking interrupt, • ANNUL instruction in IF stage... BUT instead of changing it to a NOP, change it to BNE(r31,0,XP) This will cause PC+4 of annulled instruction to be written to XP! | interrupt handler i+5 • CODE HANDLER to return to Reg[XP]-4 (since the annulled instruction is never executed) PROBLEM: When does old PC+4 get written to XP??? ... ADD OR ADD IF ADD RF ALU MEM ... OR i+1 i+2 i+3 OR BNE ... ... ... i+4 i+5 i+6 i+6 ... ... OR L23 – Pipeline Issues 22 Making Interrupts Work i i+2 Can this work??? Let’s find out… | The interrupt handler: xh: OR(...) ... JMP(xp) Write Back A, B BYPASS WA IF Suppose key struck, interrupt requested (via IRQ) during the fetch of ADD. Then let’s • Select XAdr (handler) as next PC • Leave ADD in pipeline; NO annulment! • Code handler to return to SUB instruction. This should be easy. RA2SEL OR BNE OR BNE WB ... ... OR ... BNE OR Interrupt taken 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 23 6.004 – Spring 2009 5/5/09 L23 – Pipeline Issues 24 ILL XAdr OP “Smart” Interrupt Handler PCSEL 4 3 5-stage Pipeline: Final Version JT 2 1 0 STALL PCSEL 00 A +4 Instruction Memory D NOP BNE(R31,0,XP) 0 1 2 STALL IRRF PCRF | The interrupted code: ... Interrupt taken HERE, ADD(...) ADD instruction annulled SUB(...) MUL(...) XOR(...) ... | The interrupt handler: xh: OR(...) ... SUBC(xp,4,xp) | Adjust XP so we JMP(xp) | return to annulled | instruction (ADD) Instruction Fetch AnnulIF STALL Ra:<20:16> PCRF+4+4*SXT(C) + RA1 RD1 Rb:<15:11> 0 1 Register File RA2 RA2SEL RD2 BYPASSES NOP BYPASSES PCALU 2 1 0 1 IRALU A, B BYPASS 0 ASEL 1 0 A B A B ALUFN NOP BSEL DALU ALU Y ALU BNE(R31,0,XP) AnnulALU PCMEM 2 1 0 A, B BYPASS DMEM YMEM IRMEM Memory Adr AnnulMEM PCWB 2 1 0 Data Memory YWB • Can bypass results from ALU, MEM and WB back to RF RD L23 – Pipeline Issues 25 Register WD File WE 6.004 – Spring 2009 Pipeline Review WDSEL Write Back WERF 5/5/09 L23 – Pipeline Issues 26 RISC = Simplicity??? • memory data available only in WB stage Introduce NOPs at IRALU, stall IF and RF stages until LD result ready “The P.T. Barnum World’s Tallest Dwarf Competition” World’s Most Complex RISC? • handle RF/ALU/MEM stage exceptions Save PC+4 in XP (fake a BR) annul following insts. (those earlier in pipeline) 2-Stage pipeline: • increased throughput (<2x) • introduced branch delay slots Choice of executing or annulling inst. after branch 5-stage pipeline: Addressing features, eg index registers • implement interrupts Throw away IF inst., save PC+4 in XP, fix return Pipelines, Bypasses, Annulment, …, ... 5/5/09 L23 – Pipeline Issues 27 Primitive Machines: direct implementations Generalization of registers and operand coding • extra HW due to pipelining Registers to hold values between stages Data bypass muxes in RF stage Inst. muxes for “rewriting” code to annul or save PC • branch delay slots • delayed register writeback (3 cycles) Add bypass paths (10) to access correct value ? VLIWs, Super-Scalars • increased throughput (3x???) 6.004 – Spring 2009 R/W A, B BYPASS IRWB A, B BYPASS • 1 cycle/instruction • long cycle time: mem+regs+alu +mem WD NOP BNE(R31,0,XP) WA Simple unpipelined Beta: • Can stall IF and RF stages while waiting for LD result to reach WB A, B BYPASS 0 1 2 5/5/09 • Can force instruction to BNE(R31,0,XP) in each stage will save PC+4 in XP when it reaches WB stage Z AnnulRF Rc:<25:21> 6.004 – Spring 2009 Register File JT SXT(C) PCRF+4+4*SXT(C) BNE(R31,0,XP) • Can annul instruction at each stage Rc:<25:21> RISCs 6.004 – Spring 2009 Complex instructions, addressing modes 5/5/09 L23 – Pipeline Issues 28