MIT OpenCourseWare http://ocw.mit.edu 6.004 Computation Structures Spring 2009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. CPU Performance Pipelining the Beta We’ve got a working Beta… can we make it fast? betta ('be-t&) n. Any of various species of small, brightly colored, long-finned freshwater fishes of the genus Betta, found in southeast Asia. MIPS = Millions of Instructions/Second maybe they’ll give me partial credit... beta (‘bA-t&, ‘bE-) n. 1. The second letter of the Greek alphabet. 2. The exemplary computer system used in 6.004. MIPS = Freq CPI Freq = Clock Frequency, MHz CPI = Clocks per Instruction I don’t think they mean the fish... To Increase MIPS: 1. DECREASE CPI. - RISC simplicity reduces CPI to 1.0. - CPI below 1.0? Tough... you’ll see multiple instruction issue machines in 6.823. 2. INCREASE Freq. - Freq limited by delay along longest combinational path; hence - PIPELINING is the key to improved performance through fast clocks. Lab #7 due Tonight! 4/30/09 6.004 – Spring 2009 LDR(X,R3) LD(R1,10,R0) L22 – Pipelining the Beta 1 Beta Timing CLK New PC PC+4 modified 4/27/09 11:17 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 2 Why isn’t this a 20-minute lecture? We’ve learned how to pipeline combinational circuits. Fetch Inst. Control Logic +OFFSET Wanted: longest paths What’s the big deal? RA2SEL mux Complications: Read Regs =0? ASEL mux BSEL mux • operations have variable execution times (eg, ALU) ALU Fetch data PCSEL mux WDSEL mux PC setup RF setup “precedence graph” 6.004 – Spring 2009 • some apparent paths aren’t “possible” • time axis is not to scale (eg, tPD,MEM is very big!) Mem setup CLK 4/30/09 L22 – Pipelining the Beta 3 1. The Beta isn’t combinational… Explicit state in register file, memory; Hidden state in PC. 2. Consecutive operations – instruction executions – interact: • Jumps, branches dynamically change instruction sequence • Communication through registers, memory Our goals: • Move slow components into separate pipeline stages, running clock faster • Maintain instruction semantics of unpipelined Beta as far as possible 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 4 First Steps: Ultimate Goal: 5-Stage Pipeline ILL XAdr OP 4 PCSEL GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU) 3 A Simple 2-Stage Pipeline JT 2 1 PCIF RF ALU MEM WB 4/30/09 6.004 – Spring 2009 L22 – Pipelining the Beta 5 <20:16> .. ADDC(r1, 1, SUBC(r1, 1, XOR(r1, r5, MUL(r2, r6, ... i i+1 i+2 IF ADDC SUBC XOR i+3 MUL ADDC SUBC XOR 4/30/09 RD1 1 0 1 A RA2SEL RA2 WD RD2 WE WERF 0 BSEL B ALU ALUFN WD R/W Wr Data Memory Adr RD PC+4 0 1 2 WDSEL 4/30/09 6.004 – Spring 2009 L22 – Pipelining the Beta 6 Pipeline Control Hazards LOOP: BUT consider instead: EXE i+6 1 JT ASEL ... MUL WA WA 0 Z <15:0> IF ADD i+5 Register File RA1 1 <25:21> i i+4 0 r2) r3) r1) r0) TIME (cycles) <25:21> <15:11> WASEL XP Executed on our 2-stage pipeline: Pipeline IREXE 00 + 2-Stage Pipelined Beta Operation Consider a sequence of instructions: EXE D Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to Register File stage: Reads source operands from register file, passes them to ALU stage: Performs indicated operation, passes result to Memory stage: If it’s a LD, use ALU result as an address, pass mem data (or ALU result if not LD) to Write-Back stage: writes result back into register file. IF Instruction Memory +4 PCEXE 6.004 – Spring 2009 00 A APPROACH: structure processor as 5-stage pipeline: EXE IF 0 i+1 i+2 i+3 i+4 CMP BT XOR ... ADD CMP BT ? i+5 ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) XOR(r3, -1, r3) MUL(r1, r2, r2) ... i+6 ... This is the cycle where the branch decision is made… but we’ve already fetched the following instruction which should be executed only if branch is not taken! ... L22 – Pipelining the Beta 7 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 8 Branch Delay Slots Branch Alternative 1 PROBLEM: One (or more) following instructions have been prefetched by the time a branch is taken. POSSIBLE SOLUTIONS: Make the hardware annul instructions in the branch delay slots of a taken branch. 1. Make hardware “annul” instructions following branches which are taken, e.g., by disabling WERF and WR. i NOP = ‘no-operation’, e.g. ADD(R31, R31, R31) 2. “Program around it”. Either IF ADD EXE a) Follow each BR with a NOP instruction; or LOOP: i+1 i+2 i+3 i+4 i+5 CMP BT XOR ADD CMP BT ADD CMP BT XOR ADD CMP Always execute instructions in delay slots ii. Conditionally execute instructions in delay slots 4/30/09 6.004 – Spring 2009 Branch taken Pros: same program runs on both unpipelined and pipelined hardware Cons: in SPEC benchmarks 14% of instructions are taken branches 12% of total cycles are annulled L22 – Pipelining the Beta 9 PCSEL 4 3 4/30/09 6.004 – Spring 2009 Branch Annulment Hardware ILL XAdr OP L22 – Pipelining the Beta 10 Branch Alternative 2a JT 2 1 PCIF 0 +4 PCEXE Fill branch delay slots with NOP instructions (i.e., the software equivalent of alternative 1) Instruction Memory 00 A D 0 1 NOP ANNULIF IREXE 00 <20:16> LOOP: ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) NOP() XOR(r3, -1, r3) MUL(r1, r2, r2) ... <25:21> <15:11> 0 + 1 RA2SEL WASEL XP <25:21> WA WA 0 ASEL Register File RA1 1 RD1 Z <15:0> RA2 WD RD2 WE i WERF i+1 i+2 i+3 i+4 i+5 CMP BT NOP ADD CMP BT ADD CMP BT NOP ADD CMP i+6 JT IF ADD 1 0 1 A 0 BSEL EXE B ALU ALUFN WD R/W Wr Branch taken Data Memory Adr RD Pros: same as alternative 1 Cons: NOPs make code longer; 12% of cycles spent executing NOPs PC+4 0 6.004 – Spring 2009 i+6 NOP b) Make compiler clever enough to move USEFUL instructions into the branch delay slots i. ADD(r1, r3, r3) CMPLEC(r3, 100, r0) BT(r0, LOOP) XOR(r3, -1, r3) MUL(r1, r2, r2) ... 4/30/09 1 2 WDSEL L22 – Pipelining the Beta 11 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 12 Branch Alternative 2b(i) Put USEFUL instructions in the branch delay slots; remember they will be executed whether the branch is taken or not i IF ADD EXE Branch Alternative 2b(ii) We need to add this silly instruction to UNDO the effects of LOOP: ADD(r1,r3,r3) LOOPx: CMPLEC(r3,100,r0) that last ADD BT(r0,LOOPx) ADD(r1,r3,r3) SUB(r3,r1,r3) XOR(r3,-1,r3) MUL(r1,r2,r2) ... i+1 i+2 i+3 i+4 CMP BT ADD CMP BT ADD CMP BT ADD CMP i+5 LOOP: ADD(r1, r3, r3) LOOPx: CMPLEC(r3, 100, r0) BT.taken(r0, LOOPx) ADD(r1, r3, r3) XOR(r3, -1, r3) MUL(r1, r2, r2) ... Put USEFUL instructions in the branch delay slots; annul them if branch doesn’t behave as predicted i+6 i+1 i+2 i+3 i+4 CMP BT ADD CMP BT ADD CMP BT ADD CMP i IF ADD ADD EXE BT Branch taken i+6 ADD BT Branch taken Pros: only two “extra” instructions are executed (on last iteration) Cons: finding “useful” instructions that are always executed is difficult; clever rewrite may be required. Program executes differently on naïve unpipelined implementation. 4/30/09 6.004 – Spring 2009 i+5 L22 – Pipelining the Beta 13 Pros: only one instruction is annulled (on last iteration); about 70% of branch delay slots can be filled with useful instructions Cons: Program executes differently on naïve unpipelined implementation; not really useful with more than one delay slot. 4/30/09 6.004 – Spring 2009 L22 – Pipelining the Beta 14 ILL XAdr OP JT Architectural Issue: Branch Decision Timing PCSEL 4 3 2 1 0 IF 00 PC A D +4 Instruction Fetch IR Ra <20:16> Wow! I guess those guys really were thinking when they made up all those instructions 6.004 – Spring 2009 C: <15:0> sign-extended RF ASEL (read) ALU PC A ALU IR 1 1 0 A Memory instruction 0 BSEL ALU D B B ALUFN CL Register File JT <PC>+C CL Rc <25:21> Register RA2 File RD2 A ALU ALTERNATIVES: • Compare-and-branch... (eg, if Reg[Ra] > Reg[Rb]) • MORE powerful, but • LATER decision (hence more delay slots) RA1 WA RD1 Z instruction Rb: <15:11> B ALU Y ALU Y MEM PC MEM MEM IR MEM Y D Adr instruction WD R/W Data Memory Y RD CL instruction Treat register file as two separate devices: combinational READ, clocked WRITE at end of pipe. ALU CL Write Back 4-Stage Beta Pipeline RA2SEL C: <15:0> << 2 sign-extended IF instruction Register File Instruction Fetch RF RF PC + BETA approach: • SIMPLE branch condition logic ... Test for Reg[Ra] = 0! • ADVANTAGE: early decision, single delay slot Instruction Memory RF (write) What other information do we have to pass down pipeline? PC (return addresses) instruction fields Suppose decision were made in the ALU stage ... then there would be 2 branch delay slots (and instructions to annul!) 4/30/09 L22 – Pipelining the Beta 15 (NB: SAME RF AS ABOVE!) 6.004 – Spring 2009 XP Rc <25:21> WASEL 0 1 WA 0 1 2 RegisterWD File WE Write Back WDSEL (decoding) What sort of improvement should expect in cycle time? WERF 4/30/09 L22 – Pipelining the Beta 16 4-Stage Beta Operation Consider a sequence of instructions: ... ADDC(r1, 1, SUBC(r1, 1, XOR(r1, r5, MUL(r2, r6, ... Pipeline “Data Hazard” CMPLEC(r3, 100, r0) r2) r3) r1) r0) MULC(r1, 100, r4) SUB(r1, r2, r5) IF TIME (cycles) Pipeline IF i+1 i+2 i+3 ADDC SUBC XOR RF MUL ADDC SUBC XOR ALU i+4 ADDC SUBC XOR WB i+5 ADD WB ... ADDC SUBC XOR ... L22 – Pipelining the Beta 17 SUB CMP MUL SUB r3 available L22 – Pipelining the Beta 18 Stall the pipeline: Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU-stage instruction register IF ADD(r1, r2, r3) ADD(r1, r2, r3) CMPLEC(r3, 100, r0) MULC(r1, 100, r4) RF SUB(r1, r2, r5) ALU CMPLEC(r3, 100, r0) WB How often can we do this? Programmer’s fallback: Insert NOPs (sigh!) 4/30/09 r3 fetched 4/30/09 i 6.004 – Spring 2009 i+6 SUB CMP MUL ADD 6.004 – Spring 2009 EXAMPLE: Rewrite SUB(r1, r2, r5) i+5 Data Hazard Solution 2 “Program around it” ... document weirdo semantics, declare it a software problem. - Breaks sequential semantics! - Costs code efficiency. as i+4 Oops! CMP is trying to read Reg[R3] during cycle i+2 but ADD doesn’t write its result into Reg[R3] until the end of cycle i+3! MUL Data Hazard Solution 1 MULC(r1, 100, r4) i+3 SUB CMP MUL ADD ALU MUL i+2 CMP MUL i+6 4/30/09 6.004 – Spring 2009 ADD RF ... MUL i+1 i Executed on our 4-stage pipeline: i ADD(r1, r2, r3) BUT consider instead: ADD i+1 i+2 i+3 i+4 i+5 SUB i+6 CMP MUL MUL MUL ADD CMP CMP CMP MUL ADD NOP NOP CMP MUL NOP CMP ADD NOP SUB Drawback: NOPs mean “wasted” cycles L22 – Pipelining the Beta 19 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 20 Data Hazard Solution 3 Bypass Paths (I) Bypass (aka forwarding) Paths: RF IR Add extra data paths & control logic to re-route data in problem cases. i ADD IF RF i+1 i+2 i+3 CMP MUL SUB ADD CMP i+4 i+5 RA1 CMPLEC(r3,100,r0) i+6 ALU IR RD1 Register File A SUB ALU ADD CMP MUL SUB WB ADD CMP MUL B i.e., instructions which use ALU to compute result Y WB WB Y and RaRF != R31 r3 available SUB Idea: the result from the ADD which will be written into the register file at the end of cycle I+3 is actually available at output of ALU during cycle I+2 – just in time for it to be used by CMP in the RF stage! 4/30/09 Bypass muxes SELECT this BYPASS path if OpCodeRF = reads Ra and OpCodeALU = OP, OPC and RaRF = RcALU ALU IR 6.004 – Spring 2009 RD2 B A ADD(r1,r2,r3) MUL RA2 L22 – Pipelining the Beta 21 WA 6.004 – Spring 2009 Bypass Paths (II) Register File WD WE 4/30/09 L22 – Pipelining the Beta 22 Next Time RF IR RA1 ADD(r1,r2,r3) RD1 ALU IR Register File A RA2 RD2 B A Bypass muxes B ALU MULC(r4,17,r5) Y But why can’t we get It from the register file? It’s being written this cycle! WB WB Y IR XOR(r2,r6,r1) WA 6.004 – Spring 2009 Register File More Beta Bypasses Ahead SELECT this BYPASS path if OpCodeRF = reads Ra and RaRF != R31 and not using ALU bypass and WERF = 1 and RaRF = WA WD WE 4/30/09 L22 – Pipelining the Beta 23 6.004 – Spring 2009 4/30/09 L22 – Pipelining the Beta 24