CENG 450
Computer Systems and Architecture
Lecture 6
Amirali Baniasadi
amirali@ece.uvic.ca

Overview of Today’s Lecture
• MIPS Pipelining

CPU Pipelining Example

Theoretically:
• Speedup should be equal to the number of stages (n tasks, k stages, p latency):

  \[ \text{Speedup} = \frac{n \cdot p}{\frac{p}{k}(n - 1) + p} \approx k \quad \text{(for large } n\text{)} \]

Practically:
• Stages are imperfectly balanced
• Pipelining needs overhead
• Speedup is less than the number of stages

If we have 3 consecutive instructions:
• Non-pipelined needs 8 x 3 = 24 ns
• Pipelined needs 14 ns
=> Speedup = 24 / 14 = 1.7

If we have 1003 consecutive instructions (the previous example plus 1000 more instructions):
• Non-pipelined total time = 1000 x 8 + 24 = 8024 ns
• Pipelined total time = 1000 x 2 + 14 = 2014 ns
=> Speedup ~ 3.98, close to 8 ns / 2 ns = 4: near-perfect speedup
=> Performance (throughput) increases for a larger number of instructions

MIPS: Software conventions for Registers

  No.   Name   Use
  0     zero   constant 0
  1     at     reserved for assembler
  2     v0     expression evaluation &
  3     v1     function results
  4     a0     arguments
  5     a1
  6     a2
  7     a3
  8     t0     temporary: caller saves
  ...          (callee can clobber)
  15    t7
  16    s0     callee saves
  ...          (caller can clobber)
  23    s7
  24    t8     temporary (cont’d)
  25    t9
  26    k0     reserved for OS kernel
  27    k1
  28    gp     pointer to global area
  29    sp     stack pointer
  30    fp     frame pointer
  31    ra     return address (HW)

Plus a 3-deep stack of mode bits.
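The numbers in the pipelining timing example earlier in this lecture can be rechecked with a short sketch. It assumes what the example's figures imply: 8 ns per instruction without pipelining, and a 5-stage pipeline with a 2 ns clock, so the first instruction takes 10 ns and each later one finishes 2 ns after its predecessor. The 5-stage depth and all function names are assumptions made for illustration.

```python
def unpipelined_time_ns(n, instr_ns=8):
    """Total time when each instruction finishes before the next one starts."""
    return n * instr_ns


def pipelined_time_ns(n, stages=5, cycle_ns=2):
    """Fill the pipeline once, then one instruction completes per cycle."""
    return (stages + n - 1) * cycle_ns


def speedup(n):
    """Measured speedup for n instructions under the assumptions above."""
    return unpipelined_time_ns(n) / pipelined_time_ns(n)


def ideal_speedup(n, k, p):
    """Idealized formula from the slide: n*p / ((p/k)*(n-1) + p)."""
    return (n * p) / ((p / k) * (n - 1) + p)


for n in (3, 1003):
    print(n, unpipelined_time_ns(n), pipelined_time_ns(n), round(speedup(n), 2))
# prints:
# 3 24 14 1.71
# 1003 8024 2014 3.98
```

The measured speedup approaches 8 / 2 = 4 rather than the stage count of 5 because the stages are not perfectly balanced: the pipelined clock (2 ns) is longer than one fifth of the unpipelined instruction time (8 ns).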
Example in C: swap

  void swap(int v[], int k)
  {
      int temp;
      temp = v[k];
      v[k] = v[k+1];
      v[k+1] = temp;
  }

° Assume swap is called as a procedure
° Assume temp is register $15; arguments in $a1, $a2; $16 is scratch reg
° Write MIPS code

swap: MIPS

  swap: addiu $sp,$sp,-4    ; create space on stack
        sw    $16,4($sp)    ; callee saved register put onto stack
        sll   $t2,$a2,2     ; multiply k by 4
        addu  $t2,$a1,$t2   ; address of v[k]
        lw    $15,0($t2)    ; load v[k]
        lw    $16,4($t2)    ; load v[k+1]
        sw    $16,0($t2)    ; store v[k+1] into v[k]
        sw    $15,4($t2)    ; store old value of v[k] into v[k+1]
        lw    $16,4($sp)    ; callee saved register restored from stack
        addiu $sp,$sp,4     ; restore top of stack
        jr    $31           ; return to place that called swap

5 Steps of MIPS Datapath
[Figure: MIPS datapath and control path split into five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back), separated by the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers.]

5 Steps of MIPS Datapath
[Figure: the same datapath, annotated with the instructions (Inst 1, Inst 2, Inst 3, ...) occupying the stages.]

Review: Visualizing Pipelining
[Figure: pipeline diagram, instruction order versus time in clock cycles (Cycle 1 to Cycle 7); each instruction passes through Ifetch, Reg, ALU, DMem, Reg.]
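The pipeline diagram reviewed above can be reproduced as text. The sketch below is illustrative only: it assumes an ideal 5-stage pipeline with no stalls, borrows the stage names from the figure (Ifetch, Reg, ALU, DMem, Reg), and the helper names `pipeline_diagram` and `show` are hypothetical.

```python
STAGES = ["Ifetch", "Reg", "ALU", "DMem", "Reg"]


def pipeline_diagram(num_instrs):
    """Return one row per instruction; row i holds the stage occupied in each cycle."""
    total_cycles = num_instrs + len(STAGES) - 1
    rows = []
    for i in range(num_instrs):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # instruction i is in stage s during cycle i + s
        rows.append(row)
    return rows


def show(rows):
    """Print the diagram with a header of cycle numbers."""
    header = [f"Cycle {c + 1}" for c in range(len(rows[0]))]
    for line in [header] + rows:
        print(" ".join(f"{cell:<8}" for cell in line))


show(pipeline_diagram(3))
```

With three instructions the diagram spans 3 + 5 - 1 = 7 cycles: each instruction starts one cycle after its predecessor, so every stage is busy with a different instruction once the pipeline is full.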
Limits to pipelining

Hazards: circumstances that would cause incorrect execution if the next instruction were launched
• Structural hazards: attempting to use the same hardware to do two different things at the same time
• Data hazards: instruction depends on the result of a prior instruction still in the pipeline
• Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Example: One Memory Port/Structural Hazard
[Figure: pipeline diagram over Cycle 1 to Cycle 7 for Load, Instr 1, Instr 2, Instr 3 and Instr 4 sharing one memory port; the conflict is marked "Structural Hazard".]

Resolving structural hazards

Defn: attempt to use the same hardware for two different things at the same time
• Solution 1: Wait
  - must detect the hazard
  - must have a mechanism to stall
• Solution 2: Throw more hardware at the problem

Detecting and Resolving Structural Hazard
[Figure: pipeline diagram over Cycle 1 to Cycle 7 for Load, Instr 1, Instr 2, a Stall row of bubbles, and Instr 3.]

Eliminating Structural Hazards at Design Time
[Figure: pipelined datapath with a separate Instr Cache and Data Cache.]

Role of Instruction Set Design in Structural Hazard Resolution
• Simple to determine the sequence of resources used by an instruction
  - opcode tells it all
• Uniformity in the resource usage
• Compare MIPS to IA32?
• MIPS approach => all instructions flow through the same 5-stage pipeline

Data Hazards
[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) over time in clock cycles for the sequence below, in which every later instruction reads r1.]

  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11

Three Generic Data Hazards

Read After Write (RAW)
Instr J tries to read an operand before Instr I writes it

  I: add r1,r2,r3
  J: sub r4,r1,r3

• Caused by a “Data Dependence”. This hazard results from an actual need for communication.

Three Generic Data Hazards

Write After Read (WAR)
Instr J writes an operand before Instr I reads it

  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Three Generic Data Hazards

Write After Write (WAW)
Instr J writes an operand before Instr I writes it

  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7

• Called an “output dependence” by compiler writers. This also results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
• Will see WAR and WAW in later, more complicated pipes

Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over time in clock cycles for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11, with results forwarded to the dependent instructions.]

HW Change for Forwarding
[Figure: datapath fragment showing NextPC, Registers, Immediate, the ID/EX, EX/MEM and MEM/WB pipeline registers, the ALU, Data Memory, and the added muxes.]

Data Hazard Even with Forwarding
[Figure: pipeline diagram over time in clock cycles for the sequence below.]

  lw  r1,0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9

Resolving this load hazard
• Adding hardware? ... not
• Detection?
• Compilation techniques?
• What is the cost of load delays?

Resolving the Load Data Hazard
[Figure: pipeline diagram over time in clock cycles for lw r1,0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9, with bubbles inserted after the load.]
• How is this different from the instruction issue stall?

Software Scheduling to Avoid Load Hazards

Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.

  Slow code:            Fast code:
    LW  Rb,b              LW  Rb,b
    LW  Rc,c              LW  Rc,c
    ADD Ra,Rb,Rc          LW  Re,e
    SW  a,Ra              ADD Ra,Rb,Rc
    LW  Re,e              LW  Rf,f
    LW  Rf,f              SW  a,Ra
    SUB Rd,Re,Rf          SUB Rd,Re,Rf
    SW  d,Rd              SW  d,Rd

Instruction Set Connection

What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  - bad, CPI is not part of the ISA
• k instruction slot delay
  - a load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler

Eliminating Control Hazards at Design Time
[Figure: pipelined datapath with Instr Cache and Data Cache, as on the structural-hazard slide.]

Example: Branch Stall Impact
• If 30% of instructions are branches, a 3-cycle stall is significant
• Two-part solution:
  - Determine whether the branch is taken or not sooner, AND
  - Compute the taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  - Move the Zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for branch versus 3

Pipelined MIPS Datapath
[Figure: pipelined datapath with two adders; stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back.]
[Figure label: EXTRA HARDWARE]
• Data stationary control: local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Squash” instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% MIPS branches not taken on average
  - PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken
  - 53% MIPS branches taken on average
  - But haven’t calculated branch target address in MIPS
    · MIPS still incurs 1 cycle branch penalty
    · Other machines: branch target known before outcome

Four Branch Hazard Alternatives

#4: Delayed Branch
  - Define branch to take place AFTER a following instruction

      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n     (successors 1 to n: branch delay of length n)
      ........
      branch target if taken

  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - MIPS uses this

Delayed Branch

Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled

Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots usefully filled

Delayed Branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), the branch delay grows beyond what a single delay slot can cover

Recall: Speed Up Equation for Pipelining

  \[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Average stall cycles per instruction} \]

  \[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

For simple RISC pipeline, CPI = 1:

  \[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}} \]

Example: Evaluating Branch Alternatives

  \[ \text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}} \]

Assume: conditional and unconditional branches = 14% of instructions; 65% of them change the PC

  Scheduling scheme    Branch penalty   CPI    Speedup v. stall
  Stall pipeline       3                1.42   1.0
  Predict taken        1                1.14   1.26
  Predict not taken    1                1.09   1.29
  Delayed branch       0.5              1.07   1.31

Summary

Hazards
• Data Hazards & Control Hazards

How to remove hazards?
• Data Hazards:
  - Forwarding
  - Change program order
• Control Hazards:
  - Speculate branch outcome
  - Delay slots
  - Use extra hardware
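The branch-alternatives table can be recomputed from the speedup equation for a simple RISC pipeline (ideal CPI = 1). The sketch below is illustrative, not part of the original slides: it assumes a pipeline depth of 5 and equal pipelined and unpipelined cycle times, takes the 14% branch frequency and 65% taken rate from the example, and charges the predict-not-taken penalty only to branches that change the PC. All names are hypothetical.

```python
BRANCH_FREQ = 0.14      # conditional + unconditional branches (from the example)
TAKEN_FRACTION = 0.65   # fraction of branches that change the PC (from the example)
DEPTH = 5               # assumed pipeline depth


def cpi(stall_cpi):
    """CPI_pipelined = ideal CPI (1) + average stall cycles per instruction."""
    return 1.0 + stall_cpi


def speedup(depth, stall_cpi, cycle_ratio=1.0):
    """Speedup = depth / (1 + stall CPI) * (unpipelined cycle / pipelined cycle)."""
    return depth / (1.0 + stall_cpi) * cycle_ratio


# Average stall cycles per instruction = branch frequency x effective penalty.
schemes = {
    "Stall pipeline": BRANCH_FREQ * 3,
    "Predict taken": BRANCH_FREQ * 1,
    "Predict not taken": BRANCH_FREQ * TAKEN_FRACTION * 1,
    "Delayed branch": BRANCH_FREQ * 0.5,
}

baseline = speedup(DEPTH, schemes["Stall pipeline"])
for name, stall in schemes.items():
    s = speedup(DEPTH, stall)
    print(f"{name:<18} CPI={cpi(stall):.2f} speedup={s:.2f} vs stall={s / baseline:.2f}")
```

The CPI column agrees with the table (1.42, 1.14, 1.09, 1.07). Computed directly, the last column comes out as 1.25, 1.30 and 1.33 rather than 1.26, 1.29 and 1.31; the small differences are consistent with the table having been derived from rounded intermediate speedups, though that is an inference rather than something the slides state.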