stalling

Stalling
• The easiest solution is to stall the pipeline
• We could delay the AND instruction by introducing a one-cycle delay into the pipeline, sometimes called a bubble
[Pipeline diagram: lw $2, 20($3) followed by and $12, $2, $5, with a one-cycle bubble; clock cycles 1 to 7]
• Notice that we’re still using forwarding in cycle 5, to get data from the MEM/WB pipeline register to the ALU

Stalling and forwarding
• Without forwarding, we’d have to stall for two cycles to wait for the LW instruction’s writeback stage
[Pipeline diagram: the same two instructions with a two-cycle stall; clock cycles 1 to 8]
• In general, you can always stall to avoid hazards, but dependencies are very common in real code, and frequent stalls can reduce performance significantly

Load-Use Hazard Detection
• Check when the using instruction is decoded in the ID stage
• ALU operand register numbers in the ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
  - ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert a bubble

How to Stall the Pipeline
• Force control values in the ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
• Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
  - Can subsequently forward to EX stage

Stalling delays the entire pipeline
• If we delay the second instruction, we’ll have to delay the third one too
  - This is necessary to make forwarding work between AND and OR
  - It also prevents problems such as two instructions trying to write to the same register in the same cycle
[Pipeline diagram: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2, with the AND and the OR both delayed; clock cycles 1 to 8]

What about EX, MEM, WB
• But what about the ALU during cycle 4, the data memory in cycle 5, and the register file write in cycle 6?
[Pipeline diagram: lw $2, 20($3); and $12, $2, $5; or $13, $12, $2, with the stall; clock cycles 1 to 8]
• Those units aren’t used in those cycles because of the stall, so we can set the EX, MEM and WB control signals to all 0s.

Detecting Stalls, cont.
[Pipeline diagram: lw $2, 20($3) followed by and $12, $2, $5, with the IF/ID, ID/EX, EX/MEM and MEM/WB pipeline registers marked]
• When should stalls be detected?
  - In the EX stage of the instruction causing the stall, that is, while the load is in EX and the instruction that uses its result is in ID
• What is the stall condition?
  - if (ID/EX.MemRead = 1 and (ID/EX.rt = IF/ID.rs or ID/EX.rt = IF/ID.rt)) then stall

Adding hazard detection to the CPU
[Datapath diagram. Recoverable labels: Hazard Unit with inputs ID/EX.MemRead, ID/EX.RegisterRt and the IF/ID Rs and Rt fields, and outputs PC Write and IF/ID Write; Forwarding Unit with inputs EX/MEM.RegisterRd and MEM/WB.RegisterRd]

Stalls and Performance
• Stalls reduce performance
  - But are required to get correct results
• Compiler can arrange code to avoid hazards and stalls
  - Requires knowledge of the pipeline structure

Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• Example: C code for A = B + E; C = B + F;

Original order (13 cycles):

            lw   $t1, 0($t0)
            lw   $t2, 4($t0)
    stall   add  $t3, $t1, $t2
            sw   $t3, 12($t0)
            lw   $t4, 8($t0)
    stall   add  $t5, $t1, $t4
            sw   $t5, 16($t0)

Reordered (11 cycles):

            lw   $t1, 0($t0)
            lw   $t2, 4($t0)
            lw   $t4, 8($t0)
            add  $t3, $t1, $t2
            sw   $t3, 12($t0)
            add  $t5, $t1, $t4
            sw   $t5, 16($t0)

Branches in the original pipelined datapath
• When are they resolved?
[Figure: the original pipelined datapath, including the PCSrc mux]
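Returning to the load-use stall condition and the 13-versus-11 cycle counts in the code scheduling example, the following is a minimal sketch for checking them by hand. It assumes a 5-stage single-issue pipeline with full forwarding, no branches, and load-use hazards as the only source of stalls. The `Instr` record, the helper names and the `srcs` field (standing in for the IF/ID Rs and Rt fields) are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Instr(NamedTuple):
    op: str                  # mnemonic, e.g. "lw", "sw", "add"
    dest: Optional[str]      # register written (None for sw)
    srcs: Tuple[str, ...]    # registers read


def load_use_hazard(in_ex: Instr, in_id: Instr) -> bool:
    """Simplified form of the stall condition from the slides.

    in_ex.op == "lw" plays the role of ID/EX.MemRead, in_ex.dest is
    ID/EX.RegisterRt, and in_id.srcs stands in for IF/ID.RegisterRs/Rt.
    """
    return in_ex.op == "lw" and in_ex.dest in in_id.srcs


def total_cycles(program: Sequence[Instr], stages: int = 5) -> int:
    """Pipeline fill cycles + one cycle per instruction + one per stall."""
    stalls = sum(load_use_hazard(a, b) for a, b in zip(program, program[1:]))
    return (stages - 1) + len(program) + stalls


# Hypothetical constructors for the three instruction kinds in the example.
def lw(rt: str, base: str) -> Instr:
    return Instr("lw", rt, (base,))


def sw(rt: str, base: str) -> Instr:
    return Instr("sw", None, (rt, base))


def add(rd: str, rs: str, rt: str) -> Instr:
    return Instr("add", rd, (rs, rt))


original = [
    lw("$t1", "$t0"), lw("$t2", "$t0"), add("$t3", "$t1", "$t2"),
    sw("$t3", "$t0"), lw("$t4", "$t0"), add("$t5", "$t1", "$t4"),
    sw("$t5", "$t0"),
]
scheduled = [
    lw("$t1", "$t0"), lw("$t2", "$t0"), lw("$t4", "$t0"),
    add("$t3", "$t1", "$t2"), sw("$t3", "$t0"),
    add("$t5", "$t1", "$t4"), sw("$t5", "$t0"),
]

print(total_cycles(original))   # 13
print(total_cycles(scheduled))  # 11
```

Checking only adjacent pairs is enough under these assumptions: after a one-cycle stall the load has moved on to MEM, so its result can be forwarded to the dependent instruction.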
Branch Hazards
• If branch outcome determined in MEM:
  - Flush these instructions (set control values to 0)
[Figure: pipeline diagram indicating the instructions to flush]

Reducing Branch Delay
• Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
• Example: branch taken

    36: sub $10, $4, $8
    40: beq $1, $3, 7
    44: and $12, $2, $5
    48: or  $13, $2, $6
    52: add $14, $4, $2
    56: slt $15, $6, $7
        ...
    72: lw  $4, 50($7)

Example: Branch Taken
[Figures: two slides with this title; diagrams not recoverable]

Data Hazards for Branches
• If a comparison register is a destination of 2nd or 3rd preceding ALU instruction
  - Can resolve using forwarding

    add $1, $2, $3        IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    …                             IF  ID  EX  MEM WB
    beq $1, $4, target                IF  ID  EX  MEM WB

Data Hazards for Branches
• If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

    lw  $1, addr          IF  ID  EX  MEM WB
    add $4, $5, $6            IF  ID  EX  MEM WB
    beq stalled                   IF  ID
    beq $1, $4, target                ID  EX  MEM WB

Data Hazards for Branches
• If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

    lw  $1, addr          IF  ID  EX  MEM WB
    beq stalled               IF  ID
    beq stalled                   ID
    beq $1, $0, target                ID  EX  MEM WB

Branch Prediction
• Longer pipelines can’t readily determine branch outcome early
• Stall penalty becomes unacceptable
• Predict (i.e., guess) outcome of branch
• Only stall if prediction is wrong
• Simplest prediction strategy
• Predict branches not taken
• Works well for loops if the loop tests are done at the start
• Fetch instruction after branch, with no delay

Dynamic Branch Prediction
• In deeper and superscalar pipelines, branch penalty is more significant
• Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
• To execute a branch
  - Check table, expect the same outcome
  - Start fetching from fall-through or target
  - If wrong, flush pipeline and flip prediction

1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!

    outer: …
           …
    inner: …
           …
           beq …, …, inner
           …
           beq …, …, outer

• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop next time around

2-Bit Predictor
• Only change prediction on two successive mispredictions

Calculating the Branch Target
• Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
• Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
  - If hit and instruction is branch predicted taken, can fetch target immediately

Concluding Remarks
• ISA influences design of datapath and control
• Datapath and control influence design of ISA
• Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
• Hazards: structural, data, control
• Main additions in hardware:
  - forwarding unit
  - hazard detection and stalling
  - branch predictor
  - branch target table
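The difference between the 1-bit and 2-bit predictors described earlier can be illustrated with a small simulation of the inner-loop branch. This is a sketch under stated assumptions, not the hardware's actual implementation: the 2-bit predictor is modelled as a saturating counter (one common way to get "change prediction only after two successive mispredictions"), both predictors start out predicting taken, and the function names and loop sizes are hypothetical.

```python
from typing import List


def inner_loop_branch(inner_iters: int, outer_iters: int) -> List[bool]:
    """Outcomes of the inner-loop branch: taken on every iteration except
    the last, repeated once per outer-loop iteration."""
    return ([True] * (inner_iters - 1) + [False]) * outer_iters


def run_one_bit(outcomes: List[bool], state: bool = True) -> int:
    """1-bit predictor: predict whatever the branch did last time.
    Returns the number of mispredictions."""
    misses = 0
    for taken in outcomes:
        if state != taken:
            misses += 1
        state = taken  # flip the prediction on every misprediction
    return misses


def run_two_bit(outcomes: List[bool], counter: int = 3) -> int:
    """2-bit saturating counter in 0..3 (assumed model); predict taken
    when counter >= 2. Returns the number of mispredictions."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


outcomes = inner_loop_branch(inner_iters=4, outer_iters=3)
print(run_one_bit(outcomes))  # 5
print(run_two_bit(outcomes))  # 3
```

With these settings the 1-bit predictor mispredicts at each loop exit and again on the next entry to the inner loop, while the 2-bit counter mispredicts only at each exit.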
