CS 152 Computer Architecture and Engineering
Lecture 6 – Superpipelining + Branch Prediction
2014-2-6
John Lazzaro (not a prof - "John" is always OK)
TA: Eric Love
www-inst.eecs.berkeley.edu/~cs152/

Today: First advanced processor lecture
  Super-pipelining: Beyond 5 stages.
  Short break.
  Branch prediction: Can we escape control hazards in long CPU pipelines?

[Figure: From Appendix C - filling the branch delay slot.]

Superpipelining

5-stage pipeline: A point of departure
  Seconds/Program = (Instructions/Program) × (Cycles/Instruction) × (Seconds/Cycle)
  At best, the 5-stage pipeline executes one instruction per clock, with a clock period determined by the slowest stage.
  The processor has no "multi-cycle" instructions (example: multiply with an accumulate register).

Superpipelining: Add more stages
  Goal: Reduce the critical path by adding more pipeline stages.
  Example: 8-stage ARM XScale, with extra IF, ID, and data-cache stages.
  Difficulties: Added penalties for load delays and branch misses.
  Ultimate limiter: as logic delay goes to 0, FF clock-to-Q and setup times dominate the period. Also, power!

[Figure: 5-stage vs. 8-stage pipeline diagrams (IF, ID+RF, EX, MEM, WB, with IR pipeline registers between stages). Note: some stages now overlap, and some instructions take extra stages.]

Superpipelining techniques ...
  Split ALU and decode logic over several pipeline stages.
  Pipeline memory: use more banks of smaller arrays; add pipeline stages between decoders and muxes.
  Remove "rarely-used" forwarding networks that are on the critical path. Creates stalls, affects CPI.
  Pipeline the wires of frequently used forwarding networks.
  Also: clocking tricks (example: use both positive-edge and negative-edge triggered flip-flops).

Recall: IBM POWER timing closure
  "Pipeline engineering" happens here ... about 1/3 of the project schedule.
  From "The circuit and physical design of the POWER4 microprocessor", IBM J. Res. and Dev., 46:1, Jan 2002, J. D. Warnock et al.

Pipelining a 256-byte instruction memory
  Fully combinational (and slow). Only read behavior shown. Can we add two pipeline stages? (A sketch of one way to do this appears at the end of this page.)
  [Figure: A7-A0 form the 8-bit read address. The upper 3 bits (A7-A5) drive a decoder that enables the tri-state Q outputs of one of eight 32-byte (256-bit) registers; the next 3 bits (A4-A2) drive output muxes that select a 32-bit word (D0-D31, i.e. 4 bytes) from the 256-bit row.]

On a chip: "registers" become SRAM cells
  Architects specify the number of rows and columns. Word and bit lines slow down as the array grows larger!
  [Figure: SRAM array with write drivers, parallel data I/O lines, and output muxes that select a subset of the bits.]
  How could we pipeline this memory? See the previous slide.

IC processes are optimized for small SRAM cells
  [Figure: die photo - RISC CPU logic, 0.65 million devices; cache memories, 5.85 million devices.]
  From a Marvell ARM CPU paper: 90% of the 6.5 million transistors, and 60% of the chip area, is devoted to cache memories.
  Implication? SRAM is 6X as dense as logic.

RAM compilers
  On average, 30% of a modern logic chip is SRAM, which is generated by RAM compilers.
  Compile-time parameters set the number of bits, aspect ratio, ports, etc.
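As a follow-up to the "Pipelining a 256-byte instruction memory" slide above, here is a minimal C model of one way the read path could be split into pipeline stages. It is an illustrative sketch, not the lecture's design: the memory array, the PipeReg structure, and the exact stage split are assumptions. Stage 1 decodes the upper address bits and latches the selected 256-bit row; stage 2 muxes the 32-bit word out of the latched row.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Two-stage read of a 256-byte instruction memory (illustrative sketch).
     * Stage 1: decode A7-A5, latch the selected 32-byte (256-bit) row.
     * Stage 2: use A4-A2 to mux a 32-bit word (D0-D31) out of the latched row. */

    static uint8_t mem[256];          /* eight 32-byte "registers"                    */

    typedef struct {
        uint8_t row[32];              /* pipeline register holding one 256-bit row    */
        uint8_t word_sel;             /* latched A4-A2                                */
    } PipeReg;

    /* Stage 1: runs in cycle N for address A7-A0. */
    static void stage1_fetch_row(uint8_t addr, PipeReg *p)
    {
        uint8_t row_sel = addr >> 5;                 /* A7-A5: one of 8 rows          */
        memcpy(p->row, &mem[row_sel * 32u], 32);     /* latch the whole row           */
        p->word_sel = (addr >> 2) & 0x7;             /* A4-A2: one of 8 words in row  */
    }

    /* Stage 2: runs in cycle N+1, producing the 32-bit output word. */
    static uint32_t stage2_select_word(const PipeReg *p)
    {
        uint32_t d;
        memcpy(&d, &p->row[p->word_sel * 4u], 4);    /* 4-byte output word            */
        return d;
    }

    int main(void)
    {
        PipeReg pr = {0};
        for (int i = 0; i < 256; i++) mem[i] = (uint8_t)i;    /* test pattern         */
        stage1_fetch_row(0x44, &pr);                          /* cycle N              */
        printf("word at 0x44: 0x%08x\n",                      /* cycle N+1            */
               (unsigned)stage2_select_word(&pr));
        return 0;
    }

The payoff of the split is that the row decode/read and the word mux each see a shorter critical path, at the cost of one extra cycle of fetch latency.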
ALU: Pipelining unsigned multiply
  [Figure: a worked binary long-multiplication example (involving 1011), with facts to remember.]

Building block: full-adder variant
  1-bit signals: x, y, z, s, Cin, Cout.
    x: one bit of the multiplicand
    y: one bit of the "running sum"
    z: one bit of the multiplier
  If z = 1, {Cout, s} <= x + y + Cin
  If z = 0, {Cout, s} <= y + Cin
  [Figure: Verilog for this "2-bit entity".]
  (A C model of this cell appears at the end of this page.)

Put it together: an array of these cells computes P = A x B
  To pipeline the array: place registers between adder stages (shown in green), and add registers to delay the selected A and B bits. (The figure shows the fully combinational implementation.)

Adding pipeline stages is not enough ...
  MIPS R4000: simple 8-stage pipeline.
  Branch stalls are the main reason why pipeline CPI > 1.
  2-cycle load delay, 3-cycle branch delay. (Appendix C, Figure C.52)

Branch Prediction

Add pipeline stages, reduce clock period
  Q. Could adding pipeline stages hurt the CPI for an application? (ARM XScale: 8 stages)
  A. Yes, due to these problems:
    CPI problem                            Possible solution
    Taken branches cause longer stalls     Branch prediction, loop unrolling
    Cache misses take more clock cycles    Larger caches; add prefetch opcodes to the ISA
  (A back-of-the-envelope CPI example appears at the end of this page.)

Recall: Control hazards ...
  We avoid stalling by (1) adding a branch delay slot, and (2) adding a comparator to the ID stage.
  If we add more early stages, we must stall.
  [Figure: 5-stage pipeline (IF, ID, EX, MEM, WB) and the timing of a sample program, for an ISA without a branch delay slot:
    I1: BEQ R4,R3,25
    I2: AND R6,R5,R4
    I3: SUB R1,R9,R8
  The EX stage computes whether the branch is taken. If the branch is taken, the instructions behind it MUST NOT complete!]

Solution: Branch prediction ...
  [Figure: a branch predictor sits beside the I-cache in the IF stage and produces predictions: is this a control instruction? Taken or not taken? If taken, where to - what PC?]
  We update the PC based on the outputs of the branch predictor. If the predictor is perfect, the pipe stays full!
  Dynamic predictors: a cache of branch history.
  [Figure: pipeline timing diagram. The EX stage computes whether the branch is actually taken. If we predicted incorrectly, the instructions fetched down the wrong path MUST NOT complete!]

Branch predictors cache branch history
  [Figure: the address of the branch instruction (0b0110[...]01001000, 30 bits) is compared against 30-bit address tags in a 4096-entry Branch Target Buffer (BTB). On a hit, the BTB supplies the "taken" target address (PC + 4 + Loop for the branch instruction BNEZ R1, Loop), and a Branch History Table (BHT) supplies 2 state bits that predict "taken" or "not taken".]
  At the EX stage, update the BTB/BHT and kill instructions, if necessary.
  Drawn as fully associative to focus on the essentials; real designs are almost always direct-mapped.
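The full-adder variant slide above promised a C model of the cell. Here is one: an illustrative stand-in for the slide's Verilog (the function name mul_cell is my own), with z gating the multiplicand bit x into the sum of y and Cin.

    #include <stdio.h>

    /* Full-adder variant used as the multiply array's building block.
     * Signal names follow the slide:
     *   x - one bit of the multiplicand     z   - one bit of the multiplier
     *   y - one bit of the "running sum"    cin - carry in, cout - carry out
     * If z == 1: {cout, s} = x + y + cin;  if z == 0: {cout, s} = y + cin.   */
    static void mul_cell(int x, int y, int z, int cin, int *s, int *cout)
    {
        int sum = (z ? x : 0) + y + cin;   /* z gates x into the addition */
        *s    = sum & 1;
        *cout = (sum >> 1) & 1;
    }

    int main(void)
    {
        int s, cout;
        mul_cell(1, 1, 1, 1, &s, &cout);   /* 1 + 1 + 1 = 3 -> s = 1, cout = 1 */
        printf("s = %d, cout = %d\n", s, cout);
        return 0;
    }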
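And here is the back-of-the-envelope CPI example promised above. Every number (branch frequency, residual stall rate, clock periods) is an illustrative assumption, not a measurement from the lecture; the point is only the shape of the trade-off: effective CPI = 1 + branch frequency × stall rate × branch penalty, and time per instruction = CPI × clock period.

    #include <stdio.h>

    /* Back-of-the-envelope: does a deeper pipeline pay off?
     *   effective CPI        = 1 + branch_freq * stall_rate * branch_penalty
     *   time per instruction = CPI * clock period
     * Every number below is an illustrative assumption.                      */
    int main(void)
    {
        double branch_freq = 0.20;                 /* 1 in 5 instructions branches */

        /* 5-stage pipeline: assume a 1-cycle branch penalty and a 1.25 ns clock. */
        double cpi5 = 1.0 + branch_freq * 1.0 * 1.0;
        double tpi5 = cpi5 * 1.25;

        /* Deeper 8-stage pipeline: 3-cycle branch penalty (as on the R4000), but
         * assume prediction leaves only 20% of branches paying it, and the clock
         * period drops to 0.80 ns.                                               */
        double cpi8 = 1.0 + branch_freq * 0.20 * 3.0;
        double tpi8 = cpi8 * 0.80;

        printf("5-stage: CPI = %.2f, time/instruction = %.2f ns\n", cpi5, tpi5);
        printf("8-stage: CPI = %.2f, time/instruction = %.2f ns\n", cpi8, tpi8);
        return 0;
    }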
Branch predictor: direct-mapped version
  As in real life ... direct-mapped ...
  [Figure: the address of the BNEZ instruction (0b011[..]010[..]100) is split into an 18-bit tag and a 12-bit index into 4096 BTB/BHT entries. On a tag match ("hit"), the Branch Target Buffer (BTB) supplies the "taken" target address (PC + 4 + Loop for BNEZ R1, Loop), and the Branch History Table (BHT) supplies the "taken" or "not taken" prediction.]
  Update the BHT/BTB for next time, once the true behavior is known.
  Must check the prediction, and kill the instruction if needed.

Simple ("2-bit") Branch History Table entry
  Bit 1: the prediction for the next branch (1 = take, 0 = not take). Initialize to 0.
  Bit 2: was the last prediction correct? (1 = yes, 0 = no). Initialize to 1.
  After we "check" a prediction ...
    Prediction bit: flip it if the prediction was not correct and the "last predict correct" bit is 0.
    "Last predict correct" bit: set to 1 if the prediction bit was correct; set to 0 if the prediction bit was incorrect; set to 1 if the prediction bit flips.
  We do not change the prediction the first time it is incorrect. Why? Consider this loop:
    ADDI R4,R0,11
    loop: SUBI R4,R4,#1
    BNE  R4,R0,loop
  The branch is taken 10 times, then not taken once (end of loop). The next time we enter the loop, we would like to predict "take" the first time through.
  (A C model of this update rule appears at the end of this page.)

  [Figure C.19: a 4096-entry 2-bit predictor is "80-90% accurate".]

Branch prediction: trust, but verify ...
  [Figure: pipeline datapath. In the fetch stage, the branch predictor and BTB sit beside the I-cache and produce the predicted PC and the predictions (a branch instruction? taken or not taken? if taken, where to - what PC?). The decode stage notes the instruction type and branch target and passes the prediction info to the next stage. The execute stage checks all predictions and takes action if needed (kill instructions, update the predictor).]

  [Figure 3.22: flowchart control for dynamic branch prediction.]

Spatial predictors
  [Figure: a C code snippet and its compiled code, containing branches b1, b2, and b3 (a BEQZ).]
  We want to predict branch b3. Can b1 and b2 help us predict it?
  Idea: devote hardware to four 2-bit predictors for the BEQZ branch:
    P1: use if b1 and b2 were not taken.
    P2: use if b1 was taken, b2 not taken.
    P3: use if b1 was not taken, b2 taken.
    P4: use if b1 and b2 were taken.
  Track the current taken/not-taken status of b1 and b2, and use it to choose from P1 ... P4 for the BEQZ ... How?

Branch History Register: tracks global history
  We choose which predictor to use (and update) based on the Branch History Register: a shift register that holds the taken/not-taken status of the last 2 branches.
  [Figure: the fetch-stage datapath from the previous slide, with the Branch History Register (two flip-flops with write enables) feeding the branch predictor and BTB.]

Spatial branch predictor (BTB and tag not shown)
  [Figure: the PC of the BEQZ R3, L3 instruction (0b0110[...]01001000) maps to an index into four Branch History Tables, P1-P4, each entry holding 2 state bits. The Branch History Register - the recorded outcomes of the (aa==2) branch and the (bb==2) branch - controls a mux that chooses which predictor supplies the "taken" or "not taken" prediction for the (aa != bb) branch.]
  (A C sketch of this scheme appears at the end of this page.)
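Here is the C model of the 2-bit BHT entry promised above. It is my reading of the slide's update rule (the field names and the warmed-up initial state in main are assumptions): one prediction bit plus one "was the last prediction correct" bit, so the prediction flips only after two misses in a row.

    #include <stdio.h>

    /* One 2-bit BHT entry, following the slide's description (labels are mine):
     *   pred    - prediction for next branch (1 = take, 0 = not take); slide initializes to 0
     *   correct - was the last prediction correct? (1 = yes, 0 = no); slide initializes to 1 */
    typedef struct { int pred; int correct; } BHTEntry;

    static int bht_predict(const BHTEntry *e) { return e->pred; }

    /* Update the entry once the branch's true outcome is known. */
    static void bht_update(BHTEntry *e, int taken)
    {
        if (e->pred == taken) {
            e->correct = 1;            /* prediction was right                    */
        } else if (e->correct == 1) {
            e->correct = 0;            /* first miss: keep the prediction         */
        } else {
            e->pred = !e->pred;        /* second miss in a row: flip the bit ...  */
            e->correct = 1;            /* ... and set "last predict correct" to 1 */
        }
    }

    int main(void)
    {
        /* Assume the entry has already warmed up to "take" inside the loop. */
        BHTEntry e = { .pred = 1, .correct = 1 };
        int outcomes[] = { 1,1,1,1,1,1,1,1,1,1,   /* loop branch taken 10 times   */
                           0,                     /* not taken once (loop exit)   */
                           1 };                   /* taken again on loop re-entry */
        for (size_t i = 0; i < sizeof outcomes / sizeof outcomes[0]; i++) {
            int p = bht_predict(&e);
            printf("predict %d, actual %d: %s\n", p, outcomes[i],
                   p == outcomes[i] ? "hit" : "MISS");
            bht_update(&e, outcomes[i]);
        }
        return 0;
    }

Run against the loop from the slide, this entry mispredicts only the final not-taken iteration; when the loop is re-entered, it still predicts "take".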
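And here is the C sketch of the spatial predictor promised above. The structure follows the slide (a 2-bit global Branch History Register selects one of four history tables indexed by PC bits), but the table size, the use of saturating 2-bit counters instead of the slide's prediction/hysteresis bit pair, and the function names are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of a spatial ("correlating") predictor:
     *   ghr           - 2-bit global Branch History Register (last two branch outcomes)
     *   bht[ghr][idx] - four tables of 2-bit counters, indexed by low PC bits
     * Table size and the saturating-counter encoding are illustrative assumptions. */
    #define ENTRIES 1024

    static uint8_t bht[4][ENTRIES];   /* 2-bit counters: 0,1 predict not taken; 2,3 taken */
    static uint8_t ghr;               /* global history; low 2 bits used                  */

    static unsigned index_of(uint32_t pc) { return (pc >> 2) % ENTRIES; }

    /* Predict: the history register chooses which of the four tables to consult. */
    static int predict_taken(uint32_t pc)
    {
        return bht[ghr & 3][index_of(pc)] >= 2;
    }

    /* Update: train the table that made the prediction, then shift in the outcome. */
    static void update(uint32_t pc, int taken)
    {
        uint8_t *ctr = &bht[ghr & 3][index_of(pc)];
        if (taken  && *ctr < 3) (*ctr)++;
        if (!taken && *ctr > 0) (*ctr)--;
        ghr = (uint8_t)(((ghr << 1) | (taken ? 1 : 0)) & 3);
    }

    int main(void)
    {
        uint32_t pc = 0x1000;                      /* assumed PC of the branch      */
        for (int i = 0; i < 8; i++) update(pc, 1); /* train: branch behaves "taken" */
        printf("prediction: %s\n", predict_taken(pc) ? "taken" : "not taken");
        return 0;
    }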
Performance: 4096 entries vs. 1024 entries?
  [Figure 3.3: prediction accuracy of one BHT with 4096 entries vs. the spatial predictor with 4 BHTs of 1024 entries each - a fair comparison, since it matches the total number of bits.]

For more details on branch prediction:
  Predict function returns by stacking call info. [Figure 3.24]

Hardware limits to superpipelining?
  [Figure: FO4 delays per CPU clock period, 1985-2005. MIPS R2000: 5 stages. Pentium Pro: 10 stages. Pentium 4: 20 stages. Historical limit: about 12 FO4s. Power wall: the Intel Core Duo has 14 stages.]
  FO4: how many fanout-of-4 inverter delays fit in the clock period.
  Thanks to Francois Labonte, Stanford.

On Tuesday
  We turn our focus to memory system design ...
  Have a good weekend!