ECE/CS 552: Introduction to Superscalar Processors
Instructor: Mikko H. Lipasti
Fall 2010, University of Wisconsin-Madison
Lecture notes partially based on notes by John P. Shen
© Shen, Lipasti

Limitations of Scalar Pipelines
Scalar upper bound on throughput
 – IPC <= 1 or CPI >= 1
Inefficient unified pipeline
 – Long latency for each instruction
Rigid pipeline stall policy
 – One stalled instruction stalls all newer instructions

Parallel Pipelines
[Diagram: (a) no parallelism, (b) temporal parallelism (pipelining), (c) spatial parallelism (replicated hardware), (d) parallel pipeline combining both]

Intel Pentium Parallel Pipeline
[Diagram: a two-wide pipeline; the stages IF, D1, D2, EX, WB are replicated across the U-pipe and V-pipe]

Diversified Pipelines
[Diagram: common IF, ID, RD stages feed specialized execution pipes (ALU; MEM1-MEM2; FP1-FP3; BR) that merge again at WB]

Power4 Diversified Pipelines
[Diagram: the I-cache and fetch queue feed branch scan/predict and decode; separate issue queues drive two FP units, two FX units, two LD units, a CR unit, and a BR unit, backed by a reorder buffer, store queue (StQ), and D-cache]

Rigid Pipeline Stall Policy
[Diagram: bypassing of a stalled instruction is not allowed, so the stall propagates backward to all newer instructions]

Dynamic Pipelines
[Diagram: in-order IF, ID, RD stages feed a dispatch buffer; execution units (ALU; MEM1-MEM2; FP1-FP3; BR) run out of order; a reorder buffer restores program order before the in-order WB stage]

Interstage Buffers
[Diagram: (a) a single-entry buffer between stage i and stage i+1; (b) n in-order buffers between n parallel copies of the stages; (c) a shared buffer of >= n entries that lets instructions enter stage i+1 out of order]

Superscalar Pipeline Stages
In program order: Fetch → Instruction Buffer → Decode → Dispatch Buffer → Dispatch
Out of order: Issuing Buffer → Execute → Completion Buffer
In program order: Complete → Store Buffer → Retire

Limitations of Scalar Pipelines
Scalar upper bound on throughput
 – IPC <= 1 or CPI >= 1
 – Solution: wide (superscalar) pipeline
Inefficient unified pipeline
 – Long latency for
each instruction
 – Solution: diversified, specialized pipelines
Rigid pipeline stall policy
 – One stalled instruction stalls all newer instructions
 – Solution: out-of-order execution, distributed execution pipelines

Impediments to High IPC
[Diagram: FETCH (I-cache, branch predictor, instruction buffer) governs instruction flow; DECODE feeds integer, floating-point, media, and memory pipelines; EXECUTE is limited by register data flow and memory data flow; COMMIT drains the reorder buffer (ROB) and store queue to the D-cache]

Superscalar Pipeline Design
Instruction fetching issues
Instruction decoding issues
Instruction dispatching issues
Instruction execution issues
Instruction completion & retiring issues

Instruction Fetch
Objective: fetch multiple instructions per cycle
Challenges:
 – Branches: control dependences
 – Branch target misalignment
 – Instruction cache misses
Solutions:
 – Alignment hardware
 – Prediction/speculation
[Diagram: the PC indexes instruction memory; 3 instructions fetched per cycle]

Fetch Alignment
[Diagram]

Branches – MIPS
6 types of branches:
 – Jump (uncond, no save PC, imm)
 – Jump and link (uncond, save PC, imm)
 – Jump register (uncond, no save PC, register)
 – Jump and link register (uncond, save PC, register)
 – Branch (conditional, no save PC, PC+imm)
 – Branch and link (conditional, save PC, PC+imm)

Disruption of Sequential Control Flow
[Diagram: Fetch → Decode → Dispatch → Issue → Branch Execute → Finish → Complete → Retire, with the instruction/decode, dispatch, reorder/completion, and store buffers between stages; a branch anywhere in this pipe disrupts the sequential fetch stream]

Branch Prediction
Target address generation → target speculation
 – Access register: PC, general-purpose register, link register
 – Perform calculation: +/- offset, autoincrement
Condition resolution → condition speculation
 – Access register: condition-code register, general-purpose register
 – Perform calculation: comparison of data register(s)

Target Address Generation
 – PC-relative
 – Register indirect
 – Register indirect with offset
[Diagram: the pipeline (Fetch through Retire), annotated with the stage at which each kind of target address becomes available]

Condition Resolution
[Diagram: the same pipeline, annotated with where the condition resolves: via the condition-code (CC) register or a general-purpose (GP) register value comparison]

Branch Instruction Speculation
[Diagram: a branch predictor (using a BTB) watches the fetch stream; an FA-mux chooses between the sequential PC and the speculative target as the next fetch address (FA); resolved branches update the BTB with the target address and history]

Static Branch Prediction
Single-direction
 – Always not-taken: Intel i486
Backwards Taken / Forwards Not-Taken
 – Loop-closing branches have negative offsets
 – Used as backup in Pentium Pro, II, III, 4

Static Branch Prediction
Profile-based:
1. Instrument program binary
2. Run with representative (?) input set
3. Recompile program
 a. Annotate branches with hint bits, or
 b. Restructure code to match predict not-taken
Performance: 75-80% accuracy
 – Much higher for "easy" cases

Dynamic Branch Prediction
Main advantages:
 – Learn branch behavior autonomously: no compiler analysis, heuristics, or profiling
 – Adapt to changing branch behavior: program phase changes branch behavior
First proposed in 1980
 – US Patent #4,370,711, "Branch predictor using random access memory," James E. Smith
Continually refined since then

Smith Predictor Hardware
2^m k-bit saturating counters, indexed by m bits of the branch address
 – Counter is incremented/decremented by the branch outcome
 – Most significant bit of the counter gives the branch prediction
Jim E. Smith. A Study of Branch Prediction Strategies. International Symposium on Computer Architecture, pages 135-148, May 1981.
Widely employed: Intel Pentium, PowerPC 604, PowerPC 620, etc.
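The saturating-counter scheme above can be sketched in software. This is an illustrative model only (the class and parameter names are invented, not taken from any real design), assuming a table of 2^m k-bit counters initialized to weakly taken:

```python
class SmithPredictor:
    """Sketch of a Smith predictor: 2**m saturating k-bit counters,
    indexed by the low-order m bits of the branch address."""

    def __init__(self, m=10, k=2):
        self.mask = (1 << m) - 1
        self.max = (1 << k) - 1                      # counter saturates here
        self.table = [self.max // 2 + 1] * (1 << m)  # initialize weakly taken

    def predict(self, pc):
        # The most significant counter bit gives the direction prediction.
        return self.table[pc & self.mask] > self.max // 2

    def update(self, pc, taken):
        i = pc & self.mask
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max)  # saturate up
        else:
            self.table[i] = max(self.table[i] - 1, 0)         # saturate down

p = SmithPredictor()
for _ in range(3):
    p.update(0x40, taken=True)
print(p.predict(0x40))  # a repeatedly taken branch is now predicted taken
```

Note the hysteresis a 2-bit counter provides: a single not-taken outcome of a mostly-taken branch (for example, one loop exit) does not immediately flip the prediction.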
Two-level Branch Prediction
[Diagram: the branch history register (BHR) supplies an index into a pattern history table (PHT) of prediction bits]
BHR adds global branch history
 – Provides more context
 – Can differentiate multiple instances of the same static branch
 – Can correlate behavior across multiple static branches

Combining or Hybrid Predictors
[Diagram: two component predictors P0 and P1 (e.g., gshare and PAp), both indexed by the branch address; a meta-predictor M selects which component's prediction to use]
Select "best" history
Reduce interference w/ partial updates
Scott McFarling. Combining Branch Predictors. TN-36, Digital Equipment Corporation Western Research Laboratory, June 1993.

Branch Target Prediction
[Diagram: the branch address indexes a branch target buffer (BTB) of tag/target pairs, in parallel with the branch direction predictor; on a BTB hit the taken target is selected, otherwise the not-taken target (branch address + instruction size)]
Partial tags sufficient in BTB

Return Address Stack
[Diagram: (a) the BTB alone predicts the target; (b) if the instruction is a return, the target comes from a hardware return address stack instead]
For each call/return pair:
 – Call: push return address onto hardware stack
 – Return: pop return address from hardware stack

Branch Speculation
[Diagram: a tree of taken (T) / not-taken (NT) outcomes; each unresolved speculated branch is assigned a tag (TAG 1, TAG 2, TAG 3)]
Leading speculation
 – Typically done during the Fetch stage
 – Based on potential branch instruction(s) in the current fetch group
Trailing confirmation
 – Typically done during the Branch Execute stage
 – Based on the next branch instruction to finish execution

Branch Speculation
Leading speculation:
1. Tag speculative instructions
2. Advance branch and following instructions
3. Buffer addresses of speculated branch instructions
Trailing confirmation:
1. When branch resolves, remove/deallocate speculation tag
2.
Permit completion of branch and following instructions

Branch Speculation
[Diagram: the same T/NT speculation tree, distinguishing the predicted path from the alternate paths]
Start new correct path
 – Must remember the alternate (non-predicted) path
Eliminate incorrect path
 – Must ensure that the mis-speculated instructions produce no side effects

Mis-speculation Recovery
Start new correct path:
1. Update PC with computed branch target (if predicted NT)
2. Update PC with sequential instruction address (if predicted T)
3. Can begin speculation again at next branch
Eliminate incorrect path:
1. Use tag(s) to deallocate resources occupied by speculative instructions
2. Invalidate all instructions in the decode and dispatch buffers, as well as those in reservation stations

Summary: Instruction Fetch
Fetch group alignment
Target address generation
 – Branch target buffer
 – Return address stack
Target condition generation
 – Static prediction
 – Dynamic prediction
Speculative execution
 – Tagging/tracking instructions
 – Recovering from mispredicted branches

Issues in Decoding
Primary tasks:
 – Identify individual instructions (!)
 – Determine instruction types
 – Determine dependences between instructions
Two important factors:
 – Instruction set architecture
 – Pipeline width

Pentium Pro Fetch/Decode
[Diagram]

Predecoding in the AMD K5
[Diagram]

Dependence Checking
[Diagram: within a fetch group, each trailing instruction's source registers (Src0, Src1) are compared (?=) against the destination registers (Dest) of all leading instructions]
Trailing instructions in fetch group
 – Check for dependence on leading instructions

Instruction Dispatch and Issue
Parallel pipeline
 – Centralized instruction fetch
 – Centralized instruction decode
Diversified pipeline
 – Distributed instruction execution

Necessity of Instruction Dispatch
[Diagram]

Centralized Reservation Station
[Diagram]

Distributed Reservation Station
[Diagram]

Issues in Instruction Execution
Parallel execution units
 – Bypassing is a real challenge
Resolving register data dependences
 – Want out-of-order instruction execution
Resolving memory data dependences
 – Want loads to issue as soon as possible
Maintaining precise exceptions
 – Required by the ISA

Bypass Networks
[Diagram: the Power4 pipeline again: issue queues feeding the FP, FX, LD, CR, and BR units, backed by the reorder buffer, store queue, and D-cache]
O(n^2) interconnect from/to FU inputs and outputs
Associative tag-match to find operands
Solutions (hurt IPC, help cycle time):
 – Use RF only (IBM Power4) with no bypass network
 – Decompose into clusters (Alpha 21264)

The Big Picture
Instruction processing constraints:
 – Resource contention (structural dependences)
 – Code dependences:
   – Control dependences
   – Data dependences (storage conflicts):
     – True dependences (RAW)
     – Anti-dependences (WAR)
     – Output dependences (WAW)

Register Data Dependences
Program data dependences cause hazards
 – True dependences (RAW)
 – Antidependences (WAR)
 – Output dependences (WAW)
When are registers
read and written?
 – Out of program order!
 – Hence, any/all of these can occur
Solution to all three: register renaming

Register Renaming: WAR/WAW
Widely employed (Core i7, Athlon/Phenom, …)
Resolving WAR/WAW:
 – Each register write gets a unique "rename register"
 – Writes are committed in program order at writeback
 – WAR and WAW are not an issue
   – All updates to "architected state" delayed till writeback
   – Writeback stage always later than read stage
   – Reorder buffer (ROB) enforces in-order writeback
Example:
 Add R3 <= …    P32 <= …
 Sub R4 <= …    P33 <= …
 And R3 <= …    P35 <= …

Register Renaming: RAW
In order, at dispatch:
 – Source registers checked to see if "in flight"
   – Register map table keeps track of this
   – If not in flight, can be read from the register file
   – If in flight, look up the "rename register" tag (an IOU for the value)
 – Then, allocate new register for register write
Example:
 Add R3 <= R2 + R1    P32 <= P2 + P1
 Sub R4 <= R3 + R1    P33 <= P32 + P1
 And R3 <= R4 & R2    P35 <= P33 & P2

Register Renaming: RAW
Advance instruction to reservation station
 – Wait for rename register tag to trigger issue
Reservation station enables out-of-order issue
 – Newer instructions can bypass stalled instructions

"Dataflow Engine" for Dynamic Execution
[Diagram: the dispatch buffer reads the register file and rename registers; dispatch allocates reorder buffer entries; reservation stations feed the branch, integer, floating-point, and load/store units]
Forwarding results to Res. Sta.
& rename registers
Reorder buffer: managed as a queue; maintains sequential order of all instructions in flight through Complete

Instruction Processing Steps
DISPATCH:
 • Read operands from register file (RF) and/or rename buffers (RRB)
 • Rename destination register and allocate RRF entry
 • Allocate reorder buffer (ROB) entry
 • Advance instruction to appropriate reservation station (RS)
EXECUTE:
 • RS entry monitors bus for register tag(s) to latch in pending operand(s)
 • When all operands ready, issue instruction into functional unit (FU) and deallocate RS entry (no further stalling in execution pipe)
 • When execution finishes, broadcast result to waiting RS entries, RRB entry, and ROB entry
COMPLETE:
 • Update architected register from RRB entry, deallocate RRB entry, and if it is a store instruction, advance it to the store buffer
 • Deallocate ROB entry; the instruction is considered architecturally completed

Physical Register File
[Diagram: pipeline Fetch → Decode → Rename → Issue → RF Read → Execute/Agen-D$ → RF Write; a map table (R0 => P7, R1 => P3, …, R31 => P39) points into a single physical register file]
Used in MIPS R10000, Pentium 4, AMD Bulldozer
All registers in one place
 – Always accessed right before EX stage
 – No copying to real register file at commit

Managing Physical Registers
Map table: R0 => P7, R1 => P3, …, R31 => P39
Example:
 Add R3 <= R2 + R1    P32 <= P2 + P1
 Sub R4 <= R3 + R1    P33 <= P32 + P1
 …
 And R3 <= R4 & R2    P35 <= P33 & P2
 – Release P32 (previous R3) when the And (the next write to R3) completes execution
What to do when all physical registers are in use?
 – Must release them somehow to avoid stalling
 – Maintain free list of "unused" physical registers
Release when no more uses are possible
 – Sufficient: next write commits

Memory Data Dependences
WAR/WAW: stores commit in order
 – Hazards not possible. Why?
RAW: loads must check pending stores
 – Store queue keeps track of pending store addresses
 – Loads check against these addresses
 – Similar to register bypass logic
 – Comparators are 32 or 64 bits wide (address size)
[Diagram: load/store reservation station → address generation (Agen) → memory, with the store queue checked in parallel and the reorder buffer tracking order]
Major source of complexity in modern designs
 – Store queue lookup is position-based
 – What if store address is not yet known?

Increasing Memory Bandwidth
[Diagram: the dataflow engine extended with two load/store units, a missed-load buffer, and a store buffer draining to the data cache at retire]

Issues in Completion/Retirement
Out-of-order execution
 – ALU instructions
 – Load/store instructions
In-order completion/retirement
 – Precise exceptions
Solutions
 – Reorder buffer retires instructions in order
 – Store queue retires stores in order
 – Exceptions can be handled at any instruction boundary by reconstructing state out of ROB/SQ

A Dynamic Superscalar Processor
[Diagram]

Superscalar Summary
[Diagram: the FETCH / DECODE / EXECUTE / COMMIT overview again: instruction flow, register data flow, and memory data flow]

Landscape of Microprocessor Families [John DeVale & Bryan Black, 2005]
[Chart: SPECint2000/MHz versus frequency (MHz) for Intel x86, AMD x86, Itanium, and Power families (Power3, Power4, Power5, Opteron, Athlon, PIII, P4, Itanium, etc.); Performance = Frequency / (PathLength × CPI). Data source: www.spec.org]

Superscalar Summary
Instruction flow
 – Branches, jumps, calls: predict target, direction
 – Fetch alignment
 – Instruction cache misses
Register data flow
 – Register renaming: RAW/WAR/WAW
Memory data flow
 – In-order stores: WAR/WAW
 – Store queue: RAW
 – Data cache misses: missed load buffers
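The register renaming summarized above can be sketched as a map table plus a free list. This is an illustrative model (the class and method names are invented), assuming 32 architected registers initially mapped to P0-P31 and a sequential free list starting at P32; the physical register numbers below follow from that assumption rather than from the slides:

```python
class Renamer:
    """Illustrative map-table + free-list renamer (names invented).

    Sources read the current mapping, which turns RAW dependences into
    physical-register tags; each destination write gets a fresh physical
    register, which removes WAR and WAW hazards entirely.
    """

    def __init__(self, arch_regs=32, phys_regs=64):
        self.map = {f"R{i}": f"P{i}" for i in range(arch_regs)}     # map table
        self.free = [f"P{i}" for i in range(arch_regs, phys_regs)]  # free list

    def rename(self, dest, srcs):
        renamed_srcs = [self.map[s] for s in srcs]  # read mapping before the write
        self.map[dest] = self.free.pop(0)           # allocate fresh physical reg
        return self.map[dest], renamed_srcs

r = Renamer()
print(r.rename("R3", ["R2", "R1"]))  # Add R3 <= R2 + R1  ->  ('P32', ['P2', 'P1'])
print(r.rename("R4", ["R3", "R1"]))  # Sub R4 <= R3 + R1  ->  ('P33', ['P32', 'P1'])
print(r.rename("R3", ["R4", "R2"]))  # And R3 <= R4 & R2  ->  ('P34', ['P33', 'P2'])
```

Returning a physical register to the free list (e.g., releasing the old mapping of R3 once the next write to R3 commits) is omitted for brevity; that is exactly the release policy discussed under "Managing Physical Registers."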