Out-of-Order Execution Structures
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

MIPS R10000-Like Design
• Based on:
  – Complexity-Effective Superscalar Processors
  – S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97

Fetch Phase
• Fetch:
  – Read instructions from I-Cache
  – Predict branches
  – Pass on to Decode phase

Decode Phase
• Decode:
  – Parse instruction
  – Shuffle opcode parts to appropriate ports for rename

Renaming Phase
• Rename:
  – Map architectural registers to physical registers
  – Eliminate false dependences
  – Pass renamed instructions to scheduler
    • Called Dispatch

Scheduling Phase
• Wakeup:
  – Instructions check whether they become ready
  – From Writeback: physical register names
• Select:
  – Among the ready instructions, select those to execute
  – Structural hazards

Register File Read Phase
• Read source operands

Bypass and Execute Phase

Data Cache Access Phase

Writeback Phase
• Write result to register file
• Broadcast tag in order to wake up waiting instructions
  – Notice that the tag broadcast should happen TWO cycles in advance of the result production

Reservation Station Model
• Used by Pentium Pro, PowerPC 604
• Re-order buffer holds values
• Renaming points to re-order buffer entries
  – Tomasulo-like

Physical Register File vs. Reservation Station
• Physical Register File
  – Values reside in the register file
  – At writeback, instructions broadcast the register name
• Reservation Stations:
  – Values reside:
    – In the register file upon commit
      • Non-speculative
    – In reservation stations prior to commit
      • Speculative
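
To make the physical-register-file flavour of renaming concrete, here is a minimal sketch of the Rename phase described above: a register alias table (RAT) plus a free list. The class name, the table sizes (32 architectural, 64 physical registers) and the identity initial mapping are assumptions chosen for illustration, not details taken from the slides.

```python
from collections import deque


class RenameTable:
    """Toy register alias table (RAT) plus free list; sizes are illustrative."""

    def __init__(self, num_arch=32, num_phys=64):
        # Assumption: architectural register i initially maps to physical register i.
        self.rat = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))

    def rename(self, src1, src2, dest):
        """Rename one instruction; returns (ps1, ps2, new_pd, old_pd)."""
        ps1 = self.rat[src1]          # read current mappings of the sources
        ps2 = self.rat[src2]
        old_pd = self.rat[dest]       # previous mapping, kept for recovery
        new_pd = self.free.popleft()  # fresh physical register (raises if none free)
        self.rat[dest] = new_pd       # later readers of dest see the new name
        return ps1, ps2, new_pd, old_pd


rt = RenameTable()
# r1 = r2 + r3 ; r4 = r1 + r1 ; r1 = r5 + r6
first = rt.rename(2, 3, 1)
second = rt.rename(1, 1, 4)
third = rt.rename(5, 6, 1)
print(first, second, third)
```

The second instruction reads the new name of r1 (a true dependence is preserved), while the third instruction writes r1 into a different physical register than the first, which is how the false (write-after-write and write-after-read) dependences disappear.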

Quantifying Complexity
• Critical path delay as a function of architectural parameters
  – Instruction window size (WinSize)
  – Issue width (IW)
• Full-custom implementations
  – Study the critical path
  – Delay model
  – Extrapolate how it will scale with “future” technologies

Renaming
• Inputs:
  – IW instructions
  – Up to 2 x input register names
  – Up to 1 x output register name
• Outputs:
  – 2 x input physical registers
  – 1 x new output physical register
  – 1 x previous physical register name for checkpointing
  – Updated rename table
• Superscalar issue complicates things a bit

Renaming One Instruction
[Figure: the RAT (entries p0 to p31) is read through one port each for s1, s2 and the old mapping of d (kept for misspeculation recovery), and written through one port with a new register for d taken from the free list.]

Renaming Two Instructions
[Figure: two instructions (s1, s2, d, new d each) access the RAT in parallel; cross-bundle dependency check logic selects the outputs ps1, ps2, old d and new d for each instruction.]

Renaming More Instructions
• Dependency checking logic for instruction i must match its sources against the destinations of all preceding instructions in the group
• If there are multiple matches it must enforce priority:
  – Pick the one closest to this instruction

RAT: SRAM Implementation
[Figure: SRAM array with #ARCH REGS rows and lg(#PHYS REGS) columns; the architectural register number feeds the decoder, and sense amps on the bitlines output the physical register number.]

SRAM RAT cell
[Figure: SRAM RAT cell circuit.]

RAT: CAM Implementation
[Figure: CAM array with #PHYS REGS rows and lg(#ARCH REGS) columns plus an active bit per row; the architectural register number is matched against the CAM cells and an encoder outputs the physical register number.]
• One CAM entry per physical register
• Active bit indicates the current map
• New version by setting active bit

CAM Cell
[Figure: CAM cell with wordline, bitline, bitline_B and match line.]

SRAM vs. CAM
• SRAM:
  – Arch reg rows
  – Lg(phy reg) cols
  – SRAM read/write
• CAM:
  – Phy reg rows
  – Lg(arch reg) cols
  – CAM match
  – Update:
    • Reset previous valid bit
    • Set current valid bit

Scheduler: Part #1 - Wakeup
[Figure: wakeup logic.]

Scheduler: Part #2 - Select
• For a single FU
• Tree of arbiters
• Location-based select policy
• REQ signals travel up the tree, GRANT signals travel down
• Anyreq raised if any req is active; grant issued if arbiter enabled
• Root enabled if FU available

Select for More Than One FU
• Handling multiple FUs of the same type:
  – Stack select logic blocks in series (hierarchy)
  – Mask the request granted to the previous unit
• NOT feasible for more than 2 FUs
• Alternative:
  – Statically partition issue window among FUs
  – MIPS R10000, HP PA 8000

Datapath and Bypass
• Commonly used layout
• Turn on TriState A to pass result of FU1 to left operand of FU0
[Figure: 1 bit-slice of the datapath with bypass paths.]

Complexity Analysis
• Critical path delay as a function of:
  – Issue width
  – Window size
• Register renaming table
• Wakeup and select
• Bypass paths

Methodology
• Select a representative CMOS design from published alternatives
• Implement the circuits for 3 technologies:
  – 0.8 micron, 0.35 micron and 0.18 micron
• Optimize for speed
• Include wire parasitics in delay model
  – Rmetal, Cmetal

Methodology
• Feature size scaling: 1 / S
• Voltage scaling: 1 / U
• Logic Delay = (CL x V) / I
• Capacitive load: CL scales by 1 / S
• Supply voltage: V scales by 1 / U
• Average charge/discharge current: I scales by 1 / U
• So, Logic Delay scales by (1 / S x 1 / U) / (1 / U) = 1 / S
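
As a worked instance of the logic-delay scaling argument, consider moving from the 0.8 micron to the 0.18 micron process. The calculation below assumes ideal scaling, with S taken as the ratio of the two feature sizes; the slides do not give a numeric value for S or U.

```latex
S = \frac{0.8\,\mu\mathrm{m}}{0.18\,\mu\mathrm{m}} \approx 4.44,
\qquad
\frac{\text{Logic Delay}_{0.18}}{\text{Logic Delay}_{0.8}}
  = \frac{(1/S)\,(1/U)}{1/U}
  = \frac{1}{S} \approx 0.225
```

The voltage factor U cancels, so under these assumptions logic delay shrinks to roughly a quarter of its 0.8 micron value regardless of how aggressively the supply voltage is scaled.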

Wire Delay
• Delay = 0.5 x Rmetal x Cmetal x L²
• L: wire length
• Intrinsic RC delay
• Rmetal: resistance per unit length
• Cmetal: capacitance per unit length
• 0.5: 1st order approximation of distributed RC model
  – uniformly distributed R & C

Wire Delay Scaling
• Metal thickness doesn’t scale much
  – Width ~ 1/S
  – Rmetal ~ S
• Fringe capacitance dominates in smaller feature sizes
  – edges to parallel wires and the substrate
• Parallel plate
  – scales with 1 / S
  – Cmetal ~ S
• Length scales with 1/S
• Overall scale factor: S x S x (1/S)² = 1
• Wire delay remains constant

Register Renaming Table
[Figure: register renaming table.]

Dependency Checking Logic
• Accessed in parallel with map table
• Every logical reg compared against logical dest regs of current rename group
• For IW = 2, 4, 8, delay less than map table

Renaming Delay
• SRAM scheme
• Delay components:
  – Time to decode the arch reg index
  – Time to drive wordline
  – Time to pull down bit line
  – Time for SenseAmp to detect pull-down
  – MUX time ignored, as control from dependency check logic comes in advance

Renaming Circuit
[Figure: renaming circuit.]

Decoder Delay
[Figure: decoder circuit.]

Decoder Delay
• Predecoding for speed
• Length of predecode lines:
  – Cellheight: height of single cell excluding wordlines
  – Wordline spacing
• NVREG: # of virtual reg-s
• x3: 3-operand instr-s

Decoder Delay
• Tnand: fall delay of NAND
• Tnor: rise delay of NOR
• Req: Rnandpd (NAND pull-down channel resistance) + predecode line metal resistance
• Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance
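
A rough numeric sketch can show why wire length ties decoder delay to issue width. It assumes the predecode line length combines the terms listed above as (cellheight + 3 x IW x wordline spacing) x NVREG, and it reuses the first-order distributed RC wire delay 0.5 x Rmetal x Cmetal x L². The function names and every numeric default are made-up placeholders, not process data.

```python
def predecode_line_length(iw, nvreg=32, cell_height=1.0, wordline_spacing=0.1):
    """Assumed form: each cell grows by one wordline per port, 3 ports per instruction."""
    return (cell_height + 3 * iw * wordline_spacing) * nvreg


def wire_delay(length, r_metal=0.01, c_metal=0.02):
    """First-order distributed RC delay: 0.5 * Rmetal * Cmetal * L^2 (arbitrary units)."""
    return 0.5 * r_metal * c_metal * length ** 2


for iw in (2, 4, 8):
    length = predecode_line_length(iw)
    print(iw, round(length, 2), round(wire_delay(length), 4))
```

Under this model the line length is linear in IW, so the intrinsic RC term of the line is quadratic in IW, while terms driven by the gate resistance and the line capacitance stay linear in IW.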

Decoder Delay
• Substituting predecode line length, Req and Ceq, we get:
  – Delay = c0 + c1 x IW + c2 x IW²
• c2: intrinsic RC delay of predecode line
• c2 very small
• Decoder delay ~linearly dependent on IW

Rename Delay
• Wordline
• c2: intrinsic RC delay of wordline
• c2 very small
• Wordline delay ~linearly dependent on IW

Rename Delay
• Bitline:
• c2 very small
• Bitline delay ~linearly dependent on IW
• SenseAmp delay ~linearly dependent on IW

Rename Logic Delay Scaling
• Total delay increases linearly with IW
• Each component shows linear increase with IW
• Bitline delay > wordline delay
  – Bitline length ~ # of logical reg-s
  – Wordline length ~ width of physical reg designator
• IW impact on delay worsens with decreasing feature size: the increase in bitline and wordline delay with increasing IW gets larger
  – 0.8um: IW 2 → 8, bitline delay + 37%
  – 0.18um: IW 2 → 8, bitline delay + 53%

Wakeup Delay
• Critical path: a mismatch pulls the ready signal low
• Delay components:
  – Tag drivers drive tag lines (vertical)
  – Mismatched bit: pull-down stack pulls matchline low (horizontal)
  – Final OR gate ORs all the matchlines of an operand tag
• Ttagdrive ~ driver pull-up R & tagline length & tagline load C
• Quadratic component significant for IW > 2 & 0.18um

Wakeup Delay
• Quadratic component small for both cases
• Both delays ~linearly dependent on IW

Wakeup Delay: IW and Window Size
• 0.18um process
• Quadratic dependence
• Issue width has the greater effect: it increases all 3 delay components
• As IW and WinSize increase together, delay changes as plotted
[Figure: wakeup delay as a function of IW and window size.]
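
The wakeup step whose delay is analysed here can be sketched functionally as follows. This is a behavioural model only: hardware performs all comparisons in parallel on the tag lines and matchlines, and the `WindowEntry` class, the two-operand format and the tag values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    tags: tuple  # physical register tags of the two source operands
    ready: list = field(default_factory=lambda: [False, False])

    def is_ready(self):
        return all(self.ready)


def wakeup(window, broadcast_tags):
    """Compare every operand tag in the window against every broadcast result tag."""
    comparisons = 0
    for entry in window:
        for i, tag in enumerate(entry.tags):
            for result_tag in broadcast_tags:  # up to IW result tags per cycle
                comparisons += 1
                if tag == result_tag:
                    entry.ready[i] = True
    return comparisons


window = [WindowEntry((32, 33)), WindowEntry((33, 40)), WindowEntry((41, 42))]
count = wakeup(window, broadcast_tags=[32, 33])
print([e.is_ready() for e in window], count)
```

The comparison count is 2 x WinSize x IW, which mirrors why both parameters matter: more window entries lengthen the tag lines, and a wider issue adds tag lines and comparators to every entry.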

Wakeup Delay: Window Size
• 8-way & 0.18um process
• Tag drive delay increases rapidly as WinSize increases
• Match OR delay constant

Wakeup Delay: Feature size
• 8-way & 64-entry window
• Tag drive and tag match delays do not scale as well as Match OR delay
• Match OR is logic delay only
• The others also have wire delays

Selection Logic and Bypass Delay
• Selection
  – Logarithmically dependent on WinSize
• Bypass: delay dependent on (IW)²
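
The logarithmic dependence of selection on WinSize follows from the tree-of-arbiters structure. The sketch below is a behavioural model under stated assumptions: a fan-in of 4 per arbiter (the slides do not give the arbiter width), a location-based policy in which the lowest-numbered requesting entry wins, and hypothetical function names.

```python
import math


def select(requests, fu_available=True, fan_in=4):
    """Return the window index granted by a tree of fixed-priority arbiters, or None."""
    def grant(lo, hi):
        # Leaf: a single window entry.
        if hi - lo == 1:
            return lo if requests[lo] else None
        # Split this subtree into up to fan_in children.
        step = math.ceil((hi - lo) / fan_in)
        for start in range(lo, hi, step):
            end = min(start + step, hi)
            # anyreq of the child: is any request active below it?
            if any(requests[start:end]):
                return grant(start, end)  # enable only the first requesting child
        return None

    # The root is enabled only if the functional unit is available.
    if not fu_available or not requests:
        return None
    return grant(0, len(requests))


def tree_levels(win_size, fan_in=4):
    """Number of arbiter levels needed to cover win_size entries."""
    levels, span = 0, 1
    while span < win_size:
        span *= fan_in
        levels += 1
    return levels


requests = [False] * 16
requests[5] = requests[9] = True
print(select(requests), tree_levels(16), tree_levels(64))
```

Quadrupling the window from 16 to 64 entries adds only one arbiter level on the request path and one on the grant path, which is the logarithmic growth the slide refers to.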