Lecture 7: Register Renaming

Three register hazards, each shown with a two-instruction pair. Initial values: R1 = 5, R2 = -2, R3 = 9, R4 = 3.

Read-After-Write
  A: R1 = R2 + R3
  B: R4 = R1 * R4
  In order (A, B): R1 = 7, R4 = 21. Reordered (B, A): R1 = 7, R4 = 15.

Write-After-Read
  A: R1 = R3 / R4
  B: R3 = R2 * R4
  In order (A, B): R1 = 3, R3 = -6. Reordered (B, A): R1 = -2, R3 = -6.

Write-After-Write
  A: R1 = R2 + R3
  B: R1 = R3 * R4
  In order (A, B): R1 = 27. Reordered (B, A): R1 = 7.

• Register Data Dependencies (this lecture)
  – True dependence (RAW), dt
  – Anti-dependence (WAR), da
  – Output dependence (WAW), do
  – Why is RAR not a dependency?
• Memory Data Dependencies (later lecture)
• Control Dependencies (earlier lectures)
• Structural Dependencies
  – Instruction must wait until some “structure” is available
    • Ex: Divider, ROB entry, Branch color/tag, etc.

• WAR dependencies are from reusing registers
  A: R1 = R3 / R4
  B: R3 = R2 * R4
  – Rename B's destination from R3 to R5 (R5 initially holds 4):
  A: R1 = R3 / R4
  B: R5 = R2 * R4
  – Either order now gives R1 = 3 and R5 = -6, and R3 keeps its value of 9
  – With no dependencies, reordering still produces the correct results

• WAW dependencies are also from reusing registers
  A: R1 = R2 + R3
  B: R1 = R3 * R4
  – Rename A's destination from R1 to R5:
  A: R5 = R2 + R3
  B: R1 = R3 * R4
  – Either order now gives R1 = 27 and R5 = 7
  – Same solution works

• Finite number of registers
  – At some point, you’re forced to overwrite somewhere
  – Most RISC: 32 registers, x86: only 8, x86-64: 16
• Loops, Code Reuse
  – If you write a value to R1 in a loop body, then R1 will be reused every iteration, which induces many false dependencies
  – Loop unrolling can help a little
    • Will run out of registers at some point anyway
    • Trade off with code bloat
  – Short function calls can result in similar register reuse
    • Inlining can help a little

• Add more registers to the ISA? BAD!!!
  – Changing the ISA can break binary compatibility
    • x86-64 mostly doesn’t break compatibility, but it’s a hack
  – All code must be recompiled
  – Does not address register overwriting due to code reuse from loops and function calls
  – Not a scalable solution

• Processor has more registers than specified by the ISA
  – Temporarily map ISA registers (“logical” or “architected” registers) to the physical registers to avoid overwrites
• Components:
  – mapping mechanism
  – physical registers
    • allocated vs. free registers
    • allocation/deallocation mechanism
  – state maintenance (commit, mispredictions, etc.)

[Figure: architected registers R0-R7 mapped onto physical registers T0 ... Tn-1]

  Original Code          Renamed Code
  R2 = R1 + R3           T1 = R1 + R3
  R4 = R2 - R6           R4 = T1 - R6
  …                      …
  R2 = R7 / R5           T20 = R7 / R5
  BEQ R2, #1             BEQ T20, #1
  …                      …
  R2 = R4 * R1           T7 = R4 * R1
  R6 = Load [R2]         R6 = Load [T7]

The WAW and WAR hazards on R2 in the original code disappear in the renamed code: No False Dependencies!
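The renamed code above comes from giving every newly written value its own physical register. The following is a minimal sketch of that idea, not the lecture's hardware: the function name `rename`, the tuple encoding of instructions (operators omitted), and the tags T0-T2 are assumptions made for illustration, and the free pool is assumed never to run dry. Instruction C is not from the slides; it is added to show that a true dependence survives renaming.

```python
from collections import deque


def rename(instrs, free_pool):
    """Rename (dest, src1, src2) triples in program order.

    `rat` maps an architected register name to a physical tag. A source
    with no entry keeps its architected name (value still in the ARF);
    immediates such as "#1" pass through the same way.
    """
    rat = {}
    free = deque(free_pool)
    renamed = []
    for dest, src1, src2 in instrs:
        # Read the sources before updating the destination, so that
        # R1 = R1 + R2 still reads the previous version of R1.
        tag1 = rat.get(src1, src1)
        tag2 = rat.get(src2, src2)
        # Every new value gets a fresh physical register.
        tag_d = free.popleft()
        rat[dest] = tag_d
        renamed.append((tag_d, tag1, tag2))
    return renamed, rat


program = [
    ("R1", "R3", "R4"),  # A: R1 = R3 / R4
    ("R3", "R2", "R4"),  # B: R3 = R2 * R4   (WAR with A on R3)
    ("R1", "R3", "R4"),  # C: R1 = R3 * R4   (WAW with A, RAW with B)
]
renamed, rat = rename(program, ["T0", "T1", "T2"])
for line in renamed:
    print(line)
```

After renaming, A and B write different physical registers than any earlier reader or writer uses, so only C's read of B's result (tag T1) still constrains the order.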
Renaming one instruction, Dest = Src1 op Src2:
  – Look up each source in the mapping mechanism: Src1 gives TagS1, Src2 gives TagS2
  – Take TagD from the unmapped physical registers and record the mapping from Dest to TagD
  – The renamed instruction is TagD = TagS1 op TagS2
  – Repeat for each instruction

• Lookup Table (RAT)
  – One entry per architected register
  – Entry stores physical location of most recent version of the logical register
  – Most recent version may be in the physical register file (PRF) or in the architected register file (ARF)

Example. The RAT and free pool are shown as they are before each instruction is renamed; “-” means the value is in the ARF.

  Instruction     Renamed            RAT (R0 ... R7)           Free PRegs
  R1 = R2 + R3    T13 = R2 + R3      -  -  -  -  -  -  -  -    T13, T14, T9, T7
  R5 = R4 – R1    T14 = R4 – T13     - 13  -  -  -  -  -  -    T14, T9, T7
  R1 = R1 * R5    T9 = T13 * T14     - 13  -  -  - 14  -  -    T9, T7
  R2 = R5 / R1    T7 = T14 / T9      -  9  -  -  - 14  -  -    T7
  (final state)                      -  9  7  -  - 14  -  -    (empty)

Renaming a group of instructions in one cycle:
  R1 = R2 + R3
  R4 = R5 – R7
  R3 = R0 / R2
  R5 = Ld 12[R6]
  – Source tags are read from the RAT; destination tags (T10, T31, T19, T6) come from the free register pool
  – Don’t rename immediates
  – For N-wide superscalar: 2N RAT read-ports, N RAT write-ports

A dependency inside the group breaks this:
  R1 = R2 + R3
  R4 = R5 – R7
  R3 = R0 / R1
  R5 = Ld 12[R6]
  – The RAT read for R1 in the third instruction returns the old mapping: this is the wrong version of R1
  – It should be using the tag just allocated to the first instruction's destination

  R1 = R2 + R1
  R2 = R1 – R2
  R1 = R2 / R1
  R1 = R2 >> R1
  – Result of sequential renaming, with destinations T10, T31, T19, T6 taken in order from the free register pool and [Rx] standing for the mapping the RAT held before the group:
    T10 = [R2] + [R1]
    T31 = T10 – [R2]
    T19 = T31 / T10
    T6 = T31 >> T19

Intra-Group Dependency Checker
[Figure: Inst 0 to Inst 3 each supply Src L, Src R and Dest; the RAT and the free register pool feed the checker, which outputs the final source tags T0L, T0R, ..., T3L, T3R]
  – No check is needed for Inst 0, since the 1st inst in a group has no earlier insts to be dependent on
  – Similarly, src1L and src1R cannot be dependent on dst1, dst2 or dst3

[Figure: comparator array; each source of instruction i is compared with the destinations of all earlier instructions in the group]
  – Total number of comparisons for n instructions: 2 × n(n-1)/2 = n² – n = O(n²)
  – N-wide rename has O(N) gate delay?
  – [Figure: the 8-wide case, src7R compared with dst0 ... dst6] Gate delay reduced down to O(log₂ N)

Updating the RAT for a group:
  R1 = R2 + R1
  R2 = R1 – R2
  R1 = R2 / R1
  R1 = R2 >> R1
  – Only the mapping from the last writer of R1 (the fourth instruction) should be written into the RAT
  – Condition: use mapping if instruction is last writer to the register
    • use dst0 if dst0 != dst1, dst0 != dst2 and dst0 != dst3
    • use dst1 if dst1 != dst2 and dst1 != dst3
    • use dst2 if dst2 != dst3
    • always use dst3

Commit with a separate ARF (example: R3, mapped by the RAT to T42):
  – Architected register file contains the committed/non-speculative processor state
  – When an instruction commits, it updates the ARF with the new value
  – The ARF now contains the correct value; update the RAT so that R3 refers to the ARF
  – T42 is no longer needed, return to the physical register free pool

Commit when a newer writer exists (the committing instruction wrote T42, but the RAT maps R3 to T17):
  – Update ARF as usual
  – Deallocate physical register
  – Don’t touch the RAT! (Someone else is the most recent writer to R3)
  – At some point in the future, the newer writer of R3 commits
    • This instruction was the most recent writer, now update the RAT
    • Deallocate physical register

• PRF unified with the ROB
  [Figure: ROB holding instructions in program order from ROB_head (oldest) to ROB_tail; each entry has an inst field and a data field, and the data fields form the PRF]

• Free registers = all entries from ROB_tail to ROB_head – 1
• Instructions allocated into ROB in-order, so physical registers also allocated in same order
  – dst_i = T[ROB_tail]
  – dst_i+1 = T[(ROB_tail + 1) % ROB_size]
  – dst_i+2 = T[(ROB_tail + 2) % ROB_size]
  – …
  – dst_i+N-1 = T[(ROB_tail + N-1) % ROB_size]

• No need to explicitly manage free pool
  – just increment ROB_tail as physical registers are allocated, increment ROB_head as registers are deallocated
• Inefficiency: allocate registers to all instructions
  – Branches, stores (and some other insts) don’t need physical registers
• Asymmetric datapath
  – sometimes read values from ARF, sometimes from the PRF
  – requires both structures to be heavily ported

• Combine both ARF and PRF into a single register file
  – Before, ARF and PRF could be the same hardware structure, but they have distinct name spaces
    • e.g., ARF (R0-R7) mapped to T0-T7 and PRF mapped to T8-T99
  – For a unified RF, the committed R0 could be mapped anywhere (T0-T99)
    • Need some way to track the “committed” state

Speculative RAT and Committed RAT (each with entries R0-R7):
  – The committed RAT, along with the registers it points at, implements the logical equivalent of the ARF
  – The speculative RAT tracks the locations of the most recent version of each architected register
  – Both RATs may point to the same physical location (R0 and R5 in the figure): the most recent writer has also committed

Example with a unified register file:
  A: R1 = R2 + R4      T8 = T2 + T4
  B: R4 = R2 – R7      T9 = T2 – T7
  C: R2 = R1 * R4      T10 = T8 * T9
  D: R1 = R1 + #1      T11 = T8 + #1
  E: R7 = R4 / R1      T1 = T9 / T11
  [Figure: register file T0-T23, ROB holding A to E, speculative and committed RATs, and the free pool]

• Previous example showed a stack data structure (LIFO)
  – Stack HW is complex due to need to simultaneously read and write the top-of-stack
  [Figure: free-pool stack with TOS; registers go to the 4-wide rename stage and return from commit; 3 regs allocated]
• A queue structure (FIFO) is easier to implement
  – independent reading/writing of head and tail
  [Figure: queue T8 T17 T23 T25 T34 T1 T9 T13 T28 between Pool Head and Pool Tail; 3 regs allocated, 2 regs deallocated]
• Corner case still exists when pool is empty
  – Either stall rename for one cycle or need more complex HW to bypass dealloc’d registers to the renamer
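The FIFO free pool can be sketched as a circular buffer with independent head and tail pointers. This is an illustrative software model under stated assumptions, not the lecture's circuit: the class name `FreePool`, the choice to allocate from the head and release to the tail, and the tags T40 and T41 are hypothetical. An allocation that returns fewer tags than requested models the empty-pool corner case where rename has to stall.

```python
class FreePool:
    """Circular FIFO of free physical register tags (hypothetical model)."""

    def __init__(self, tags):
        self.slots = list(tags)       # fixed storage for the queue
        self.head = 0                 # next tag handed to rename
        self.tail = 0                 # next slot written by commit
        self.count = len(self.slots)  # number of free tags in the queue

    def allocate(self, n):
        """Pop up to n tags from the head; fewer than n means a stall."""
        n = min(n, self.count)
        tags = []
        for _ in range(n):
            tags.append(self.slots[self.head])
            self.head = (self.head + 1) % len(self.slots)
        self.count -= n
        return tags

    def release(self, tags):
        """Push freed tags at the tail, without touching the head."""
        for tag in tags:
            if self.count == len(self.slots):
                raise OverflowError("free pool is already full")
            self.slots[self.tail] = tag
            self.tail = (self.tail + 1) % len(self.slots)
            self.count += 1


pool = FreePool(["T8", "T17", "T23", "T25", "T34", "T1", "T9", "T13", "T28"])
first = pool.allocate(3)       # 3 regs allocated by rename
pool.release(["T40", "T41"])   # 2 regs deallocated at commit
rest = pool.allocate(8)        # drains the pool, wrapping around
short = pool.allocate(4)       # pool is empty: nothing to hand out
print(first, rest, short)
```

Because allocation only moves the head and release only moves the tail, the two operations never contend for the same pointer, which is the property the slide contrasts with the stack.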
The RAT after a branch misprediction:
  – ARF state corresponds to state prior to oldest non-committed instruction
  – As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
  – On a branch misprediction, wrong-path instructions are flushed from the machine
  – The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state

Stall-and-drain recovery:
  – Allow all instructions to execute and commit; ARF corresponds to last committed instruction
  – ARF now corresponds to the state right before the next instruction to be renamed (foo)
  – Reset RAT so that all mappings refer to the ARF
  – Resume renaming the new correct-path instructions from fetch
  – Pros: Very simple to implement
  – Cons: Performance loss due to stalls; correct-path instructions arrive from fetch, but can’t rename because RAT is wrong

Checkpoint recovery:
  – At each branch, make a copy of the RAT (register mapping at the time of the branch)
  [Figure: one RAT checkpoint per in-flight branch, taken from a Checkpoint Free Pool]
  – On a misprediction:
    1. flush wrong-path instructions
    2. deallocate RAT checkpoints
    3. recover RAT from checkpoint
    4. resume renaming

• No need to stall front-end (?)
  – need to “flash copy” RAT
    • both for making checkpoints and recovering
  – need some way to “hunt down” wrong-path checkpoints for deallocation
    • can “walk” the ROB, but this may take more than one cycle which may introduce stalls; still faster than stall-and-drain
• More hardware
  – need one checkpoint per branch
  – what if the code has nothing but branches?
    • worst case needs one checkpoint per ROB entry
    • can assign one checkpoint per branch color
  – stall front-end when out of branch colors/checkpoints

• Each register-writing ROB entry tracks two physical registers
  1. Its allocated destination register
  2. The previous physical register mapping for its architected register
• Example
  – R1 mapped to T23
  – Rename new instruction X, which overwrites R1
    • R1 now mapped to T19
    • X also records an “undo mapping” of T23
  – Recovery: walk ROB backwards applying the undo mappings
• Lower overhead: don’t need full copies of the RAT
• Slower?: need to walk the ROB
• Flexibility: can recover to any instruction; not just branches

• For ROB-based PRF, deallocation is simple:
  – ROB_tail reset to point right after the mispredicted branch
• For unified RF, allocated registers may be anywhere in the register file
  – Some sort of ROB walk still required to deallocate the wrong-path PRegs; do at same time with checkpoint deallocation
  [Figure: PReg Free Pool, ROB containing branches, Committed RAT]

RAT implementation:
  – Highly ported SRAM; 3N ports: 2N read, 1N write
  – Typical N = 3, 4; |ARF| = 60-100; |PRF| = 100±
  – Only 60-100 bytes, but 9-12 ports
  – SRAM latency typically quadratic w.r.t. #ports
  – 1 entry per architected register: includes int, FP, MMX/SSE, lo/hi (MIPS), control registers, FP status, predicate registers (IA64), flags (x86), etc.
  – Each entry is log₂ |PRF| bits wide, plus 1 valid bit when RF not unified (!valid means the register is in the ARF)
Dep Check Logic:
  – Almost full pairwise dependency checks: O(N²) comparisons

• SRAM lookup easily pipelined
• Dependency check is just combinational logic; easily pipelined
• What if there’s a dependency between groups?
  [Figure: groups ABCD and EFGH passing through rename stages REN1 and REN2]
  – ABCD haven’t updated the RAT when EFGH reads the RAT

• Similar to intra-group dependency checking, now must perform inter-group dependency checking
  – Register mappings if no dependencies
  – Overrides if dependency exists between ABCD and EFGH

[Figure: cycle time as rename is pipelined: 1 ns/cycle (1 GHz), 0.5 ns/cycle (2 GHz), 0.32 ns/cycle (3.14 GHz); each step splits the original renaming logic and adds overhead due to pipelined rename]

• More stages
  – higher branch mispredict penalty
  – a lot more implementation complexity
    • dep check with previous group, prev-prev group, etc.
    • pipeline control logic, latching overhead
    • more circuits (area, power), more design effort
• Higher frequency
  – more performance if pipeline not overly exposed
    • need sufficiently high branch prediction accuracy
    • power goes up even more (P = ½CV²fa)
      – This is on top of the extra power for the extra circuits
      – Extra logic effectively increases the C term

• How big should the physical register file be?
  – ROB-based: PRF entries == ROB entries
  – Unified: ???
• Should have one register per instruction
  – How to count instructions?
  – Every instruction from rename to retire
    • instructions in fetch/decode stages haven’t been renamed, and therefore don’t need physical registers
    • Not every instruction needs a register (branches, stores)
• How many instructions does this add up to?
  – N × Stages(Rename to Dispatch) + ROB_size
  – Less those expected to not need destinations

Lifetime of a physical register across the pipeline (IF, ID, REN, Disp, RS, ROB, Commit):
  1. No register allocated
  2. Register allocated, but contents are bogus
     – PRF needs to be large enough for all instructions in Region 2, but none of the registers will contain anything useful!
  3. Register contains valid data
     – This is the only time a physical storage location is really needed
     – Actually, only needed until last consumer reads the value
  4. Overwriter commits; register has stale value; deallocate
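Step 4 of the lifetime above, freeing a register only when the instruction that overwrites it commits, can be sketched for a unified register file as follows. This is a minimal model under stated assumptions: the class name `UnifiedRenamer`, the integer physical register numbers, and the initial identity mapping are hypothetical, and only destination registers are tracked. Each ROB entry keeps the previous mapping of its destination, the same field the "undo mapping" recovery scheme records.

```python
from collections import deque


class UnifiedRenamer:
    """Tracks when physical registers are allocated and freed (sketch)."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architected register i starts in physical register i.
        self.rat = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.rob = deque()  # entries: (arch_dest, new_preg, prev_preg)

    def rename(self, arch_dest):
        """Allocate a register for a new writer; None means rename stalls."""
        if not self.free:
            return None
        new = self.free.popleft()
        prev = self.rat[arch_dest]
        self.rat[arch_dest] = new
        self.rob.append((arch_dest, new, prev))
        return new

    def commit(self):
        """Commit the oldest instruction and free the register it overwrote."""
        _arch_dest, _new, prev = self.rob.popleft()
        self.free.append(prev)  # the old version is now stale
        return prev


r = UnifiedRenamer(num_arch=4, num_phys=6)
a = r.rename(1)        # first writer of R1 gets physical register 4
b = r.rename(1)        # second writer of R1 gets physical register 5
stalled = r.rename(2)  # pool is empty, so rename must wait
freed_a = r.commit()   # first writer commits: old R1 (register 1) is freed
c = r.rename(2)        # the freed register is reused for R2
freed_b = r.commit()   # second writer commits: register 4 is freed
print(a, b, stalled, freed_a, c, freed_b, r.rat)
```

Register 4 stays allocated from the moment its writer is renamed until the next writer of R1 commits, even though its value may have been read for the last time much earlier, which is the inefficiency the lifetime breakdown points out.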