ECE 4100/6100 Advanced Computer Architecture
Lecture 8: Dynamic Scheduling (II)
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Modern Processors
• Branch prediction results in speculative execution
• Speculative instructions (if wrongly speculated) must not alter the architecture states
  – Architecture registers
  – Memory
• Requirement of precise exceptions/interrupts

Modern Out-of-Order Core
• ALLOC: allocates instructions
• RS: Reservation Station issues instructions to functional units
• ROB: Reorder Buffer maintains state information (physical registers) for precise interrupts and speculative execution
• RAT: Register Alias Table renames architecture registers
• ARF: Architectural register file
• LSQ: Load Store Queue maintains memory access ordering

Register Renaming
Architected registers: R0, R1, ..., R7
Physical registers: T0, T1, ..., Tn-1
Original code (WAW and WAR dependencies on R2):
  R2 = R1 + R3
  R4 = R2 - R6
  …
  R2 = R7 / R5
  BEQ R2, #1
  …
  R2 = R4 * R1
  R6 = Load [R2]
Renamed code:
  T1 = R1 + R3
  R4 = T1 - R6
  …
  T20 = R7 / R5
  BEQ T20, #1
  …
  T7 = R4 * R1
  R6 = Load [T7]
No false dependencies!
Sandy Bridge: 160 PRs for INT, 144 PRs for FP
Adapted from Prof. G. Loh’s Slides

Register Renaming
• Input instruction: Dest = Src1 op Src2
• Mapping mechanism: Src1 maps to TagS1, Src2 maps to TagS2, and Dest receives TagD from the unmapped physical registers
• Renamed instruction: TagD = TagS1 op TagS2
• Repeat for each instruction
Adapted from Prof. G.
Loh’s Slides

Register Alias Table (RAT)
• Use a lookup table for renaming
• One entry per architectural register
• Each entry maps to the most recent version of the architectural register, which could be in
  – Physical register file
  – Architectural register file
• P6-style register renaming (also used by HP PA-8000 and PPC604): the RAT has entries for EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, each pointing into the ROB (40 entries, with Data and Status fields) or the RRF

RAT Example
Original        Renamed            RAT after: R0  R1   R2   R3  R4  R5   R6  R7    Free PRegs after
(initial)                                     -   -    -    -   -   -    -   -     T13, T14, T15, T16
R1 = R2 + R3    T13 = R2 + R3                 -   T13  -    -   -   -    -   -     T14, T15, T16
R5 = R4 – R1    T14 = R4 – T13                -   T13  -    -   -   T14  -   -     T15, T16
R1 = R1 * R5    T15 = T13 * T14               -   T15  -    -   -   T14  -   -     T16
R2 = R5 / R1    T16 = T14 / T15               -   T15  T16  -   -   T14  -   -     -
Adapted from Prof. G. Loh’s Slides

Superscalar Rename
  R1 = R2 + R3
  R4 = R5 – R7
  R3 = R0 / R2
  R5 = Ld 12[R6]
• Source tags are read from the RAT; destination tags come from the free register pool
• Don’t rename immediates
• For N-wide superscalar: 2N RAT read ports, N RAT write ports

Intra-Group Dependencies
  R2 = R2 + R3
  R4 = R5 – R7
  R3 = R0 / R2
  R5 = Ld 12[R6]
• Destination tags from the free register pool: T16, T39, T14, T5
• The RAT lookup for R2 in the third instruction returns the wrong (old) version of R2; it should use the version written by the first instruction in the same group

Intra-Group Dependencies
RAT before the group: R1 maps to T16, R2 maps to T34. Free register pool: T10, T31, T19, T6.
  Original          Result of sequential renaming
  R1 = R2 + R1      T10 = T34 + T16
  R2 = R1 – R2      T31 = T10 – T34
  R1 = R2 / R1      T19 = T31 / T10
  R1 = R2 >> R1     T6  = T31 >> T19
• RAT lookups alone return T34 and T16 for every source; the right-hand column shows the correct final renamed registers

Resolving Intra-Group Dependencies
• Each instruction (Inst 0 to Inst 3) supplies Src L, Src R, and Dest
• The RAT outputs and the destination tags from the free register pool (Pdst0, Pdst1, Pdst2) feed an intra-group dependency checker
• The checker produces the final source tags T0L, T0R, T1L, T1R, T2L, T2R, T3L, T3R
Adapted from Prof. G. Loh’s Slides

Intra-Group Dependency Checking
(Diagram: comparators ("=") match each source register src1L, src1R, src2L, src2R, src3L, src3R against the destination registers dst0, dst1, dst2 of older instructions in the group; on a match, the matching Pdst tag is selected instead of the RAT output to form T1L, T1R, T2L, T2R, T3L, T3R.)
Adapted from Prof. G.
Loh’s Slides

Mapping Selection
  R1 = R2 + R1
  R2 = R1 – R2
  R1 = R2 / R1
  R1 = R2 >> R1
• Condition: use a mapping if the instruction is the last writer to the register
• Priority encoder: use pdst0, pdst1, or pdst2 only if that instruction's destination differs (!=) from every later destination among dst0..dst3; always use pdst3
• Only the last mapping for R1 should be written into the RAT
Adapted from Prof. G. Loh’s Slides

Issue with Imprecise Interrupt
Left example (data page fault):
  lw  r5, 8(r10)
  add r10, r9, r8
  add r12, r10, r7
Right example (instruction page fault):
  L1: add r3, r1, r2
      add r4, r1, r4
      add r2, r4, r4
  Page layout: end of non-resident page X, then start of resident page X+1
• add instructions take one cycle
  – Load (left side) induces a “data page fault”
  – Add (right side) induces an “instruction page fault”
• If out-of-order completion is allowed
  – r10, r12 (or r2, r4) … will be modified
  – Wrong values will be used by the re-issued load
• Interrupt classes
  – Program interrupts (exceptions or traps)
  – External interrupts (asynchronous)

Precise Interrupts
• To reflect a sequential architecture model: serially correct (think about a single-issue, non-pipelined processor)
• Keep “Precise State” of an execution
  – All instructions before the interrupted instruction must be completed
  – The state should appear as if no instruction was issued after the interrupted instruction
  – The interrupted PC should be presented to the interrupt handler (restartable)
• Similar to branch misprediction handling
• Out-of-order execution makes the ordering hard
  – Undo what comes after an interrupt

Why Support Precise Interrupts
• Need to maintain a precise state (for recovery)
• Software debugging
• I/O or timer interrupts
• Virtual memory (page fault)
• Instruction emulation
• Virtual machines

Support Precise Interrupt
• Buffer results
• Can reconstruct the scenario (state) as sequential execution
• Restart from saved PC with saved PC state

Reorder Buffer (ROB) [Smith & Pleszkun ’85, ’88]
• Architecture Register File keeps “In-order state”
• Reorder Buffer (ROB)
  – A circular buffer
  – Contains all in-flight instructions
  – Buffers the “Lookahead state”
  – In-order allocation/deallocation with head/tail pointers
• When an exception occurs
  – Halt instruction issue
  – Revert to in-order state using RF and discard ROB results
• Also used for branch misprediction recovery
• Pentium Pro/II/III integrates the physical register file within the ROB
• Pentium 4 decouples the ROB and the physical register file

Reorder Buffer (with physical registers)
• Entry fields: V, Spec?, Done?, PC, Exp event, RegDst, Data (physical register)
• Head: oldest instruction
• Tail: next instruction to be allocated
• Sandy Bridge: 168-entry ROB

Handling Precise Interrupts
(ROB snapshots over successive slides. Initial ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4.)
• ROB holds xA000 R1=R1+10 (done, data 11), xA004 R2=R2*2 (not done), and xA008 FR1=FR2/0.0 (not done). The head entry is done, so R1 = 11 is written to the ARF.
• Head moves to xA004 R2=R2*2; xA00C R3=R3+1 is allocated at the tail.
• xA00C R3=R3+1 finishes out of order (data 4); xA010 R4=R4*2 is allocated.
• xA004 R2=R2*2 finishes (data 4), xA008 FR1=FR2/0.0 records Exp event 0010, xA010 R4=R4*2 finishes (data 8); xA014 FR4=FR4*2.0 is allocated.
• R2 = 4 is written to the ARF and xA004 retires. The ROB still holds xA008 (Exp event 0010), xA00C, xA010, and xA014.
ARF at this point: R1 = 11, R2 = 4, R3 = 3, R4 = 4.

Handling Precise Interrupts
• Head is now xA008 FR1=FR2/0.0 with Exp event 0010: exception detected
• The younger results in the ROB (R3=R3+1 with data 4, R4=R4*2 with data 8) were not committed into the RF; FR4=FR4*2.0 is not done
• Back up the “PC” and the current RF
• Depending on the exception, the process will either abort or execution will resume from the excepting instruction

Handling Speculative Execution
(ROB snapshots over successive slides. Initial ARF: R1 = 1, R2 = 2, R3 = 3, R4 = 4.)
• ROB holds xB000 R1=R1+10 and xB004 BEQ R1, R0, L1, neither done
• BEQ R1, R0, L1 is predicted TAKEN. Speculative entries (Spec? = 1) are allocated behind it: xC100 R2=R3 << 2 (done, data 32), xC104 R1=R2*R3 (not done), xD2AC BEQ R3, R0, L1 (not done), xD2B0 R1=R7+1 (done, data 28)
• R1=R1+10 retires (ARF R1 = 11). BEQ R1, R0, L1 is resolved, actually NOT TAKEN: misprediction
• Retire the branch and clear all entries after the mis-speculated branch
• The ROB now holds xB008 R2=R5 << 4 (RegDst R2, not done) at the tail
ARF: R1 = 11, R2 = 2, R3 = 3, R4 = 4.
Continue execution from the correct path (fall through in this case).

RAT Recovery
• ARF state corresponds to the state prior to the oldest non-committed instruction
• As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
• On a branch misprediction, wrong-path instructions are flushed from the machine
• The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
Adapted from Prof. G. Loh’s Slide

Solution: Stall and Drain
• Correct-path instructions from fetch can’t rename because the RAT is wrong
• Allow all instructions to execute and commit; the ARF corresponds to the last committed instruction
• The ARF now corresponds to the state right before the next instruction to be renamed (foo)
• Reset the RAT so that all mappings refer to the ARF
• Resume renaming the new correct-path instructions from fetch
• Pros: very simple to implement
• Cons: performance loss due to stalls

Another Solution: Checkpointing
• At each branch, make a copy of the RAT (register mapping at the time of the branch)
• Checkpoints are taken from a checkpoint free pool
• On a misprediction:
  1. flush wrong-path instructions
  2. deallocate RAT checkpoints
  3. recover RAT from checkpoint
  4. resume renaming

Modern Instruction Scheduler
• At dispatch, an instruction reads all available operands from the register files and stores a copy in the scheduler (Tomasulo’s algorithm)
(Diagram blocks: Fetch & Dispatch, Instruction Scheduler, PRF/ROB, Functional Units, Bypass.)
Adapted from Prof. G.
Loh’s Slide
(Diagram also shows the physical register update into the ARF.)
• Unavailable operands will be “captured” from the functional unit outputs (CDB broadcast)
• When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files (wakeup and select)

Instruction Scheduling: Wakeup and Select
• Wakeup logic
  – Notifies the resolution of data dependencies of input operands
  – Wakes up instructions with zero input dependencies
• Select logic
  – Chooses and fires ready instructions
  – Deals with structural hazards
• Wakeup-select is likely on the critical path
  – Associative match

Scalar Scheduler (Issue Width = 1)
(Diagram: each reservation station entry holds its source tags and compares them ("=") against the single tag broadcast bus; the select logic sends one ready entry to the execute logic, and its destination tag, T39 in the example, is broadcast.)
From Prof. G. Loh’s Slide

Superscalar Scheduler (Issue Width = 4)
(Diagram: the tag broadcast bus is 4 wide, Tag Broadcast Bus [3..0], so each source tag in each entry needs four comparators; the select logic sends ready entries to the execute logic. Snapshot of RS, only 4 entries shown.)
Adapted from Prof. G.
Loh’s Slide

Selection Logic
• Select ready instructions to be issued
• Goal: to reduce the height of the DFG
• Methods
  – Location-based (e.g., leftmost ready first)
    • Allows simple, faster hardware
  – Oldest ready first
    • Can use location-based (in-order issue) with “compaction”
    • Can be slow and complex

Simple Select Logic Implementation [Palacharla ISCA’97]
(Diagram, repeated over four slides: tree-like arbitrated selection logic over the reservation station. Each arbiter cell has Req0 to Req3 inputs, Grant0 to Grant3 outputs, an Enable input, and an AnyQueue output; the root cell's Enable is tied to 1. One slide expands a cell as a priority decoder from Req0..Req3 to Grt0..Grt3.)

Issue to Distinct Functional Units
• Distributed instruction windows (e.g., MIPS R10000 or Alpha 21264): one reservation station per group of functional units
• Faster to have separate instruction schedulers for different instruction types

Dual Issues to Multiple Units (e.g., 2 Adders) [Palacharla Dissertation]
(Diagram: two select blocks, each with Req0 to Req3 inputs and Grant0 to Grant3 outputs.)

Memory Disambiguation
• Can we “undo” stores?
• Stores cannot be committed to memory until they are marked ready to retire
• Completed stores are queued and wait in a store queue or store buffer
• Disambiguate (and resolve) memory dependencies dynamically

Memory Ordering
Source: Alpha 21264 HRM
• Load X bypassing Load X violates certain memory consistency models (e.g., sequential consistency)
• Load-load order trap replays

Load Store Queue (LSQ)
(Diagram: ALLOC, RS, ROB, and an age-ordered split LSQ made of a Store Queue and a Load Queue.)
• Memory instructions are allocated into the LSQ in program order
• LSQ manages memory reference ordering
• Unified LSQ vs. split LSQ
• Sandy Bridge: 64 load buffers, 36 store buffers

Issuing a Load for Execution
Store Queue (Issued?, age, address, data):
  1  1  A    00000001
  1  1  B    12340000
  0  1  C    FFFF1111
  0  2  ???  FFFFFF00
Load Queue (Issued?, age, address):
  1  1  A
  0  2  D
  0  2  C
(Diagram label: issued to memory for execution.)
• Each load checks against older stores
  – Associative search
  – Scalability is a performance issue

Issuing a Load for Execution
(Same queues; load D, age 2, is now marked issued. Diagram label: store-to-load forwarding.)
• Implementation dependent: comprehensive size matching can be prohibitively expensive
• Simple method: forward when a larger store (word) precedes a smaller load (half)

Issuing a Load for Execution
(Same queues; load C, age 2, is now marked issued, and load K, age 3, has been added unissued. Diagram label: speculatively issue for execution.)
• Can speculatively issue loads to shorten latency (Alpha 21264, Pentium 4 (Prescott))
  – Naively
  – Use a memory dependency predictor
• A store, when its address is ready, checks newer loads in the Load Queue
• “Replay” is needed if speculation turns out to be incorrect (e.g., Alpha’s store-load replay)

Store Checks Premature Loads
Store Queue (Issued?, age, address, data):
  1  1  A  00000001
  1  1  B  12340000
  1  1  C  FFFF1111
  0  2  K  FFFFFF00
Load Queue (Issued?, age, address):
  1  1  A
  1  2  D
  1  2  C
  1  3  K
  1  3  M
  1  4  P
Conflict detected! Replay the load.
• A store, when its address is ready, checks newer loads in the Load Queue
  – Associative search
• “Replay” is needed if speculation turns out to be incorrect (e.g., Alpha’s store-load replay)

Issuing a Store for Execution
Store Queue (Issued?, age, address, data):
  1  4  A  11000000
  0  6  A  0F0F0F0F
  0  6  C  00000002
Load Queue (Issued?, age, address):
  1  4  A
  0  5  D
  0  5  C
  0  6  K
(Diagram label: issued to memory.)
• The diagram shows the basic concept
• Implementation dependent
  – Do not allow a store to bypass a load, since doing so has little impact on performance
  – Perform associative search

Issuing a Store for Execution
(Same queue contents; diagram label: cannot issue for execution.)

Load-Load Ordering
Load Queue (Issued?, age, address):
  0  4  A
  1  5  D
  1  5  C
  1  6  A
  1  6  M
  1  6  N
  0  7  K
(Diagram label: load-load trap.)
• Needed for
  – Multiprocessor support
  – Maintaining the memory consistency model
• Load-load trap invoked
  – Trap on the later, conflicting instructions
  – Replay
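
The two LSQ checks described in the slides above (a load searching older stores, and a store searching newer loads once its address is known) can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not how any particular processor implements its LSQ: the names `StoreEntry`, `LoadEntry`, `check_load`, and `loads_to_replay` are hypothetical, every memory instruction is assumed to carry a unique program-order sequence number `seq` (the slides use a coarser "age"), and all accesses are assumed to be the same size, so address equality is the only match condition.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    seq: int                 # hypothetical program-order number (smaller = older)
    address: Optional[str]   # None while the address is unresolved ("???" on the slides)
    data: int


@dataclass
class LoadEntry:
    seq: int
    address: str
    issued: bool = False


def check_load(load: LoadEntry, store_queue: List[StoreEntry]) -> Tuple[str, Optional[int]]:
    """Search older stores, youngest first (the associative search on the slides).

    Returns ("forward", data) when the youngest older store to the same address
    is found, ("unknown", None) when an unresolved older store is reached first
    (the load must wait or issue speculatively), and ("memory", None) otherwise.
    """
    older = [s for s in store_queue if s.seq < load.seq]
    for store in sorted(older, key=lambda s: s.seq, reverse=True):
        if store.address is None:
            return ("unknown", None)
        if store.address == load.address:
            return ("forward", store.data)
    return ("memory", None)


def loads_to_replay(store: StoreEntry, load_queue: List[LoadEntry]) -> List[LoadEntry]:
    """When a store's address resolves, find newer loads that already issued to it."""
    return [
        load for load in load_queue
        if load.issued and load.seq > store.seq and load.address == store.address
    ]


# Illustrative queue contents (made up for this sketch, loosely following the slides).
store_queue = [
    StoreEntry(seq=1, address="A", data=0x00000001),
    StoreEntry(seq=2, address="B", data=0x12340000),
    StoreEntry(seq=5, address=None, data=0xFFFFFF00),
]

print(check_load(LoadEntry(seq=3, address="A"), store_queue))  # forwards from store seq 1
print(check_load(LoadEntry(seq=4, address="D"), store_queue))  # no older match: read memory
print(check_load(LoadEntry(seq=6, address="K"), store_queue))  # older store address unknown

# The load to K issues speculatively; later the store's address resolves to K.
speculative_load = LoadEntry(seq=6, address="K", issued=True)
store_queue[2].address = "K"
print(loads_to_replay(store_queue[2], [speculative_load, LoadEntry(7, "M", True)]))
```

Scanning from the youngest older store toward the oldest means a matching store found before any unresolved one can forward safely, while an unresolved store found first leaves the outcome unknown. That second case is where the slides' naive speculation or memory dependency predictor would decide whether to issue the load anyway.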