Computer Structure
The P6 Micro-Architecture
An Example of an Out-Of-Order Micro-processor
Lihu Rappoport and Adi Yoaz
Computer Structure 2014 – P6 uArch

The P6 Family
Features
  – Out Of Order execution
  – Register renaming
  – Speculative Execution
  – Multiple Branch prediction
  – Super pipeline: 12 pipe stages

Processor     Year  Freq (MHz)  Bus (MHz)  L2 cache   Process
Pentium® Pro  1995  150~200     60/66      256/512K*  0.5μ, 0.35μ
Pentium® II   1997  233~450     66/100     512K*      0.35μ, 0.25μ
Pentium® III  1999  450~1400    100/133    256/512K   0.25μ, 0.18μ, 0.13μ
Pentium® M    2003  900~2260    400/533    1M / 2M    0.13μ, 90nm
Core™         2005  1660~2330   667        2M         65nm
Core™ 2       2006  1800~2930   800/1066   2/4/8M     65nm, 45nm
*off die

P6 Arch
[Figure: P6 block diagram with External Bus, BIU, L2, IFU, BPU, ID, MS, RAT, RS, ROB, IEU, FEU, AGU, MIU, DCU, MOB]
In-Order Front End
  – BIU: Bus Interface Unit
  – IFU: Instruction Fetch Unit (includes IC)
  – BPU: Branch Prediction Unit
  – ID: Instruction Decoder
  – MS: Micro-Instruction Sequencer
  – RAT: Register Alias Table
Out-of-order Core
  – ROB: Reorder Buffer
  – RRF: Real Register File
  – RS: Reservation Stations
  – IEU: Integer Execution Unit
  – FEU: Floating-point Execution Unit
  – AGU: Address Generation Unit
  – MIU: Memory Interface Unit
  – DCU: Data Cache Unit
  – MOB: Memory Order Buffer
  – L2: Level 2 cache
In-Order Retire

P6 Pipeline
[Figure: pipeline stage diagram: Next IP, Icache, Decode, Reg Ren, RS Wr, RS disp, Ex, Retirement]
In-Order Front End + rename/alloc
  I1: Next IP
  I2: ICache lookup
  I3: ILD (instruction length decode)
  I4: Steer the instruction bytes to the decoders
  I5: ID1 – decode the instructions
  I6: ID2 – decode the instructions
  I7: RAT – rename sources, ALLOC – assign destinations
  I8: ROB – read sources, RS – schedule data-ready uops for dispatch
Out-of-order Core
  O1: RS – dispatch uops
  O2: EX
In-order Retirement
  R1: Retirement
  R2: Retirement

In-Order Front End
[Figure: front-end blocks Next IP Mux, BPU, IFU, ILD, IQ, ID, MS, IDQ; flows labeled bytes, instructions, uops]
BPU –
Branch Prediction Unit – predict next fetch address
IFU – Instruction Fetch Unit
  – iTLB translates virtual to physical address (access PMH on miss)
  – IC supplies 16 bytes/cycle (access L2 cache on miss)
ILD – Instruction Length Decode – split bytes into instructions
IQ – Instruction Queue – buffer the instructions
ID – Instruction Decode – decode instructions into uops
MS – Micro-Sequencer – provides uops for complex instructions

Branch Prediction
Need to provide predictions for the entire fetch line each cycle
  – Predict the first taken branch in the line, following the fetch IP
[Figure: a fetch line containing several jmp instructions, annotated "Jump into the fetch line", "Jump out of the line", and per-branch predictions (predict taken / predict not taken)]
Implemented by
  – Splitting IP into offset within line, set, and tag
  – If the tag of more than one way matches the fetch IP
      The offsets of the matching ways are ordered
      Ways with offset smaller than the fetch IP offset are discarded
      The first branch that is predicted taken is chosen as the predicted branch

The P6 BTB
512 entries in 128 sets × 4 ways
  – Up to 4 branches can have a tag match
Each entry holds a branch target and a 4-bit local branch history
  – The 4 histories in each set all point to a shared 16 entry 2-bit counter array
[Figure: BTB organization. Entry fields: V, Tag, ofst, T (branch type), Target, Hist, P (prediction bit); per-set LRR and counters; Pred = msb of counter; Return Stack Buffer]
Branch Type: 00 – cond, 01 – ret, 10 – call, 11 – uncond

In-Order Front End: Decoder
16 instruction bytes from IFU
Instruction Length Decode: determine where each IA instruction starts
IQ: buffer instructions
Decoders D0–D3: convert instructions into uops (D0: 4 uops; D1, D2, D3: 1 uop each)
  • If an inst aligned with dec1/2/3 decodes into >1 uops, defer it to the next cycle
IDQ: buffers uops
  • Smooths the decoder's variable throughput

Micro Operations (Uops)
Each CISC inst is broken into one
or more RISC uops
  – Each uop is (relatively) simple
  – Canonical representation of src/dest (2 src, 1 dest)
  – Increased ILP
      e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]
Simple instructions translate to a few uops
  – Typical uop count (it is not necessarily cycle count!)
      Reg-Reg ALU/Mov inst:         1 uop
      Mem-Reg Mov (load):           1 uop
      Mem-Reg ALU (load + op):      2 uops
      Reg-Mem Mov (store):          2 uops (st addr, st data)
      Reg-Mem ALU (ld + op + st):   4 uops
Complex instructions need ucode

Out-of-order Core
[Figure: P6 block diagram with the out-of-order core units highlighted]
Reorder Buffer (ROB):
  – Holds "not yet retired" instructions
  – 40 entries, in-order
Reservation Stations (RS):
  – Holds "not yet executed" instructions
  – 20 entries
Execution Units
  – IEU: Integer Execution Unit
  – FEU: Floating-point Execution Unit
Memory related units
  – AGU: Address Generation Unit
  – MIU: Memory Interface Unit
  – DCU: Data Cache Unit
  – MOB: Orders memory operations
  – L2: Level 2 cache

Alloc & RAT
Perform register allocation and renaming for ≤4 uops/cyc
The Register Alias Table (RAT)
  – Maps architectural registers into physical registers
      For each arch reg, holds the number of the latest phy reg that updates it
  – When a new uop that writes to an arch reg R is allocated
      Record the phy reg allocated to the uop as the latest reg that updates R

      Arch reg  #reg  Location
      EAX       0     RRF
      EBX       19    ROB
      ECX       23    ROB

The Allocator (Alloc)
  – Assigns each uop an entry number in the ROB / RS
  – For each one of the sources (architectural registers) of the uop
      Look up the RAT to find the latest phy reg updating it
      Write it into the RS entry
  – Allocates Load & Store buffers in the MOB

Re-order Buffer (ROB)
Holds 40 uops which are not yet committed
  – In the same order as in the program
Provides a large physical register space for register renaming
  – One physical register per ROB entry
      physical register number = entry number
      Each
uop has only one destination
Buffers the execution results until retirement
  – Data valid is set after the uop has executed and its result is written to the physical reg

      #entry  Entry Valid  Data Valid  Physical Reg Data  Architectural dest. reg
      0       1            1           12H                EBX
      1       1            1           33H                ECX
      2       1            0           xxx                ESI
      39      0            0           xxx                XXX

RRF – Real Register File
Holds the Architectural Register File
  – Architectural registers are numbered: 0 – EAX, 1 – EBX, …
The value of an architectural register
  – is the value written to it by the last committed instruction that writes to this register
RRF:
      #entry    Arch Reg Data
      0 (EAX)   9AH
      1 (EBX)   F34H

Uop flow through the ROB
Uops are entered in order
  – Registers renamed by the entry number
Once assigned: execution order unimportant
After execution
  – Entries are marked "executed" and wait for retirement
  – An executed entry can be "retired" once all prior instructions have retired
  – Commit architectural state only after speculation (branch, exception) has resolved
Retirement
  – Detect exceptions and mispredictions
      Initiate repair to get the machine back on the right track
  – Update "real registers" with the value of the renamed registers
  – Update memory
  – Leave the ROB

Reservation station (RS)
Pool of all "not yet executed" uops
  – Holds the uop attributes and the uop source data until it is dispatched
When a uop is allocated in the RS, operand values are updated
  – If the operand is from an architectural register, the value is taken from the RRF
  – If the operand is from a phy reg with data valid set, the value is taken from the ROB
  – If the operand is from a phy reg with data valid not set, wait for the value
The RS maintains operand status "ready/not-ready"
  – Each cycle, executed uops make more operands "ready"
      The RS arbitrates the WB busses between the units
      The RS monitors the WB bus to capture data needed by waiting uops
      Data can be bypassed directly from the WB bus to an execution unit
  – Uops whose operands are all ready can be dispatched
for execution
      Dispatcher chooses which of the ready uops to execute next
      Dispatches chosen uops to functional units

Register Renaming example
Uop in IDQ:        Add EAX, EBX, EAX
After RAT / Alloc: Add ROB37, ROB19, RRF0

RAT:
      Arch reg  Before   After
      EAX       0  RRF   37 ROB
      EBX       19 ROB   19 ROB
      ECX       23 ROB   23 ROB

ROB before:
      #   Entry Valid  Data Valid  Data  DST
      19  V            V           12H   EBX
      23  V            V           33H   ECX
      37  I            x           xxx   XXX
      38  I            x           xxx   XXX

ROB after:
      #   Entry Valid  Data Valid  Data  DST
      19  V            V           12H   EBX
      23  V            V           33H   ECX
      37  V            I           xxx   EAX
      38  I            x           xxx   XXX

RRF:
      0  EAX  97H

RS:
      uop  v  src1  v  src2  Pdst
      add  1  97H   1  12H   37

Register Renaming example (2)
Uop in IDQ:        sub EAX, ECX, EAX
After RAT / Alloc: sub ROB38, ROB23, ROB37

RAT:
      Arch reg  Before   After
      EAX       37 ROB   38 ROB
      EBX       19 ROB   19 ROB
      ECX       23 ROB   23 ROB

ROB after:
      #   Entry Valid  Data Valid  Data  DST
      19  V            V           12H   EBX
      23  V            V           33H   ECX
      37  V            I           xxx   EAX
      38  V            I           xxx   EAX

RRF:
      0  EAX  97H

RS:
      uop  v  src1   v  src2  Pdst
      add  1  97H    1  12H   37
      sub  0  rob37  1  33H   38

Out-of-order Core: Execution Units
[Figure: RS dispatch ports and execution units. Ports 0 and 1: SHF, FMU, FDIV, IDIV, FAU, IEU, JEU, IEU; Port 2: AGU (load address); Ports 3,4: AGU (store address), SDB; MIU and DCU]
  – Internal 0-delay bypass within each EU
  – 1st bypass in MIU
  – 2nd bypass in RS

In-Order Retire
[Figure: P6 block diagram with the ROB highlighted]
ROB:
  – Retires up to 4 uops per clock
  – Copies the values to the RRF
  – Retirement is done in order
  – Performs exception checking

In-order Retirement
The process of committing the results to the architectural state of the processor
  Retire up to 4 uops per clock
  Copy the values to the RRF
  Retirement is done in order
  Perform exception checking
An instruction is retired after the following checks
  – Instruction has executed
  – All previous instructions have retired
  – Instruction isn't mispredicted
  – No exceptions

Pipeline: Fetch
Predict/Fetch Decode IQ
Alloc IDQ Schedule EX RS Retire ROB
  Fetch 16B from I$
  Length-decode instructions within 16B
  Write instructions into IQ

Pipeline: Decode
  Read 4 instructions from IQ
  Translate instructions into uops
    – Asymmetric decoders (4-1-1-1)
  Write resulting uops into IDQ

Pipeline: Allocate
  Allocate, port bind and rename 4 uops
  Allocate ROB/RS entry per uop
    – If source data is available from ROB or RRF, write data to RS
    – Otherwise, mark data not ready in RS

Pipeline: EXE
  Ready/Schedule
    – Check for data-ready uops whose needed functional unit is available
    – Select and dispatch ≤6 ready uops/clock to EXE
    – Reclaim RS entries
  Write back results into RS/ROB
    – Write results into result buffer
    – Snoop write-back ports for results that are sources to uops in RS
    – Update data-ready status of these uops in the RS

Pipeline: Retire
  Retire ≤4 oldest uops in ROB
    – A uop may retire if
        its ready bit is set
        it does not cause an exception
        all preceding candidates are eligible for retirement
    – Commit results from result buffer to RRF
    – Reclaim ROB entry
  In case of exception
    – Nuke and restart

Jump Misprediction – Flush at Execute
When the JEU detects a jump misprediction it
  – Flushes the in-order front-end
  – Instructions already in the OOO part continue to execute
      Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
  – Starts fetching and decoding from the "correct" path
      The "correct" path may still be wrong
      A preceding uop that hasn't executed may cause an exception
      A preceding jump executed OOO can also mispredict
  –
The "correct" instruction stream is stalled at the RAT
      The RAT was also wrongly updated by wrong-path instructions
When the mispredicted branch retires
  – Resets all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.)
      Only instructions following the jump are left; they must all be flushed
      Reset the RAT to point only to architectural registers
  – Un-stalls the in-order machine
  – RS gets uops from RAT and starts scheduling and dispatching them

Pipeline: Branch gets to EXE
[Figure: pipeline diagram (Fetch, Decode, IQ, IDQ, Alloc, Schedule, RS, JEU, ROB, Retire) with the branch at the JEU]

Pipeline: Mispredicted Branch EXE
  Flush front-end and re-steer it to the correct path
  RAT state already updated by wrong path
    – Block further allocation
  Update BPU
  OOO not flushed: instructions already in the OOO continue to execute
    – Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
  Block younger branches from clearing

Pipeline: Mispredicted Branch Retires
  When the mispredicted branch retires
    – Flush OOO
      Only instructions following the jump are left; they must all be flushed
      Resets all state in the OOO (RAT, RS, ROB, MOB, etc.)
      Reset the RAT to point only to architectural registers
    – Allow allocation of uops from correct path

Instant Reclamation
Allow a faster recovery after jump misprediction
  – Allow execution/allocation of uops from the correct path before the mispredicted jump retires
Every few cycles take a checkpoint of the RAT
In case of misprediction
  – Flush the frontend and re-steer it to the correct path
  – Recover RAT to the latest checkpoint taken prior to the misprediction
  – Recover RAT to its exact state at the misprediction
      Rename 4 uops/cycle from the checkpoint until the branch
  – Flush all uops younger than the branch in the OOO

Instant Reclamation: Mispredicted Branch EXE
JEClear raised on mispredicted macro-branches
  – Flush frontend and re-steer it to the correct path
  – Flush all younger uops in OOO
  – Update BPU
  – Block further allocation

Pipeline: Instant Reclamation: EXE
  Restore RAT from the latest check-point before the branch
  Recover RAT to its state just after the branch
    – Before any instruction on the wrong path
  Meanwhile the front-end starts fetching and decoding instructions from the correct path
  Once done restoring the RAT
    – Allow allocation of uops from correct path

Large ROB and RS are Important
Large RS
  – Increases the window in which to look for independent instructions
      Exposes more parallelism potential
Large ROB
  – The ROB is a superset of the RS, so ROB size ≥ RS size
  – Allows covering
long latency operations (cache miss, divide)
Example
  – Assume there is a Load that misses the L1 cache
      Data takes ~10 cycles to return, so ~30 new instrs get into the pipeline
  – Instructions following the Load cannot commit, so they pile up in the ROB
  – Instructions independent of the load are executed, and leave the RS
      As long as the ROB is not full, we can keep executing instructions
  – A 40 entry ROB can cover for an L1 cache miss
      Cannot cover for an LLC cache miss, which is hundreds of cycles

OOO Execution of Memory Operations

P6 Caches
Blocking caches severely hurt OOO
  – A cache miss prevents other cache requests (which could possibly be hits) from being served
  – Hurts one of the main gains from OOO: hiding cache misses
Both L1 and L2 caches in the P6 are non-blocking
  – Initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests
  – Support up to 4 outstanding misses
      Misses translate into outstanding requests on the P6 bus
      The bus can support up to 8 outstanding requests
  – Squash subsequent requests for the same missed cache line
      Squashed requests are not counted in the number of outstanding requests
  – Once the engine has executed beyond the 4 outstanding requests, subsequent load requests are placed in the load buffer

OOO Execution of Memory Operations
The RS operates based on register dependencies
  – RS cannot detect memory dependencies
      movl -4(%ebp), %ebx    # ebx ← MEM[ebp-4]
      movl %eax, -4(%ebp)    # MEM[ebp-4] ← eax
  – RS dispatches memory uops when data for address calculation is ready, and the MOB and Address Generation Unit (AGU) are free
  – AGU computes the linear address
      Segment-Base + Base-Address + (Scale*Index) + Displacement
      Sends linear address to MOB, to be stored in Load Buffer or Store Buffer
MOB resolves memory dependencies and enforces memory ordering
  – Some memory dependencies can be resolved statically
      store
r1,a; load r2,b    (can advance load before store)
  – Problem: some cannot
      store r1,[r3]; load r2,b    (load must wait till r3 is known)

Load and Store Ordering
x86 has a small register set, so it uses memory often
  – Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
      P6 chooses not to allow Stores to pass Stores/Loads
  – Preventing Loads from passing Loads/Stores: big perf. loss
      P6 allows Loads to pass Stores, and Loads to pass Loads
Stores are not executed OOO
  – Stores are never performed speculatively, since there is no transparent way to undo them
  – Stores are also never re-ordered among themselves
      The Store Buffer dispatches a store only when the store has both its address and its data, and there are no older stores awaiting dispatch
  – Store commits its write to memory (DCU) at retirement

Store Implemented as 2 Uops
Store decoded as two independent uops
  – STA (store-address): calculates the address of the store
  – STD (store-data): stores the data into the Store Data buffer
      The actual write to memory is done when the store retires
Separating STA & STD is important for memory OOO
  – Allows STA to dispatch earlier, even before the data is known
  – Address conflicts are resolved earlier, which opens the memory pipeline for other loads
STA and STD can be issued to execution units in parallel
  – STA dispatched to AGU when its sources (base+index) are ready
  – STD dispatched to SDB when its source operand is available

Memory Order Buffer (MOB)
Store Coloring
  – Each Store is allocated in-order in the Store Buffer, and gets an SBID
  – Each Load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID
Load is checked against all previous stores
  – Stores with SBID ≤ load's SBID
Load blocked if
  – Unresolved address of a relevant STA
  – STA to same address, but data not ready
  – Missing resources (DTLB miss, DCU miss)
MOB writes blocking info into load buffer
  – Re-dispatches load when wake-up signal
received
If a Load is not blocked, it is executed (bypassed)

      Uop    LBID  SBID
      Store  -     0
      Store  -     1
      Load   0     1
      Store  -     2
      Load   1     2
      Load   2     2
      Load   3     2
      Store  -     3
      Load   4     3

MOB (Cont.)
If a Load misses in the DCU
  – The DCU marks the write-back data as invalid
  – Assigns a fill buffer to the load, and issues an L2 request
  – When the critical chunk is returned, wake up and re-dispatch the load
Store → Load Forwarding
  – Older STA with same address as load and data ready
      Load gets its data directly from the SB (no DCU access)
Memory Disambiguation
  – MOB predicts if a load can proceed despite unknown STAs
      Predict colliding: block Load if there is an unknown STA (as usual)
      Predict non colliding: execute even if there are unknown STAs
  – In case of wrong prediction
      The entire pipeline is flushed when the load retires

Pipeline: Load: Allocate
[Figure: load pipeline diagram with stages Alloc, Schedule, AGU, LB Write, DTLB, DCU, WB, Retire and structures IDQ, RS, ROB, MOB, LB]
  Allocate ROB/RS, MOB entries
  Assign Store Buffer ID (SBID) to enable ordering

Pipeline: Bypassed Load: EXE
  RS checks when data used for address calculation is ready
  AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp.
  Write load into Load Buffer
  DTLB Virtual → Physical + DCU set access
  MOB checks blocking and forwarding
  DCU read / Store Data Buffer read (Store → Load forwarding)
  Write back data / write block code

Pipeline: Blocked Load Re-dispatch
  MOB determines which loads are ready, and schedules one
  Load arbitrates for MEU
  DTLB Virtual → Physical + DCU set access
  MOB checks blocking/forwarding
  DCU way select / Store Data Buffer read
  Write back data / write block code

Pipeline: Load: Retire
  Reclaim ROB, LB entries
  Commit results to RRF

Pipeline: Store: Allocate
[Figure: store pipeline diagram with stages Alloc, Schedule, AGU, DTLB, Retire and structures IDQ, RS, ROB, SB]
  Allocate ROB/RS
  Allocate Store Buffer entry

Pipeline: Store: STA EXE
  RS checks when data used for address calculation is ready
    – dispatches STA to AGU
  AGU calculates linear address
  Write linear address (V.A.) to Store Buffer
  DTLB Virtual → Physical
  Load Buffer Memory Disambiguation verification
  Write physical address (P.A.) to Store Buffer

Pipeline: Store: STD EXE
  RS checks when data for STD is ready
    – dispatches STD
  Write data to Store Buffer

Pipeline: Senior Store Retirement
  When STA (and thus STD) retires
    – Store Buffer entry marked as senior
  When DCU idle, MOB dispatches senior store
  Read senior entry
    – Store Buffer sends data and physical address
  DCU writes data
  Reclaim SB entry
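
The store coloring, load blocking, Store → Load forwarding and senior store steps described in the MOB and store pipeline slides can be sketched as a small toy model. This is only an illustrative sketch under simplifying assumptions, not the P6 implementation: the names `ToyMOB`, `StoreEntry` and all method names are hypothetical, memory disambiguation prediction is left out (a load is always blocked by an unknown older STA), and SBID wrap-around, partial address overlap and DTLB/DCU misses are ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoreEntry:
    sbid: int
    addr: Optional[int] = None  # written by the STA uop
    data: Optional[int] = None  # written by the STD uop
    senior: bool = False        # set once the store has retired


class ToyMOB:
    """Hypothetical toy model of store coloring; not the real P6 logic."""

    def __init__(self):
        self.store_buffer = []  # kept in program order, oldest first
        self.memory = {}        # stands in for the DCU
        self.next_sbid = 0

    def alloc_store(self):
        self.store_buffer.append(StoreEntry(self.next_sbid))
        self.next_sbid += 1
        return self.next_sbid - 1

    def alloc_load(self):
        # Color the load with the SBID of the youngest older store (-1 if none).
        return self.next_sbid - 1

    def _entry(self, sbid):
        return next(s for s in self.store_buffer if s.sbid == sbid)

    def sta(self, sbid, addr):
        self._entry(sbid).addr = addr

    def std(self, sbid, data):
        self._entry(sbid).data = data

    def exec_load(self, color, addr):
        # Check the load against older stores (SBID <= color), youngest first.
        older = [s for s in self.store_buffer if s.sbid <= color]
        for st in reversed(older):
            if st.addr is None:
                return ("blocked", "unknown STA")
            if st.addr == addr:
                if st.data is None:
                    return ("blocked", "data not ready")
                return ("forwarded", st.data)
        return ("dcu", self.memory.get(addr, 0))

    def retire_store(self, sbid):
        st = self._entry(sbid)
        assert st.addr is not None and st.data is not None
        st.senior = True

    def drain_senior(self):
        # If the oldest store is senior, write it to memory and reclaim its entry.
        if self.store_buffer and self.store_buffer[0].senior:
            st = self.store_buffer.pop(0)
            self.memory[st.addr] = st.data
            return True
        return False


mob = ToyMOB()
s0, s1 = mob.alloc_store(), mob.alloc_store()
color = mob.alloc_load()              # load is colored with SBID 1
first = mob.exec_load(color, 0x100)   # Store 1 address still unknown
mob.sta(s1, 0x200)
mob.sta(s0, 0x100)
second = mob.exec_load(color, 0x100)  # Store 0 matches, data missing
mob.std(s0, 7)
third = mob.exec_load(color, 0x100)   # data comes from the Store Buffer
```

The usage lines mirror the first three rows of the store coloring table (Store 0, Store 1, then a load colored with SBID 1): the load is blocked twice and then gets its data forwarded once the matching store has both address and data.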