L10-11_P6_2014

Computer Structure
The P6 Micro-Architecture
An Example of an Out-Of-Order Micro-processor
Lihu Rappoport and Adi Yoaz
Computer Structure 2014 – P6 uArch
The P6 Family
 Features
– Out Of Order execution
– Register renaming
– Speculative Execution
– Multiple Branch prediction
– Super pipeline: 12 pipe stages

Processor      Year   Freq (MHz)   Bus (MHz)   L2 cache     Process
Pentium® Pro   1995   150~200      60/66       256/512K*    0.5μ, 0.35μ
Pentium® II    1997   233~450      66/100      512K*        0.35μ, 0.25μ
Pentium® III   1999   450~1400     100/133     256/512K     0.25μ, 0.18μ, 0.13μ
Pentium® M     2003   900~2260     400/533     1M / 2M      0.13μ, 90nm
Core™          2005   1660~2330    667         2M           65nm
Core™ 2        2006   1800~2930    800/1066    2/4/8M       65nm, 45nm
* off die
P6 Arch
External
Bus
 In-Order Front End
L2
MOB
BIU
DCU
MIU
IFU
AGU
BPU
I
D
MS
RAT
3
R
S
IEU
FEU
ROB
–
–
–
–
–
–
BIU: Bus Interface Unit
IFU: Instruction Fetch Unit (includes IC)
BPU: Branch Prediction Unit
ID: Instruction Decoder
MS: Micro-Instruction Sequencer
RAT: Register Alias Table
 Out-of-order Core
–
–
–
–
–
–
–
–
–
–
ROB: Reorder Buffer
RRF: Real Register File
RS: Reservation Stations
IEU: Integer Execution Unit
FEU: Floating-point Execution Unit
AGU: Address Generation Unit
MIU: Memory Interface Unit
DCU: Data Cache Unit
MOB: Memory Order Buffer
L2: Level 2 cache
 In-Order Retire
Computer Structure 2014 – P6 uArch
P6 Pipeline
[Figure: pipeline diagram. In-Order Front End + rename/alloc (I1 to I8), Out-of-order Core (O1, O2), In-order Retirement (R1, R2)]
I1: Next IP
I2: ICache lookup
I3: ILD (instruction length decode)
I4: Steer the instruction bytes to the decoders
I5: ID1 – decode the instructions
I6: ID2 – decode the instructions
I7: RAT – rename sources, ALLOC – assign destinations
I8: ROB – read sources, RS – schedule data-ready uops for dispatch
O1: RS – dispatch uops
O2: EX
R1: Retirement
R2: Retirement
In-Order Front End
[Figure: front-end blocks: Next IP mux, BPU, IFU, ILD, IQ, ID, MS, IDQ; data flows as bytes, then instructions, then uops]
 BPU – Branch Prediction Unit – predicts the next fetch address
 IFU – Instruction Fetch Unit
– iTLB translates virtual to physical address (accesses the PMH on a miss)
– IC supplies 16 bytes/cycle (accesses the L2 cache on a miss)
 ILD – Instruction Length Decode – splits bytes into instructions
 IQ – Instruction Queue – buffers the instructions
 ID – Instruction Decode – decodes instructions into uops
 MS – Micro-Sequencer – provides uops for complex instructions
Branch Prediction
 Need to provide predictions for the entire fetch line each cycle
– Predict the first taken branch in the line, following the fetch IP
[Figure: a fetch line holding several jmp instructions. A jump into the fetch line sets the fetch IP; each branch is predicted taken or not taken; the first predicted-taken branch after the fetch IP jumps out of the line]
 Implemented by
– Splitting IP into offset within line, set, and tag
– If the tag of more than one way matches the fetch IP
   • The offsets of the matching ways are ordered
   • Ways with offset smaller than the fetch IP offset are discarded
   • The first branch that is predicted taken is chosen as the predicted branch
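The selection rule on the Branch Prediction slide (order the matching ways by offset, discard those before the fetch IP offset, pick the first predicted-taken branch) can be sketched as below. This is a minimal illustration, not the P6 logic itself: the `BtbWay` record and the `predict_in_line` function are hypothetical names, and the taken/not-taken prediction is assumed to be already available per way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BtbWay:
    tag: int               # tag stored in this way (hypothetical field layout)
    offset: int            # offset of the branch within the fetch line
    target: int            # predicted branch target
    predicted_taken: bool  # direction prediction, assumed precomputed


def predict_in_line(ways: List[BtbWay], fetch_tag: int,
                    fetch_offset: int) -> Optional[BtbWay]:
    """Return the first predicted-taken branch at or after the fetch offset."""
    # Keep only ways whose tag matches the fetch IP, ordered by offset.
    matching = sorted((w for w in ways if w.tag == fetch_tag),
                      key=lambda w: w.offset)
    for way in matching:
        if way.offset < fetch_offset:
            continue  # branch lies before the point we jumped into the line
        if way.predicted_taken:
            return way
    return None  # no taken branch: fetch continues sequentially
```

With four ways in a set, a jump into the middle of the line skips any branch located before the entry point, even if that branch is predicted taken.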
The P6 BTB
 512 entries in 128 sets × 4 ways
– Up to 4 branches can have a tag match
 Each entry holds a branch target and a 4-bit local branch history
– The 4 histories in each set all point to a shared 16 entry 2-bit counter array
[Figure: BTB organization. 128 sets indexed by the IP; each way entry holds the fields V, Tag, ofst, T (branch type), Target, Hist and P (prediction bit). Branch type encoding: 00 cond, 01 ret, 10 call, 11 uncond. Each set has per-set LRR bits and counters; Pred = msb of counter. A Return Stack Buffer is shown alongside the BTB.]
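The BTB slide describes a 4-bit local history that selects one of 16 two-bit counters, with the prediction taken from the counter's most significant bit. The sketch below models that mechanism for a single branch. It is a simplification under stated assumptions: the class name `LocalPredictor` is hypothetical, the counter array is private to one branch (on the slide it is shared by the 4 ways of a set), and the initial counter value is arbitrary.

```python
class LocalPredictor:
    """4-bit local history indexing 16 two-bit saturating counters."""

    def __init__(self) -> None:
        self.history = 0           # last 4 outcomes, newest in bit 0
        self.counters = [1] * 16   # initial value 1 is an assumption

    def predict(self) -> bool:
        # Prediction is the msb of the 2-bit counter selected by the history.
        return (self.counters[self.history] >> 1) == 1

    def update(self, taken: bool) -> None:
        c = self.counters[self.history]
        # Saturating increment on taken, decrement on not taken.
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into the 4-bit history.
        self.history = ((self.history << 1) | int(taken)) & 0xF
```

Because each history pattern has its own counter, a short repeating pattern such as taken, taken, not-taken becomes fully predictable once every pattern has been seen a couple of times.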
In-Order Front End: Decoder
 Instruction Length Decode
– Receives 16 instruction bytes from the IFU
– Determines where each IA instruction starts
 IQ
– Buffers instructions
 Decoders D0 to D3
– Convert instructions into uops
– D0 produces up to 4 uops; D1, D2 and D3 produce 1 uop each
• If an instruction aligned with D1/D2/D3 decodes into >1 uops, defer it to the next cycle
 IDQ
– Buffers uops
• Smooths the decoders’ variable throughput
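The deferral rule on the Decoder slide (only D0 handles multi-uop instructions; D1 to D3 handle single-uop instructions) determines how many cycles a given instruction sequence needs. The sketch below counts decode cycles under simplifying assumptions: the function name `decode_cycles` is hypothetical, every instruction decodes into 1 to 4 uops (microcoded instructions are left out), and the IQ always has the next instructions available.

```python
from typing import Sequence


def decode_cycles(uop_counts: Sequence[int]) -> int:
    """Cycles needed by a 4-1-1-1 decoder for instructions with these uop counts."""
    if any(not 1 <= n <= 4 for n in uop_counts):
        raise ValueError("this sketch only models instructions of 1 to 4 uops")
    cycles = 0
    i = 0
    while i < len(uop_counts):
        cycles += 1
        i += 1                      # D0 accepts the next instruction (up to 4 uops)
        for _ in range(3):          # D1, D2, D3 accept single-uop instructions only
            if i < len(uop_counts) and uop_counts[i] == 1:
                i += 1
            else:
                break               # multi-uop instruction is deferred to D0 next cycle
    return cycles
```

For example, the order `[2, 1, 1, 1]` decodes in one cycle, while `[1, 2, 1, 1]` needs two, because the 2-uop instruction lands on D1 and is deferred.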
Micro Operations (Uops)
 Each CISC inst is broken into one or more RISC uops
– Each uop is (relatively) simple
– Canonical representation of src/dest (2 src, 1 dest)
– Increased ILP
 e.g., pop eax becomes esp1<-esp0+4, eax1<-[esp0]
 Simple instructions translate to a few uops
– Typical uop count (it is not necessarily cycle count!)
   Reg-Reg ALU/Mov inst          1 uop
   Mem-Reg Mov (load)            1 uop
   Mem-Reg ALU (load + op)       2 uops
   Reg-Mem Mov (store)           2 uops (st addr, st data)
   Reg-Mem ALU (ld + op + st)    4 uops
 Complex instructions need ucode
Out-of-order Core
[Figure: P6 block diagram]
 Reorder Buffer (ROB):
– Holds “not yet retired” instructions
– 40 entries, in-order
 Reservation Stations (RS):
– Holds “not yet executed” instructions
– 20 entries
 Execution Units
– IEU: Integer Execution Unit
– FEU: Floating-point Execution Unit
 Memory related units
– AGU: Address Generation Unit
– MIU: Memory Interface Unit
– DCU: Data Cache Unit
– MOB: Orders Memory operations
– L2: Level 2 cache
Alloc & RAT
 Perform register allocation and renaming for ≤4 uops/cyc
 The Register Alias Table (RAT)
– Maps architectural registers into physical registers
   • For each arch reg, holds the number of the latest phy reg that updates it
– When a new uop that writes to an arch reg R is allocated
   • Record the phy reg allocated to the uop as the latest reg that updates R
   Arch reg   #reg   Location
   EAX        0      RRF
   EBX        19     ROB
   ECX        23     ROB
 The Allocator (Alloc)
– Assigns each uop an entry number in the ROB / RS
– For each one of the sources (architectural registers) of the uop
   • Look up the RAT to find the latest phy reg updating it
   • Write it in the RS entry
– Allocate Load & Store buffers in the MOB
Re-order Buffer (ROB)
 Holds 40 uops which are not yet committed
– In the same order as in the program
 Provides a large physical register space for register renaming
– One physical register per ROB entry
   • Physical register number = entry number
   • Each uop has only one destination
   • Buffers the execution results until retirement
– Data valid is set after the uop has executed and its result is written to the physical reg
   #entry   Entry Valid   Data Valid   Physical Reg Data   Architectural dest. reg
   0        1             1            12H                 EBX
   1        1             1            33H                 ECX
   2        1             0            xxx                 ESI
   …
   39       0             0            xxx                 XXX
RRF – Real Register File
 Holds the Architectural Register File
– Architectural registers are numbered: 0 – EAX, 1 – EBX, …
 The value of an architectural register
– is the value written to it by the last committed instruction that writes to this register
   RRF:
   #entry (Arch Reg)   Data
   0 (EAX)             9AH
   1 (EBX)             F34H
Uop flow through the ROB
 Uops are entered in order
– Registers renamed by the entry number
 Once assigned: execution order unimportant
 After execution
– Entries are marked “executed” and wait for retirement
– An executed entry can be “retired” once all prior instructions have retired
– Commit architectural state only after speculation (branch, exception) has resolved
 Retirement
– Detect exceptions and mispredictions
   • Initiate repair to get the machine back on the right track
– Update “real registers” with the value of renamed registers
– Update memory
– Leave the ROB
Reservation station (RS)
 Pool of all “not yet executed” uops
– Holds the uop attributes and the uop source data until it is dispatched
 When a uop is allocated in RS, operand values are updated
– If operand is from an architectural register, value is taken from the RRF
– If operand is from a phy reg with data valid set, value is taken from the ROB
– If operand is from a phy reg with data valid not set, wait for the value
 The RS maintains operand status “ready/not-ready”
– Each cycle, executed uops make more operands “ready”
   • The RS arbitrates the WB busses between the units
   • The RS monitors the WB bus to capture data needed by waiting uops
   • Data can be bypassed directly from the WB bus to an execution unit
– Uops whose operands are all ready can be dispatched for execution
   • Dispatcher chooses which of the ready uops to execute next
   • Dispatches chosen uops to functional units
Register Renaming example
 Uop read from the IDQ: add EAX, EBX, EAX
 RAT before allocation
   Arch reg   #reg   Location
   EAX        0      RRF
   EBX        19     ROB
   ECX        23     ROB
 RAT after allocation (EAX now maps to ROB entry 37)
   Arch reg   #reg   Location
   EAX        37     ROB
   EBX        19     ROB
   ECX        23     ROB
 Renamed uop: add ROB37, ROB19, RRF0
 ROB after allocation (entry 37 was invalid before)
   #    Entry Valid   Data Valid   Data   DST
   19   V             V            12H    EBX
   23   V             V            33H    ECX
   37   V             I            xxx    EAX
   38   I             x            xxx    XXX
 RRF
   #   Arch Reg   Data
   0   EAX        97H
 RS
   uop   v   src1   v   src2   Pdst
   add   1   97H    1   12H    37
Register Renaming example (2)
 Uop read from the IDQ: sub EAX, ECX, EAX
 RAT before allocation
   Arch reg   #reg   Location
   EAX        37     ROB
   EBX        19     ROB
   ECX        23     ROB
 RAT after allocation (EAX now maps to ROB entry 38)
   Arch reg   #reg   Location
   EAX        38     ROB
   EBX        19     ROB
   ECX        23     ROB
 Renamed uop: sub ROB38, ROB23, ROB37
 ROB after allocation
   #    Entry Valid   Data Valid   Data   DST
   19   V             V            12H    EBX
   23   V             V            33H    ECX
   37   V             I            xxx    EAX
   38   V             I            xxx    EAX
 RRF
   #   Arch Reg   Data
   0   EAX        97H
 RS
   uop   v   src1    v   src2   Pdst
   add   1   97H     1   12H    37
   sub   0   rob37   1   33H    38
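The two renaming examples can be replayed with a small model of the RAT and allocator. This is a sketch for illustration only: the `Renamer` class and its method names are hypothetical, uops are written as (op, dest, src1, src2) to match the renamed form on the slides, and ROB entries are simply handed out in increasing order modulo the 40-entry ROB.

```python
from typing import Dict, Tuple

Location = Tuple[str, int]  # ("RRF", n) or ("ROB", n)


class Renamer:
    def __init__(self, rat: Dict[str, Location], next_rob_entry: int,
                 rob_size: int = 40) -> None:
        self.rat = dict(rat)            # arch reg -> latest location of its value
        self.next_rob = next_rob_entry  # next ROB entry the allocator hands out
        self.rob_size = rob_size

    def rename(self, op: str, dest: str, src1: str, src2: str):
        # Sources are looked up first, so a uop that reads and writes the
        # same register sees the previous mapping.
        s1, s2 = self.rat[src1], self.rat[src2]
        entry = self.next_rob
        self.next_rob = (self.next_rob + 1) % self.rob_size
        # The new ROB entry becomes the latest location of the destination.
        self.rat[dest] = ("ROB", entry)
        return (op, ("ROB", entry), s1, s2)


renamer = Renamer({"EAX": ("RRF", 0), "EBX": ("ROB", 19), "ECX": ("ROB", 23)},
                  next_rob_entry=37)
first = renamer.rename("add", "EAX", "EBX", "EAX")
second = renamer.rename("sub", "EAX", "ECX", "EAX")
```

Here `first` corresponds to `add ROB37, ROB19, RRF0` and `second` to `sub ROB38, ROB23, ROB37`: the second uop picks up ROB37 as its EAX source because the first uop updated the RAT.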
Out-of-order Core: Execution Units
[Figure: the RS dispatches uops to the execution units through ports 0 to 4; the memory ports connect through the MIU to the DCU]
 Port 0: SHF, FMU, FDIV, IDIV, FAU, IEU
 Port 1: JEU, IEU
 Port 2: AGU (load address)
 Port 3,4: AGU (store address), SDB
 Bypasses
– Internal 0-delay bypass within each EU
– 1st bypass in MIU
– 2nd bypass in RS
In-Order Retire
[Figure: P6 block diagram]
 ROB:
– Retires up to 4 uops per clock
– Copies the values to the RRF
– Retirement is done in-order
– Performs exception checking
In-order Retirement
 The process of committing the results to the architectural
state of the processor
 Retire up to 4 uops per clock
 Copy the values to the RRF
 Retirement is done In Order
 Perform exception checking
 An instruction is retired after the following checks
– Instruction has executed
– All previous instructions have retired
– Instruction isn’t mis-predicted
– no exceptions
Pipeline: Fetch
[Figure: pipeline overview: Predict/Fetch → IQ → Decode → IDQ → Alloc → RS → Schedule → EX → ROB → Retire]
 Fetch 16B from I$
 Length-decode instructions within 16B
 Write instructions into IQ
Pipeline: Decode
 Read 4 instructions from IQ
 Translate instructions into uops
– Asymmetric decoders (4-1-1-1)
 Write resulting uops into IDQ
Pipeline: Allocate
 Allocate, port bind and rename 4 uops
 Allocate ROB/RS entry per uop
– If source data is available from ROB or RRF, write data to RS
– Otherwise, mark data not ready in RS
Pipeline: EXE
 Ready/Schedule
– Check for data-ready uops whose needed functional unit is available
– Select and dispatch ≤6 ready uops/clock to EXE
– Reclaim RS entries
 Write back results into RS/ROB
– Write results into result buffer
– Snoop write-back ports for results that are sources to uops in RS
– Update data-ready status of these uops in the RS
Pipeline: Retire
 Retire ≤4 oldest uops in ROB
– Uop may retire if
   • its ready bit is set
   • it does not cause an exception
   • all preceding candidates are eligible for retirement
– Commit results from result buffer to RRF
– Reclaim ROB entry
 In case of exception
– Nuke and restart
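The retirement conditions listed for the Retire stage can be expressed as a loop over the oldest ROB entries. The sketch below is illustrative only: the `retire` function and the dictionary keys `ready` and `exception` are hypothetical, and "nuke and restart" is modeled simply as discarding every in-flight uop.

```python
from typing import Dict, List, Tuple


def retire(rob: List[Dict[str, bool]], width: int = 4) -> Tuple[int, bool]:
    """Retire up to `width` uops from the head of `rob` (oldest first).

    Returns (number of uops retired this cycle, whether a nuke happened).
    """
    retired = 0
    while rob and retired < width:
        head = rob[0]
        if not head["ready"]:
            break                 # in-order: younger uops wait behind the head
        if head["exception"]:
            rob.clear()           # nuke: drop all in-flight uops and restart
            return retired, True
        rob.pop(0)                # commit result and reclaim the ROB entry
        retired += 1
    return retired, False
```

A not-yet-executed uop at the head stalls retirement even when younger uops are ready, which is why instructions pile up in the ROB behind a long-latency operation.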
Jump Misprediction – Flush at Execute
 When the JEU detects a jump misprediction it
– Flushes the in-order front-end
– Instructions already in the OOO part continue to execute
   • Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
– Starts fetching and decoding from the “correct” path
   • The “correct” path may still be wrong
      A preceding uop that hasn’t executed may cause an exception
      A preceding jump executed OOO can also mispredict
– The “correct” instruction stream is stalled at the RAT
   • The RAT was also wrongly updated by wrong-path instructions
 When the mispredicted branch retires
– Resets all state in the Out-of-Order Engine (RAT, RS, ROB, MOB, etc.)
   • Only instructions following the jump are left – they must all be flushed
   • Reset the RAT to point only to architectural registers
– Un-stalls the in-order machine
– RS gets uops from RAT and starts scheduling and dispatching them
Pipeline: Branch gets to EXE
[Figure: pipeline overview with the branch uop scheduled from the RS to the JEU]
Pipeline: Mispredicted Branch EXE
 Flush front-end and re-steer it to correct path
 RAT state already updated by wrong path
– Block further allocation
 Update BPU
 OOO not flushed: instructions already in the OOO continue to execute
– Including instructions following the wrong jump, which take execution resources and waste power, but will never be committed
 Block younger branches from clearing
Pipeline: Mispredicted Branch Retires
 When the mispredicted branch retires
– Flush OOO
   • Only instructions following the jump are left – they must all be flushed
   • Resets all state in the OOO (RAT, RS, ROB, MOB, etc.)
   • Reset the RAT to point only to architectural registers
– Allow allocation of uops from correct path
Instant Reclamation
 Allow a faster recovery after jump misprediction
– Allow execution/allocation of uops from the correct path before the mispredicted jump retires
 Every few cycles take a checkpoint of the RAT
 In case of misprediction
– Flush the frontend and re-steer it to the correct path
– Recover RAT to the latest checkpoint taken prior to the misprediction
– Recover RAT to its exact state at the misprediction
   • Rename 4 uops/cycle from the checkpoint until the branch
– Flush all uops younger than the branch in the OOO
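The two-step RAT recovery described for Instant Reclamation (restore the latest checkpoint, then re-rename the uops between the checkpoint and the branch at 4 uops/cycle) can be sketched as follows. The names are hypothetical and the model is simplified: the RAT maps an architectural register to a plain ROB entry number, and each uop since the checkpoint is given as a (destination register or None, ROB entry) pair.

```python
import math
from typing import Dict, List, Optional, Tuple


def recover_rat(checkpoint: Dict[str, int],
                uops_since_checkpoint: List[Tuple[Optional[str], int]],
                branch_index: int,
                rename_width: int = 4) -> Tuple[Dict[str, int], int]:
    """Rebuild the RAT as it was right after the mispredicted branch was renamed.

    Returns the recovered RAT and the number of cycles spent re-renaming.
    """
    rat = dict(checkpoint)                       # step 1: restore the checkpoint
    replay = uops_since_checkpoint[:branch_index + 1]
    for dest, rob_entry in replay:               # step 2: replay up to the branch
        if dest is not None:                     # branches and stores have no dest
            rat[dest] = rob_entry
    cycles = math.ceil(len(replay) / rename_width)
    return rat, cycles
```

Uops after `branch_index` belong to the wrong path, so their RAT updates are never replayed; the closer the checkpoint is to the branch, the fewer cycles the replay takes.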
Instant Reclamation: Mispredicted Branch EXE
 JEClear raised on mispredicted macro-branches
– Flush frontend and re-steer it to the correct path
– Flush all younger uops in OOO
– Update BPU
– Block further allocation
Pipeline: Instant Reclamation: EXE
 Restore RAT from the latest checkpoint before the branch
 Recover RAT to its state just after the branch
– Before any instruction on the wrong path
 Meanwhile the front-end starts fetching and decoding instructions from the correct path
Pipeline: Instant Reclamation: EXE (Cont.)
 Once done restoring the RAT
– Allow allocation of uops from correct path
Large ROB and RS are Important
 Large RS
– Increases the window in which to look for independent instructions
   • Exposes more parallelism potential
 Large ROB
– The ROB is a superset of the RS → ROB size ≥ RS size
– Allows covering long latency operations (cache miss, divide)
 Example
– Assume there is a Load that misses the L1 cache
   • Data takes ~10 cycles to return → ~30 new instrs get into the pipeline
– Instructions following the Load cannot commit → they pile up in the ROB
– Instructions independent of the load are executed, and leave the RS
   • As long as the ROB is not full, we can keep executing instructions
– A 40 entry ROB can cover for an L1 cache miss
   • It cannot cover for an LLC miss, which takes hundreds of cycles
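The ROB sizing example is a simple product: uops allocated while the load is outstanding must fit in the ROB. The sketch below redoes that arithmetic. The function name is hypothetical, the allocation rate of 3 per cycle is inferred from the slide's "~10 cycles → ~30 new instrs", and 200 cycles stands in for "hundreds of cycles".

```python
def rob_covers_miss(miss_latency_cycles: int, alloc_per_cycle: int,
                    rob_entries: int = 40) -> bool:
    """True if the ROB can absorb everything allocated during the miss."""
    in_flight = miss_latency_cycles * alloc_per_cycle
    return in_flight <= rob_entries


l1_miss_ok = rob_covers_miss(10, 3)    # 30 entries needed, 40 available
llc_miss_ok = rob_covers_miss(200, 3)  # 600 entries needed, 40 available
```

Under these assumptions the L1 miss is covered and the LLC miss is not: the ROB fills after roughly 13 cycles and allocation stalls for the rest of the miss.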
OOO Execution of Memory Operations
P6 Caches
 Blocking caches severely hurt OOO
– A cache miss prevents other cache requests (which could possibly be hits) from being served
– Hurts one of the main gains from OOO – hiding cache misses
 Both L1 and L2 caches in the P6 are non-blocking
– Initiate the actions necessary to return data for a cache miss while they respond to subsequent cached data requests
– Support up to 4 outstanding misses
 Misses translate into outstanding requests on the P6 bus
 The bus can support up to 8 outstanding requests
 Squash subsequent requests for the same missed cache line
– Squashed requests are not counted in the number of outstanding requests
– Once the engine has executed beyond the 4 outstanding requests
   • subsequent load requests are placed in the load buffer
OOO Execution of Memory Operations
 The RS operates based on register dependencies
– RS cannot detect memory dependencies
   movl -4(%ebp), %ebx # ebx ← MEM[ebp-4]
   movl %eax, -4(%ebp) # MEM[ebp-4] ← eax
– RS dispatches memory uops when data for address calculation is ready, and the MOB and Address Generation Unit (AGU) are free
– AGU computes the linear address
   • Segment-Base + Base-Address + (Scale*Index) + Displacement
   • Sends linear address to MOB, to be stored in Load Buffer or Store Buffer
 MOB resolves memory dependencies and enforces memory ordering
– Some memory dependencies can be resolved statically
   store r1,a
   load r2,b → can advance load before store
– Problem: some cannot
   store r1,[r3]
   load r2,b → load must wait till r3 is known
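The AGU formula, Segment-Base + Base-Address + (Scale*Index) + Displacement, can be checked with a few lines of code. This is a sketch with a hypothetical function name; it assumes a 32-bit linear address that wraps around, and it relies on the x86 rule that the scale factor is 1, 2, 4 or 8.

```python
def linear_address(segment_base: int, base: int, index: int,
                   scale: int, displacement: int) -> int:
    """Compute a 32-bit x86 linear address from its addressing components."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4 or 8")
    # Sum the components and wrap to 32 bits (assumed address width).
    return (segment_base + base + scale * index + displacement) & 0xFFFFFFFF


# -4(%ebp) with ebp = 0x1000 and a flat segment (base 0): no index register.
addr = linear_address(segment_base=0, base=0x1000, index=0, scale=1,
                      displacement=-4)
```

Both `movl` uops in the slide's example produce this same linear address, which is exactly the dependency the RS cannot see and the MOB has to detect.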
Load and Store Ordering
 x86 has a small register set → uses memory often
– Preventing Stores from passing Stores/Loads: 3%~5% perf. loss
   • P6 chooses not to allow Stores to pass Stores/Loads
– Preventing Loads from passing Loads/Stores: big perf. loss
   • P6 allows Loads to pass Stores, and Loads to pass Loads
 Stores are not executed OOO
– Stores are never performed speculatively
   • There is no transparent way to undo them
– Stores are also never re-ordered among themselves
   • The Store Buffer dispatches a store only when
      the store has both its address and its data, and
      there are no older stores awaiting dispatch
– Store commits its write to memory (DCU) at retirement
Store Implemented as 2 Uops
 Store decoded as two independent uops
– STA (store-address): calculates the address of the store
– STD (store-data): stores the data into the Store Data buffer
   • The actual write to memory is done when the store retires
 Separating STA & STD is important for memory OOO
– Allows STA to dispatch earlier, even before the data is known
– Address conflicts resolved earlier → opens memory pipeline for other loads
 STA and STD can be issued to execution units in parallel
– STA dispatched to AGU when its sources (base+index) are ready
– STD dispatched to SDB when its source operand is available
Memory Order Buffer (MOB)
 Store Coloring
– Each Store is allocated in-order in the Store Buffer, and gets an SBID
– Each Load is allocated in-order in the Load Buffer, and gets an LBID + the current SBID
 Load is checked against all previous stores
– Stores with SBID ≤ load’s SBID
 Load blocked if
– Unresolved address of a relevant STA
– STA to same address, but data not ready
– Missing resources (DTLB miss, DCU miss)
 MOB writes blocking info into load buffer
– Re-dispatches load when wake-up signal received
 If Load is not blocked → executed (bypassed)
   Op      LBID   SBID
   Store   -      0
   Store   -      1
   Load    0      1
   Store   -      2
   Load    1      2
   Load    2      2
   Load    3      2
   Store   -      3
   Load    4      3
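The store coloring table on the MOB slide can be reproduced by walking the memory uops in program order. This is a minimal sketch with a hypothetical function name; it ignores buffer wrap-around, and a load allocated before any store would get SBID -1 here, which is a convention of this sketch rather than something stated on the slide.

```python
from typing import List, Optional, Tuple


def color(ops: List[str]) -> List[Tuple[str, Optional[int], int]]:
    """Assign (LBID, SBID) to memory uops given in program order."""
    sbid = -1   # ID of the most recently allocated store
    lbid = -1   # ID of the most recently allocated load
    colored = []
    for op in ops:
        if op == "Store":
            sbid += 1
            colored.append((op, None, sbid))
        elif op == "Load":
            lbid += 1
            # A load is tagged with the SBID of the youngest older store.
            colored.append((op, lbid, sbid))
        else:
            raise ValueError(f"unknown memory op: {op}")
    return colored


program = ["Store", "Store", "Load", "Store", "Load",
           "Load", "Load", "Store", "Load"]
table = color(program)
```

The load's SBID marks the boundary of "previous stores": only store buffer entries with SBID less than or equal to it need to be checked.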
MOB (Cont.)
 If a Load misses in the DCU
– The DCU marks the write-back data as invalid
– Assigns a fill buffer to the load, and issues an L2 request
– When the critical chunk is returned, wake up and re-dispatch the load
 Store → Load Forwarding
– Older STA with same address as load and data ready → Load gets its data directly from the SB (no DCU access)
 Memory Disambiguation
– MOB predicts if a load can proceed despite unknown STAs
   • Predict colliding → block Load if there is an unknown STA (as usual)
   • Predict non colliding → execute even if there are unknown STAs
– In case of wrong prediction
   • The entire pipeline is flushed when the load retires
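The MOB's blocking, forwarding and disambiguation rules can be combined into one decision function. The sketch below is an illustration under assumptions, not the P6 implementation: `StoreEntry` and `check_load` are hypothetical names, addresses match only when exactly equal (partial overlaps and access sizes are ignored), and the disambiguation prediction is passed in as a flag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StoreEntry:
    sbid: int
    address: Optional[int]  # None until the STA has executed
    data: Optional[int]     # None until the STD has executed


def check_load(load_addr: int, load_sbid: int, store_buffer: List[StoreEntry],
               predict_non_colliding: bool = False) -> Tuple[str, Optional[int]]:
    """Decide whether a load is blocked, forwarded from the SB, or sent to the DCU."""
    # Only stores older than the load are relevant (SBID <= load's SBID).
    older = [s for s in store_buffer if s.sbid <= load_sbid]
    # An unknown store address blocks the load unless predicted non-colliding.
    if not predict_non_colliding and any(s.address is None for s in older):
        return ("blocked", None)
    matches = [s for s in older if s.address == load_addr]
    if not matches:
        return ("dcu", None)              # no conflict: read the data cache
    youngest = max(matches, key=lambda s: s.sbid)
    if youngest.data is None:
        return ("blocked", None)          # same address, data not ready
    return ("forwarded", youngest.data)   # store → load forwarding
```

When `predict_non_colliding` is set and the unknown address later turns out to match, the load has consumed stale data; per the slide, that case is repaired by flushing the pipeline when the load retires.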
Pipeline: Load: Allocate
[Figure: load pipeline stages: Alloc, Schedule, AGU, LB Write, MOB, DTLB, DCU, WB, Retire; structures: IDQ, RS, ROB, LB]
 Allocate ROB/RS, MOB entries
 Assign Store Buffer ID (SBID) to enable ordering
Pipeline: Bypassed Load: EXE
 RS checks when data used for address calculation is ready
 AGU calculates linear address: DS-Base + base + (Scale*Index) + Disp.
 Write load into Load Buffer
 DTLB Virtual → Physical + DCU set access
 MOB checks blocking and forwarding
 DCU read / Store Data Buffer read (Store → Load forwarding)
 Write back data / write block code
Pipeline: Blocked Load Re-dispatch
 MOB determines which loads are ready, and schedules one
 Load arbitrates for MEU
 DTLB Virtual → Physical + DCU set access
 MOB checks blocking/forwarding
 DCU way select / Store Data Buffer read
 Write back data / write block code
Pipeline: Load: Retire
 Reclaim ROB, LB entries
 Commit results to RRF
Pipeline: Store: Allocate
[Figure: store pipeline stages: Alloc, Schedule, AGU, DTLB, Retire; structures: IDQ, RS, ROB, SB]
 Allocate ROB/RS
 Allocate Store Buffer entry
Pipeline: Store: STA EXE
 RS checks when data used for address calculation is ready
– dispatches STA to AGU
 AGU calculates linear address
 Write linear address to Store Buffer
 DTLB Virtual → Physical
 Load Buffer Memory Disambiguation verification
 Write physical address to Store Buffer
Pipeline: Store: STD EXE
 RS checks when data for STD is ready
– dispatches STD
 Write data to Store Buffer
Pipeline: Senior Store Retirement
 When STA (and thus STD) retires
– Store Buffer entry marked as senior
 When DCU idle → MOB dispatches senior store
 Read senior entry
– Store Buffer sends data and physical address
 DCU writes data
 Reclaim SB entry