AMA-L07-RegisterRenaming

Lecture 7: Register Renaming
Read-After-Write (RAW)
A: R1 = R2 + R3
B: R4 = R1 * R4

Write-After-Read (WAR)
A: R1 = R3 / R4
B: R3 = R2 * R4

Write-After-Write (WAW)
A: R1 = R2 + R3
B: R1 = R3 * R4

[Figure: register contents for each pair, starting from R1 = 5, R2 = -2, R3 = 9, R4 = 3]
                    A then B (program order)    B then A (reordered)
Read-After-Write    R1 = 7, R4 = 21             R1 = 7, R4 = 15
Write-After-Read    R1 = 3, R3 = -6             R1 = -2, R3 = -6
Write-After-Write   R1 = 27                     R1 = 7
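A minimal sketch of the three hazards, assuming integer arithmetic and the starting values from the figure (R1 = 5, R2 = -2, R3 = 9, R4 = 3). The helper names `run` and `reorder_changes_result` are illustrative. Each pair is executed in program order and then swapped; in all three cases the final register state differs.

```python
def run(program, regs):
    """Execute (dest, op, src1, src2) tuples in the given order on a copy of regs."""
    ops = {
        "+": lambda a, b: a + b,
        "*": lambda a, b: a * b,
        "/": lambda a, b: a // b,  # integer division is enough for this example
    }
    regs = dict(regs)
    for dest, op, s1, s2 in program:
        regs[dest] = ops[op](regs[s1], regs[s2])
    return regs


INIT = {"R1": 5, "R2": -2, "R3": 9, "R4": 3}

PAIRS = {
    "RAW": [("R1", "+", "R2", "R3"), ("R4", "*", "R1", "R4")],
    "WAR": [("R1", "/", "R3", "R4"), ("R3", "*", "R2", "R4")],
    "WAW": [("R1", "+", "R2", "R3"), ("R1", "*", "R3", "R4")],
}


def reorder_changes_result(pair):
    """True if swapping the two instructions changes the final register state."""
    return run(pair, INIT) != run(pair[::-1], INIT)


for name, pair in PAIRS.items():
    print(name, "in order:", run(pair, INIT), "reordered:", run(pair[::-1], INIT))
```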
• Register Data Dependencies (this lecture)
– Output dependence (WAW), also do
– Anti-dependence (WAR), da
– True dependence (RAW), dt
– Why is RAR not a dependency?
• Memory Data Dependencies (later lecture)
• Control Dependencies (earlier lectures)
• Structural Dependencies
– Instruction must wait until some “structure” is available
• Ex: Divider, ROB entry, Branch color/tag, etc.
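The register dependence types can be checked mechanically. The sketch below is an illustration only; the `(dest, sources)` tuple format and the function name `classify` are assumptions. Two reads of the same register (RAR) produce no entry, since neither instruction changes what the other observes.

```python
def classify(first, second):
    """Return the register dependences of `second` on the earlier `first`.

    Each instruction is a (dest, sources) pair, e.g. ("R1", ["R2", "R3"]).
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    deps = set()
    if dest1 in srcs2:
        deps.add("RAW")  # true dependence: second reads what first wrote
    if dest2 in srcs1:
        deps.add("WAR")  # anti-dependence: second overwrites what first reads
    if dest1 == dest2:
        deps.add("WAW")  # output dependence: both write the same register
    return deps


print(classify(("R1", ["R2", "R3"]), ("R4", ["R1", "R4"])))
print(classify(("R1", ["R3", "R4"]), ("R3", ["R2", "R4"])))
print(classify(("R1", ["R2", "R3"]), ("R1", ["R3", "R4"])))
print(classify(("R1", ["R2", "R3"]), ("R4", ["R2", "R3"])))  # RAR only: empty set
```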
• WAR dependencies are from reusing registers
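Original code:
A: R1 = R3 / R4
B: R3 = R2 * R4

Renamed code (B's destination R3 replaced by R5):
A: R1 = R3 / R4
B: R5 = R2 * R4

[Figure: register contents, starting from R1 = 5, R2 = -2, R3 = 9, R4 = 3, R5 = 4]
Original, A then B:  R1 = 3, R3 = -6
Original, B then A:  R1 = -2, R3 = -6 (A reads the overwritten R3)
Renamed, B then A:   R1 = 3, R3 = 9, R5 = -6

With no dependencies, reordering still produces the correct results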
• WAW dependencies are also from reusing registers
Original code:
A: R1 = R2 + R3
B: R1 = R3 * R4

Renamed code (A's destination R1 replaced by R5):
A: R5 = R2 + R3
B: R1 = R3 * R4

[Figure: register contents, starting from R1 = 5, R2 = -2, R3 = 9, R4 = 3, R5 = 4]
Original, A then B:  R1 = 27
Original, B then A:  R1 = 7
Renamed, B then A:   R1 = 27, R5 = 7

Same solution works
• Finite number of registers
– At some point, you’re forced to overwrite somewhere
– Most RISC: 32 registers, x86: only 8, x86-64: 16
• Loops, Code Reuse
– If you write a value to R1 in a loop body, then R1 will be
reused every iteration → induces many false dep’s
– Loop unrolling can help a little
• Will run out of registers at some point anyway
• Trade off with code bloat
– Short function calls can result in similar register reuse
• Inlining can help a little
• Add more registers to the ISA? BAD!!!
– Changing the ISA can break binary compatibility
• x86-64 mostly doesn’t break compatibility, but it’s a hack
– All code must be recompiled
– Does not address register overwriting due to code reuse
from loops and function calls
– Not a scalable solution
• Processor has more registers than specified by the
ISA → temporarily map ISA registers (“logical” or
“architected” registers) to the physical registers to
avoid overwrites
• Components:
– mapping mechanism
– physical registers
• allocated vs. free registers
• allocation/deallocation mechanism
– state maintenance (commit, mispredictions, etc.)
[Figure: Architected Registers R0-R7 mapped onto a larger set of Physical Registers T0 ... Tn-1]

Original Code (the repeated writes to R2 create WAW and WAR dependencies):
R2 = R1 + R3
R4 = R2 - R6
…
R2 = R7 / R5
BEQ R2, #1
…
R2 = R4 * R1
R6 = Load [R2]

Renamed Code:
T1 = R1 + R3
R4 = T1 - R6
…
T20 = R7 / R5
BEQ T20, #1
…
T7 = R4 * R1
R6 = Load [T7]

No False Dependencies!
[Figure: the Mapping Mechanism renames one instruction, Dest = Src1 op Src2]
Src1 → TagS1
Src2 → TagS2
TagD is taken from the Unmapped Physical Registers
Dest → TagD
Renamed instruction: TagD = TagS1 op TagS2
Repeat for each instruction
• Lookup Table
– One entry per architected register
– Entry stores physical location of most recent version of the
logical register
– Most recent version may be in the physical register file or
in the architected register file
[Figure: each RAT entry points to the most recent version of its register, in either the ARF or the PRF]
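[Figure: RAT contents and free list while renaming four instructions in order; "-" means the most recent version is in the ARF]

Original        Renamed            RAT after rename (R0 ... R7)     Free PRegs after rename
(initial)                          -   -   -   -   -   -   -   -    T13, T14, T9, T7
R1 = R2 + R3    T13 = R2 + R3      -   13  -   -   -   -   -   -    T14, T9, T7
R5 = R4 – R1    T14 = R4 – T13     -   13  -   -   -   14  -   -    T9, T7
R1 = R1 * R5    T9 = T13 * T14     -   9   -   -   -   14  -   -    T7
R2 = R5 / R1    T7 = T14 / T9      -   9   7   -   -   14  -   -    (empty)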
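The walk-through above can be sketched as a small renamer. This is an illustration under assumptions, not a hardware description: the class name `Renamer`, the dictionary-based RAT (a missing entry means the value is still in the ARF), and the tuple format are all hypothetical. The key ordering point is that sources are looked up before the destination mapping is updated.

```python
from collections import deque


class Renamer:
    """One-instruction-at-a-time renamer with a RAT and a free list (illustrative)."""

    def __init__(self, free_pregs):
        self.rat = {}                  # arch reg -> physical reg; absent => use ARF
        self.free = deque(free_pregs)  # free physical registers, in allocation order

    def rename(self, dest, src1, src2):
        # Read the source mappings first, so "R1 = R1 * R5" sees the old R1.
        tag1 = self.rat.get(src1, src1)
        tag2 = self.rat.get(src2, src2)
        # Then allocate a fresh physical register for the destination.
        tag_d = self.free.popleft()
        self.rat[dest] = tag_d
        return tag_d, tag1, tag2


renamer = Renamer(["T13", "T14", "T9", "T7"])
program = [("R1", "R2", "R3"), ("R5", "R4", "R1"), ("R1", "R1", "R5"), ("R2", "R5", "R1")]
renamed = [renamer.rename(*inst) for inst in program]
for inst in renamed:
    print(inst)
```

The final RAT holds R1 → T9, R2 → T7 and R5 → T14, and the free list is empty, matching the last row of the example.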
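R1 = R2 + R3
R4 = R5 – R7
R3 = R0 / R2
R5 = Ld 12[R6]

[Figure: all four instructions are renamed in the same cycle. Tags read from the RAT for the left sources: T16, T39, T14, T5; for the right sources: T23, T7, T16, X. Destination tags from the free register pool: T10, T31, T19, T6.]
Don’t rename immediates (X)

For N-wide superscalar:
2N RAT read-ports
N RAT write-ports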
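R1 = R2 + R3
R4 = R5 – R7
R3 = R0 / R1
R5 = Ld 12[R6]

[Figure: same parallel rename, with destination tags T10, T31, T19, T6 from the free register pool. The third instruction reads R1, which the first instruction in the group writes.]
The tag read from the RAT (T16) is the wrong version of R1
Should be using the version of R1 just allocated from the free register pool (T10)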
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1

[Figure: parallel RAT lookups return T16 for every read of R2 and T34 for every read of R1; destination tags T10, T31, T19, T6 come from the free register pool]
Result of sequential renaming:
T10 = T16 + T34
T31 = T10 – T16
T19 = T31 / T10
T6 = T31 >> T19
[Figure: Inst 0 to Inst 3 each supply Src L, Src R and Dest. Sources are looked up in the RAT and checked by the Intra-Group Dependency Checker; destinations receive tags from the free register pool. Outputs are the renamed source tags T0L, T0R, T1L, T1R, T2L, T2R, T3L, T3R.]
Comparators for src0L and src0R are not needed since 1st inst in a group has no earlier insts to be dependent on
Similarly, src1L and src1R cannot be dependent on dst1, dst2 or dst3
[Figure: comparator array. src1L and src1R are compared (=) against dst0; src2L and src2R against dst0 and dst1; src3L and src3R against dst0, dst1 and dst2. A match overrides the RAT output (R1L, R1R, R2L, R2R, R3L, R3R) with the newer tag to produce T1L, T1R, T2L, T2R, T3L, T3R.]
Total number of comparisons: $2 \cdot \frac{n(n-1)}{2} = n^2 - n = O(n^2)$
N-wide rename has O(N) gate delay?
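A sketch of group renaming with the intra-group dependency check, written as software for illustration; real hardware does the comparisons in parallel. The function name `rename_group`, the tuple format, and the initial mappings R1 → T34 and R2 → T16 are assumptions chosen for the example. Every RAT read sees the state from before the group, and a match against an earlier destination in the group overrides the RAT output. The counter confirms the n² – n comparison count for n = 4.

```python
def rename_group(group, rat, free_tags):
    """Rename a group of (dest, src_left, src_right) instructions in one step.

    Returns the renamed instructions and the number of comparisons performed.
    """
    new_tags = list(free_tags[:len(group)])  # one fresh tag per instruction
    comparisons = 0
    renamed = []
    for i, (dest, src_left, src_right) in enumerate(group):
        tags = []
        for src in (src_left, src_right):
            tag = rat.get(src, src)          # RAT read: state before the group
            for j in range(i):               # compare against earlier destinations
                comparisons += 1
                if group[j][0] == src:
                    tag = new_tags[j]        # youngest earlier writer wins
            tags.append(tag)
        renamed.append((new_tags[i], tags[0], tags[1]))
    return renamed, comparisons


rat = {"R1": "T34", "R2": "T16"}             # assumed initial mappings
group = [("R1", "R2", "R1"), ("R2", "R1", "R2"), ("R1", "R2", "R1"), ("R1", "R2", "R1")]
renamed, comparisons = rename_group(group, rat, ["T10", "T31", "T19", "T6"])
for inst in renamed:
    print(inst)
print("comparisons:", comparisons)
```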
[Figure: src7R is compared (=) against dst0 through dst6 in parallel; the matches select between the RAT output R7R and the newer tags to produce T7R]
Gate delay reduced down to $O(\log_2 N)$
R1 = R2 + R1
R2 = R1 – R2
R1 = R2 / R1
R1 = R2 >> R1
Only the last mapping for R1 should be written into the RAT
Condition: use mapping if instruction is last writer to the register

[Figure: write-enable logic comparing dst0, dst1, dst2, dst3]
use dst0 if dst0 != dst1 and dst0 != dst2 and dst0 != dst3
use dst1 if dst1 != dst2 and dst1 != dst3
use dst2 if dst2 != dst3
use dst3 always (1)
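The last-writer condition can be written down directly. A minimal sketch, with the function name `rat_write_enables` chosen for illustration: an instruction's mapping is written to the RAT only if no later instruction in the group has the same destination.

```python
def rat_write_enables(dests):
    """For each instruction in a rename group, decide whether it writes the RAT."""
    return [
        all(dest != later for later in dests[i + 1:])  # no younger writer to dest
        for i, dest in enumerate(dests)
    ]


# Destinations of the four-instruction example: R1, R2, R1, R1
enables = rat_write_enables(["R1", "R2", "R1", "R1"])
print(enables)  # only the R2 writer and the last R1 writer update the RAT
```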
[Figure: ARF, RAT, PRF and Free Pool; the committing instruction writes R3, its value is held in T42, and the RAT entry for R3 points to T42]
Architected register file contains the committed/non-speculative processor state
When an instruction commits, it updates the ARF with the new value
The ARF now contains the correct value; update the RAT
T42 is no longer needed, return to the physical register free pool
[Figure: ARF, RAT, PRF and Free Pool; the committing instruction holds R3 in T42, but the RAT entry for R3 points to T17, a newer writer]
Update ARF as usual
Deallocate physical register
Don’t touch the RAT! (Someone else is the most recent writer to R3)
At some point in the future, the newer writer of R3 commits
This instruction was the most recent writer, now update the RAT
Deallocate physical register
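Both commit cases can be sketched in a few lines. This is an illustration under assumptions: the function name `commit`, the dictionary layout, and the convention that a missing RAT entry means "read the ARF" are all hypothetical. The RAT is only touched when the committing instruction is still the most recent writer.

```python
def commit(dest, tag, arf, prf, rat, free_pool):
    """Commit one instruction that wrote architected register `dest` via `tag`."""
    arf[dest] = prf[tag]         # ARF now holds the committed value
    if rat.get(dest) == tag:     # still the most recent writer?
        del rat[dest]            # yes: point the mapping back at the ARF
    free_pool.append(tag)        # the physical register is no longer needed


arf = {"R3": 0}
prf = {"T42": 7, "T17": 9}
rat = {"R3": "T17"}              # a newer writer of R3 has already been renamed
free_pool = []

commit("R3", "T42", arf, prf, rat, free_pool)   # older writer: RAT untouched
print(arf, rat, free_pool)
commit("R3", "T17", arf, prf, rat, free_pool)   # newest writer: RAT updated
print(arf, rat, free_pool)
```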
• Unified with the ROB
[Figure: the ROB holds instructions in program order from ROB_head (oldest) to ROB_tail; each ROB entry pairs an inst field with a PRF data field]
• Free registers = all entries from ROB_tail to
ROB_head – 1
• Instructions allocated into ROB in-order, so physical
registers also allocated in same order
– dst_i = T[ ROB_tail ]
– dst_i+1 = T[ (ROB_tail + 1) % ROB_size ]
– dst_i+2 = T[ (ROB_tail + 2) % ROB_size ]
– …
– dst_i+N-1 = T[ (ROB_tail + N-1) % ROB_size ]
• No need to explicitly manage free pool
– just increment ROB_tail as physical registers are allocated,
increment ROB_head as registers are deallocated
• Inefficiency: allocate registers to all instructions
– Branches, stores (and some other insts) don’t need physical
registers
• Asymmetric datapath – sometimes read values from
ARF, sometimes from the PRF
– requires both structures to be heavily ported
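A sketch of ROB-based allocation as a circular buffer, assuming new entries are taken at ROB_tail and the oldest entries are released at ROB_head. The class name `RobPrf` and the occupancy counter are illustrative additions. Physical register numbers are simply ROB indices, so no separate free pool is needed.

```python
class RobPrf:
    """Circular ROB whose entry index doubles as the physical register number."""

    def __init__(self, size):
        self.size = size
        self.head = 0    # oldest in-flight entry
        self.tail = 0    # next entry to allocate
        self.count = 0   # entries currently in flight

    def allocate(self, n):
        """Allocate n consecutive entries; return None if rename must stall."""
        if self.count + n > self.size:
            return None
        tags = [(self.tail + k) % self.size for k in range(n)]
        self.tail = (self.tail + n) % self.size
        self.count += n
        return tags

    def commit(self, n=1):
        """Retire up to n of the oldest entries, freeing their registers."""
        n = min(n, self.count)
        self.head = (self.head + n) % self.size
        self.count -= n


rob = RobPrf(8)
print(rob.allocate(4))   # entries 0..3
print(rob.allocate(3))   # entries 4..6
print(rob.allocate(2))   # only one entry left: stall
rob.commit(3)
print(rob.allocate(3))   # wraps around the end of the buffer
```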
• Combine both ARF and PRF into a single register file
– Before, ARF and PRF could be the same hardware
structure, but they have distinct name spaces
• e.g., ARF (R0-R7) mapped to T0-T7 and PRF mapped to T8-T99
– For a unified RF, the committed R0 could be mapped
anywhere (T0-T99)
• Need some way to track the “committed” state
[Figure: a Speculative RAT and a Committed RAT, each with entries R0-R7 pointing into the unified register file]
The committed RAT along with the registers it points at implement the logical equivalent of the ARF
The speculative RAT tracks the locations of the most recent version of each architected register
Both RATs may point to the same physical location (R0, R5): the most recent writer has also committed
[Figure: unified Register File T0-T23 with a ROB holding A-E, a Speculative RAT, a Committed RAT (both R0-R7) and a Free Pool; sources initially read T2, T4 and T7 for R2, R4 and R7]
A: R1 = R2 + R4      T8 = T2 + T4
B: R4 = R2 – R7      T9 = T2 – T7
C: R2 = R1 * R4      T10 = T8 * T9
D: R1 = R1 + #1      T11 = T8 + #1
E: R7 = R4 / R1      T1 = T9 / T11
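A sketch of renaming with a unified register file and two RATs. Several things here are assumptions for illustration: the class name `UnifiedRenamer`, the initial mapping of R0-R7 to T0-T7, the free pool modeled as a stack, and the choice to commit A before E is renamed. At commit, the previously committed version of the destination is dead and returns to the free pool, which is one way E could receive T1 in the example.

```python
class UnifiedRenamer:
    """Speculative + committed RAT over a single physical register file."""

    def __init__(self, num_arch, num_phys):
        self.spec = {f"R{i}": f"T{i}" for i in range(num_arch)}
        self.committed = dict(self.spec)
        # Stack-style free pool; the last element is the top, so T8 pops first.
        self.free = [f"T{i}" for i in range(num_phys - 1, num_arch - 1, -1)]
        self.rob = []  # (arch dest, phys dest) in program order

    def rename(self, dest, srcs):
        # Immediates such as "#1" are not in the RAT and pass through unchanged.
        src_tags = [self.spec.get(s, s) for s in srcs]
        tag = self.free.pop()
        self.spec[dest] = tag
        self.rob.append((dest, tag))
        return tag, src_tags

    def commit(self):
        dest, tag = self.rob.pop(0)              # oldest instruction
        self.free.append(self.committed[dest])   # old committed version is dead
        self.committed[dest] = tag


u = UnifiedRenamer(num_arch=8, num_phys=24)
a = u.rename("R1", ["R2", "R4"])
b = u.rename("R4", ["R2", "R7"])
c = u.rename("R2", ["R1", "R4"])
d = u.rename("R1", ["R1", "#1"])
u.commit()                                       # A commits, freeing T1
e = u.rename("R7", ["R4", "R1"])
print(a, b, c, d, e, sep="\n")
```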
• Previous example showed a stack data structure (LIFO)
[Figure: free-pool stack (TOS; entries T8, T17, T23, T25, T34, T1, T9, T13, T28) feeding a 4-wide Rename stage while freed registers arrive from commit; 3 regs allocated]
Stack HW is complex due to need to simultaneously read and write the top-of-stack
• A queue structure (FIFO) is easier to implement
– independent reading/writing of head and tail
[Figure: free-pool queue holding T8, T17, T23, T25, T34, T1, T9, T13, T28 between Pool Head and Pool Tail; 3 regs allocated, 2 regs deallocated]
• Corner case still exists when pool is empty
– Either stall rename for one cycle or need more complex
HW to bypass dealloc’d registers to the renamer
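A sketch of the FIFO free pool as a circular buffer, assuming allocation reads at the head and deallocation writes at the tail so the two ends never touch the same pointer. The class name `FreePoolFifo` and its methods are illustrative. The empty-pool corner case shows up as `allocate` returning `None`, which here stands for a rename stall.

```python
class FreePoolFifo:
    """Circular-buffer free pool: allocate at the head, release at the tail."""

    def __init__(self, capacity, initial=()):
        self.buf = [None] * capacity
        self.head = 0
        self.tail = 0
        self.count = 0
        for tag in initial:
            self.release(tag)

    def allocate(self):
        """Return a free physical register, or None if rename must stall."""
        if self.count == 0:
            return None
        tag = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        self.count -= 1
        return tag

    def release(self, tag):
        """Return a deallocated physical register to the pool."""
        if self.count == len(self.buf):
            raise OverflowError("free pool is full")
        self.buf[self.tail] = tag
        self.tail = (self.tail + 1) % len(self.buf)
        self.count += 1


pool = FreePoolFifo(4, ["T8", "T17"])
print(pool.allocate(), pool.allocate(), pool.allocate())  # third call finds the pool empty
pool.release("T23")
print(pool.allocate())
```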
[Figure: positions of the ARF state and the RAT state in the instruction stream around a mispredicted branch (br); after the flush the RAT state is marked ?!?]
ARF state corresponds to state prior to oldest non-committed instruction
As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction
On a branch misprediction, wrong-path instructions are flushed from the machine
The RAT is left with an invalid set of mappings corresponding to the wrong-path instruction state
[Figure: stall-and-drain recovery, showing the mispredicted branch (br), the flushed wrong path (X) and the first correct-path instruction (foo)]
Allow all instructions to execute and commit; ARF corresponds to last committed instruction
ARF now corresponds to the state right before the next instruction to be renamed (foo)
Reset RAT so that all mappings refer to the ARF
Resume renaming the new correct-path instructions from fetch
Pros: Very simple to implement
Cons: Performance loss due to stalls; correct-path instructions from fetch can’t rename because RAT is wrong
[Figure: ARF and RAT with a series of branches (br) before foo, each branch holding its own RAT checkpoint taken from a Checkpoint Free Pool]
At each branch, make a copy of the RAT (register mapping at the time of the branch)
On a misprediction:
1. flush wrong-path instructions
2. deallocate RAT checkpoints
3. recover RAT from checkpoint
4. resume renaming
• No need to stall front-end (?)
– need to “flash copy” RAT
• both for making checkpoints and recovering
– need some way to “hunt down” wrong-path checkpoints
for deallocation
• can “walk” the ROB, but this may take more than one cycle which
may introduce stalls; still faster than stall-and-drain
• More hardware
– need one checkpoint per branch
– what if the code has nothing but branches?
• worst case needs one checkpoint per ROB entry
• can assign one checkpoint per branch color
– stall front-end when out of branch colors/checkpoints
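A sketch of checkpoint-based recovery under assumptions: the class name `CheckpointedRat`, the branch identifiers, and the list of checkpoints ordered oldest first are all hypothetical. A full copy of the RAT is taken at each branch; when no checkpoint is free the front-end must stall; on a misprediction the RAT is restored and the checkpoints of that branch and all younger branches are released.

```python
class CheckpointedRat:
    """RAT with a bounded number of per-branch checkpoints (illustrative)."""

    def __init__(self, rat, max_checkpoints):
        self.rat = dict(rat)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []  # (branch_id, RAT copy), oldest first

    def on_branch(self, branch_id):
        """Take a checkpoint; return False if the front-end must stall."""
        if len(self.checkpoints) == self.max_checkpoints:
            return False
        self.checkpoints.append((branch_id, dict(self.rat)))
        return True

    def on_correct(self, branch_id):
        """Branch resolved correctly: its checkpoint can be released."""
        self.checkpoints = [c for c in self.checkpoints if c[0] != branch_id]

    def on_mispredict(self, branch_id):
        """Restore the RAT and drop this branch's and all younger checkpoints."""
        idx = next(i for i, (b, _) in enumerate(self.checkpoints) if b == branch_id)
        self.rat = self.checkpoints[idx][1]
        del self.checkpoints[idx:]


c = CheckpointedRat({"R1": "T1"}, max_checkpoints=2)
c.on_branch("br0")
c.rat["R1"] = "T9"           # wrong-path rename after br0
c.on_branch("br1")
c.rat["R1"] = "T12"          # wrong-path rename after br1
print(c.on_branch("br2"))    # False: out of checkpoints, stall
c.on_mispredict("br0")
print(c.rat, c.checkpoints)
```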
• Each register-writing ROB entry tracks two physical registers
1. Its allocated destination register
2. The previous physical register mapping for its architected register
• Example
– R1 mapped to T23
– Rename new instruction X, which overwrites R1
• R1 now mapped to T19
• X also records the value of an “undo mapping” of T23
– Recovery: walk ROB backwards applying the undo mappings
• Lower overhead: don’t need full copies of the RAT
• Slower?: need to walk the ROB
• Flexibility: can recover to any instruction; not just branches
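A sketch of recovery by walking the ROB backwards. The function names, the dictionary-shaped ROB entries, and the use of `None` for "no previous physical mapping" are assumptions for illustration. Each entry stores its new tag and the undo mapping; squashing the youngest entries first restores the RAT to the chosen point and also yields the physical registers to deallocate.

```python
def rename_with_undo(rat, rob, dest, new_tag):
    """Rename `dest` to `new_tag`, recording the previous mapping in the ROB."""
    rob.append({"dest": dest, "new": new_tag, "old": rat.get(dest)})
    rat[dest] = new_tag


def recover(rat, rob, keep):
    """Squash all ROB entries at index >= keep, youngest first.

    Returns the physical registers freed by the squashed instructions.
    """
    freed = []
    while len(rob) > keep:
        entry = rob.pop()
        if entry["old"] is None:
            rat.pop(entry["dest"], None)      # no earlier physical mapping
        else:
            rat[entry["dest"]] = entry["old"]  # apply the undo mapping
        freed.append(entry["new"])
    return freed


rat = {"R1": "T23"}
rob = []
rename_with_undo(rat, rob, "R1", "T19")   # instruction X from the example
rename_with_undo(rat, rob, "R2", "T5")
rename_with_undo(rat, rob, "R1", "T7")
print(recover(rat, rob, keep=1), rat)     # undo everything younger than X
print(recover(rat, rob, keep=0), rat)     # undo X as well
```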
• For ROB-based PRF, deallocation is simple:
– ROB_tail reset to point right after the mispredicted branch
• For unified RF, allocated registers may be anywhere in
the register file
[Figure: PReg Free Pool and Committed RAT, with wrong-path physical registers allocated after the mispredicted branches (br)]
Some sort of ROB walk still required to deallocate the wrong-path PRegs; do at same time with checkpoint deallocation
RAT: Highly ported SRAM
3N ports: 2N read, 1N write
Typical N=3,4
|ARF| = 60-100
|PRF| = 100±
Only 60-100 bytes, but 9-12 ports
SRAM latency typically quadratic w.r.t. #ports
1 entry per architected register: includes int, FP, MMX/SSE, lo/hi (MIPS), control registers, FP status, predicate registers (IA64), flags (x86), etc.
Each entry is $\lceil \log_2 |PRF| \rceil$ bits wide, plus 1 valid bit when RF not unified (!valid → register is in the ARF)
Dep Check Logic
Almost full pairwise dependency checks: $O(N^2)$ comparisons
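The sizing arithmetic above can be worked through in a few lines. This is a back-of-the-envelope sketch; the function name `rat_cost` and the sample numbers (80 architected registers, 128 physical registers, 4-wide rename) are assumptions picked from within the ranges quoted above.

```python
def rat_cost(num_arch, num_phys, width, unified):
    """Estimate RAT entry width, total storage and port counts."""
    tag_bits = (num_phys - 1).bit_length()       # ceil(log2(num_phys)) for num_phys > 1
    entry_bits = tag_bits + (0 if unified else 1)  # valid bit only when RF is not unified
    return {
        "entry_bits": entry_bits,
        "total_bytes": num_arch * entry_bits / 8,
        "read_ports": 2 * width,
        "write_ports": width,
    }


cost = rat_cost(num_arch=80, num_phys=128, width=4, unified=False)
print(cost)
```

With these sample numbers the table is only 80 bytes, yet it needs 12 ports, which is where the latency and area cost comes from.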
• SRAM lookup easily pipelined
• Dependency check is just combinatorial logic; easily
pipelined
• What if there’s a dependency between groups?
[Figure: rename split into two stages, REN1 and REN2; group ABCD moves from REN1 to REN2 while the next group EFGH enters REN1]
ABCD haven’t updated the RAT when EFGH reads the RAT
• Similar to intra-group dependency checking, now
must perform inter-group dependency checking
[Figure: REN1 and REN2 stages with groups ABCD and EFGH in flight. The RAT supplies the register mappings used if no dependencies exist; these are overridden if a dependency exists between ABCD and EFGH.]
[Figure: the original renaming logic split across more pipeline stages: 1ns/cycle (1GHz), 0.5ns/cycle (2GHz), 0.32ns/cycle (3.14GHz); each added stage carries extra overhead due to pipelined rename]
• More stages
– higher branch mispredict penalty
– a lot more implementation complexity
• dep check with previous group, prev-prev group, etc.
• pipeline control logic, latching overhead
• more circuits (↑ area, ↑ power), more design effort
• Higher frequency
– more performance if pipeline not overly exposed
• need sufficiently high branch prediction accuracy
• power goes up even more ($P = \frac{1}{2} C V^2 f a$)
– This is on top of the extra power for the extra circuits
– Extra logic effectively increases the C term
• How big should the physical register file be?
– ROB-based: PRF entries == ROB entries
– Unified: ???
• Should have one register per instruction
– How to count instructions?
– Every instruction from rename to retire
• instructions in fetch/decode stages haven’t been renamed, and
therefore don’t need physical registers
• Not every instruction needs a register (branches, stores)
• How many instructions does this add up to?
– N × Stages(Rename to Dispatch) + ROB_size
– Less those expected to not need destinations
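The sizing rule above as a small calculation. The function name `prf_entries`, the parameter `frac_no_dest` (the expected fraction of instructions without a destination register) and the sample numbers are assumptions for illustration, not figures from the lecture.

```python
import math


def prf_entries(width, rename_to_dispatch_stages, rob_size, frac_no_dest=0.0):
    """Estimate physical registers needed: N x Stages(Rename to Dispatch) + ROB_size,
    less the instructions expected to not need destinations."""
    in_flight = width * rename_to_dispatch_stages + rob_size
    return math.ceil(in_flight * (1.0 - frac_no_dest))


print(prf_entries(width=4, rename_to_dispatch_stages=2, rob_size=128))
print(prf_entries(width=4, rename_to_dispatch_stages=2, rob_size=128, frac_no_dest=0.25))
```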
[Figure: pipeline stages IF, ID, REN, Disp, RS, ROB, Commit, divided into four regions of a physical register's lifetime]
1. No register allocated
2. Register allocated, but contents are bogus
   PRF needs to be large enough for all instructions in Region 2, but none of the registers will contain anything useful!
3. Register contains valid data
   This is the only time a physical storage location is really needed
   Actually, only needed until last consumer reads the value
4. Overwriter commits; register has stale value; deallocate