5, Introduction to OOO Execution and Register Renaming

Out-of-Order Execution
Scheduling
A. Moshovos ©
ECE1773 - Fall ‘07 ECE Toronto
Instruction Level Parallel Processing
• Sequential Execution Semantics
• Out-of-Order Execution
– How it can help
– Issues:
• Maintaining Sequential Semantics
• Scheduling
– Scoreboard
• Register Renaming
• Initially, we’ll focus on Registers, Memory later on
Sequential Semantics - Review
• Instructions appear as if they executed:
– In the order they appear in the program
– One after the other
[Figure: program order versus execution order under pipelining, superscalar, and out-of-order execution]
Out-of-Order Execution
do {
    sum += a[++m];
    i--;
} while (i != 0);

loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

[Figure: fetch/decode/execute timelines of the loop body on a superscalar pipeline and on an out-of-order pipeline]
Sequential Semantics?
• Execution does NOT adhere to sequential semantics
[Figure: out-of-order timeline of add, ld, add, sub, bne; the intermediate states are marked "inconsistent", the final state "consistent"]
• To be precise: Eventually it may
• Simplest solution: Define problem away
• Not acceptable today: e.g., Virtual Memory
• Three-phase Instruction execution
– In-Progress, Completed and Committed
Out-of-order Execution Issues
• Preserving Sequential Semantics
• Stalling Instructions w/ dependences
• Issuing Instructions when dependences are satisfied
Back to Sequential Semantics
• Instr. exec. in 3 phases:
– In-progress, Completed, Committed
– OOO for in-progress and Completed
– In-order Commits
• Completed - out-of-order: “Visible only inside”
– Results visible to subsequent instructions
– Results not visible to outsiders
• On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
– Results visible to subsequent instructions
– Results visible to outsiders
• On interrupt committed results are preserved
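
The three phases above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class `Window` and the instruction labels `add1` and `add2` are hypothetical names, and results are reduced to a single "done" flag.

```python
from collections import deque


class Window:
    """Instructions complete in any order but commit strictly in program order."""

    def __init__(self, names):
        # One record per in-progress instruction, kept in program order.
        self.entries = deque({"name": n, "done": False} for n in names)
        self.committed = []

    def complete(self, name):
        # Completed: the result is visible inside the processor only.
        for e in self.entries:
            if e["name"] == name:
                e["done"] = True

    def commit(self):
        # Committed: retire from the head only while the oldest one has completed.
        while self.entries and self.entries[0]["done"]:
            self.committed.append(self.entries.popleft()["name"])


w = Window(["add1", "ld", "add2", "sub", "bne"])
w.complete("sub")        # completes early, out of order
w.commit()
print(w.committed)       # [] : sub has completed but cannot commit yet
w.complete("add1")
w.commit()
print(w.committed)       # ['add1']
for name in ("ld", "add2", "bne"):
    w.complete(name)
w.commit()
print(w.committed)
```

On an interrupt, everything still in `entries` would be discarded, while everything in `committed` is preserved.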
How Completes Help w/ Performance
DIV R3, _, _
ADD R1, _, _
ADD _, R1, _
[Figure: timelines of the three instructions with in-order completes versus out-of-order completes and in-order commits, followed by the loop example (add, ld, add, sub, bne) completing out of order and committing in order]
Implementing Completes/Commits
• Key idea:
– Maintain sufficient state around to be able to rollback when necessary
– Roll-back:
• Discard (aka Squash) all not committed
• One solution (conceptual):
– Upon Complete, the instruction records the previous value of its target register
– Upon Discard, instruction restores target value
– Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms
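
The conceptual solution above can be sketched as an undo log. This is a minimal illustration, not the mechanism a real processor uses: the names `UndoLog`, `complete`, `commit_oldest` and `squash` are hypothetical, and the sketch assumes that log records are appended in program order.

```python
class UndoLog:
    """Conceptual rollback: each completing instruction saves the old target value."""

    def __init__(self, regs):
        self.regs = regs     # register file, e.g. {"r1": 5}
        self.log = []        # (target, previous value) records, oldest first

    def complete(self, target, value):
        # Upon complete: record the previous value, then update the register.
        self.log.append((target, self.regs[target]))
        self.regs[target] = value

    def commit_oldest(self):
        # Upon commit: nothing to restore, so the oldest record is simply dropped.
        self.log.pop(0)

    def squash(self):
        # Upon discard: undo all uncommitted writes, youngest first.
        while self.log:
            target, old = self.log.pop()
            self.regs[target] = old


regs = {"r1": 5, "r2": 7}
undo = UndoLog(regs)
undo.complete("r1", 10)    # oldest instruction
undo.complete("r2", 20)
undo.complete("r1", 30)    # youngest instruction
undo.commit_oldest()       # the first write to r1 becomes permanent
undo.squash()              # roll back the two uncommitted writes
print(regs)                # {'r1': 10, 'r2': 7}
```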
Out-of-Order Execution Overview
[Figure: program form versus processing phase. The static program becomes a dynamic instruction stream (trace) at dispatch, where dependences are determined. Instructions in the execution window are in-progress through issue and execution. Completed instructions are reordered and then committed.]
Out-of-Order Execution: Stages
• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders
• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue
OOO Scheduling
• Instruction @ Decode:
– Do I have dependences yet to be satisfied?
– Yes, stall until they are
– No, clear to issue
• Wakeup Instructions Stalled:
– Dependences satisfied
– Allow instruction to issue
• Dependence:
– (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR
Stalling @ Decode for RAW
• Are there unsatisfied dependences?
– RAW: have to wait for register value
– We don’t really care who is producing the value
– Only whether it is available
• Can use the Register Availability Vector (RAV) as in pipelining/superscalar
– Also known as scoreboard
• At Decode
– Reset bit corresponding to your target
– At writeback set
– Check all bits for source regs: if any is 0 stall
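
A minimal sketch of the availability vector follows. The class and method names are hypothetical. The sketch checks the source bits before resetting the target bit, so that an instruction such as add r3, r3, r2 does not stall on its own target.

```python
class RegisterAvailabilityVector:
    def __init__(self, regs):
        self.avail = {r: 1 for r in regs}    # 1 = value available

    def try_decode(self, target, sources):
        # Check all source bits first: any 0 is an unsatisfied RAW, so stall.
        if any(self.avail[s] == 0 for s in sources):
            return False
        if target is not None:
            self.avail[target] = 0           # reset: the result is now pending
        return True

    def writeback(self, target):
        self.avail[target] = 1               # set: the value is available again


rav = RegisterAvailabilityVector(["r1", "r2", "r3", "r4"])
ok_ld = rav.try_decode("r2", ["r4"])                 # ld  r2, 10(r4)
ok_add_first = rav.try_decode("r3", ["r3", "r2"])    # add r3, r3, r2 stalls on r2
rav.writeback("r2")                                  # the load writes back
ok_add_retry = rav.try_decode("r3", ["r3", "r2"])
print(ok_ld, ok_add_first, ok_add_retry)             # True False True
```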
Issuing Instructions: Scheduling
• Determine when an instruction can issue
– Ignore resources for the time being
• Stalled because of RAW w/ preceding instruction
• Concept:
– Producer (write) notifies consumers (read)
• Requirements:
– Consumers need to be able to identify producer
– The register name is one possible link
• Mechanism
– Consumer placed in a reservation station
– Producer, on complete, broadcasts its identity
– Waiting instructions observe
– Update Operand Availability
– Issue if all operands now available
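
The mechanism above can be sketched as follows, using the register name as the link between producer and consumers. `Station`, `place` and `broadcast` are hypothetical names, and resources are ignored, as on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Station:
    op: str
    target: str
    waiting_on: set = field(default_factory=set)   # source registers not yet available
    status: str = "Wait"


def place(stations, op, target, sources, pending):
    """Place a consumer in a reservation station; `pending` holds registers still being produced."""
    s = Station(op, target, {r for r in sources if r in pending})
    s.status = "Wait" if s.waiting_on else "Rdy"
    stations.append(s)
    return s


def broadcast(stations, reg):
    """A completing producer announces its identity (here, its target register)."""
    for s in stations:
        if s.status == "Wait" and reg in s.waiting_on:
            s.waiting_on.discard(reg)       # update operand availability
            if not s.waiting_on:
                s.status = "Rdy"            # all operands available: may issue


stations = []
pending = {"r2"}                            # ld r2, 10(r4) is still executing
add = place(stations, "add", "r3", ["r3", "r2"], pending)
sub = place(stations, "sub", "r1", ["r1"], pending)
print(add.status, sub.status)               # Wait Rdy
broadcast(stations, "r2")                   # the load completes
print(add.status)                           # Rdy
```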
Reservation Station
• State pertaining to an instruction
– What registers it reads
– Whether they are available
– What is the destination register
– What state is the instruction in
• Waiting
• Executing
Out-Of-Order Exec. Example
loop: add r4, r4, 4
      ld  r2, 10(r4)      (4 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1=1  r2=1  r3=1  r4=1

op    src1    src2    tgt     status
(empty)

Cycle 0 (window empty)
Out-Of-Order Exec. Example: Cycle 0
loop: add r4, r4, 4
      ld  r2, 10(r4)      (5 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1=1  r2=1  r3=1  r4=0

op    src1    src2    tgt     status
add   r4/1    NA/1    r4/0    Rdy

Notes: add is ready to be executed.
Cycle 1
RAV:  r1=1  r2=0  r3=1  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Exec
ld    r4/1    NA/1    r2      Rdy

Notes: R4 gets produced now. Notify those waiting for R4.
Cycle 2
RAV:  r1=1  r2=0  r3=0  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Exec
add   r3/1    r2/0    r3      Wait

Notes: ld: result available @ cycle 6. add: wait for r2.
Cycle 3
RAV:  r1=0  r2=0  r3=0  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Exec
add   r3/1    r2/0    r3      Wait
sub   r1/1    NA/1    r1      Rdy

Notes: ld: result available @ cycle 6. add: wait for r2. sub: no dependences.
Cycle 4
RAV:  r1=1  r2=0  r3=0  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Exec
add   r3/1    r2/0    r3      Wait
sub   r1/1    NA/1    r1      Exec
bne   r1/1    r0/1    NA      Rdy

Notes: ld: result available @ cycle 6. add: wait for r2. sub: r1 produced now, notify consumers. bne: r1 will be available next cycle.
Cycle 5
RAV:  r1=1  r2=0  r3=0  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Exec
add   r3/1    r2/0    r3      Wait
sub   r1/1    NA/1    r1      Compl
bne   r1/1    r0/1    NA      Exec

Notes: ld: result available @ cycle 6. add: wait for r2. sub: completed. bne: executing.
Cycle 6
RAV:  r1=1  r2=1  r3=0  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Exec
add   r3/1    r2/1    r3      Rdy
sub   r1/1    NA/1    r1      Compl
bne   r1/1    r0/1    NA      Exec

Notes: ld: result available @ cycle 6, notify consumers. add: the wait for r2 ends (r2/1). sub: completed. bne: executing.
Cycle 7
RAV:  r1=1  r2=1  r3=1  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Cmtd
add   r3/1    r2/1    r3      Exec
sub   r1/1    NA/1    r1      Compl
bne   r1/1    r0/1    NA      Compl

Notes: Notify consumers. add: executing. bne: completed.
Cycle 8
RAV:  r1=1  r2=1  r3=1  r4=1

op    src1    src2    tgt     status
add   r4/1    NA/1    r4      Cmtd
ld    r4/1    NA/1    r2      Cmtd
add   r3/1    r2/1    r3      Cmtd
sub   r1/1    NA/1    r1      Cmtd
bne   r1/1    r0/1    NA      Cmtd
Notifying Consumers
• Identity of Producer
• Uniquely Identify the Instruction
• Easily retrievable @ decode by others
– Target Register
• Recall we stall on WAR or WAW
– Functional Unit
• If not pipelined
– Place in instruction window
– PC? not. Why?
Name Dependences and OOO
• WAW or WAR: We need to update a register but others are still using it
– add r1, r1, 10
– sw r1, 20(r2)
– add r1, r3, 30
– sub r2, r1, 40
• There is only one r1
– sw needs to see the value of 1st add
– sub needs to wait for 2nd add and not 1st
• Solution: Stall decode when WAW or WAR
Detecting WAW and WAR
• WAW? Look at Scoreboard
– If bit is 0 then there is a pending write
– Stall
• WAR? Need to know whether all preceding
consumers have read the value
– Keep a count per register
– Increase at decode for all reads
– Decrease on issue
• More elegant solution via register renaming
– Soon
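
The two checks above can be sketched with one pending-write bit and one reader count per register. `HazardTracker` and its methods are hypothetical names. The example uses the add/sw/add sequence from the previous slide.

```python
class HazardTracker:
    def __init__(self, regs):
        self.write_pending = {r: False for r in regs}   # True when the scoreboard bit is 0
        self.readers = {r: 0 for r in regs}             # reads decoded but not yet issued

    def can_decode(self, target):
        if self.write_pending[target]:
            return False     # WAW: an earlier write to target is still pending
        if self.readers[target] > 0:
            return False     # WAR: earlier consumers have not read target yet
        return True

    def decode(self, target, sources):
        for s in sources:
            self.readers[s] += 1            # increase at decode for all reads
        if target is not None:
            self.write_pending[target] = True

    def issue(self, sources):
        for s in sources:
            self.readers[s] -= 1            # decrease on issue

    def writeback(self, target):
        self.write_pending[target] = False


t = HazardTracker(["r1", "r2", "r3"])
t.decode("r1", ["r1"])            # add r1, r1, 10
t.issue(["r1"])
t.decode(None, ["r1", "r2"])      # sw r1, 20(r2): reads r1 and r2, writes no register
stalled_waw = not t.can_decode("r1")   # add r1, r3, 30 must wait for the 1st add
t.writeback("r1")                      # 1st add writes r1
stalled_war = not t.can_decode("r1")   # sw has not read r1 yet
t.issue(["r1", "r2"])                  # sw issues and reads r1
clear = t.can_decode("r1")
print(stalled_waw, stalled_war, clear)  # True True True
```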
Window vs. Scheduler
• Window
– Distance between oldest and youngest instruction that can co-exist inside the CPU
– Larger window → Potential for more ILP
• Scheduler
– Number of instructions that are waiting to be
issued
• Window
– Instructions enter at Fetch
– Exit at Commit
• Scheduler
– Instructions enter at Decode
– Leave at writeback
• Window >= Scheduler
– Can be the same structure
• In window but not in scheduler → completed
Scoreboarding
• Schedule based on RAW dependences
• WAW and WAR cause stalls
– WAW at decode
– WAR at writeback
• Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
– 18 non-pipelined FUs
• 4 FP: 2 mul, 1 add, 1 div
• 7 MEM: 5 load, 2 store
• 7 INT: add, shift, logical etc.
• Centralized Control Scheme
– Controls all Instruction Issue
– Detects all hazards
MIPS/DLX w/ Scoreboarding
[Figure: register file connected to two FP multipliers, an FP divider, an FP adder and an integer unit, all controlled by the scoreboard]
Scoreboarding Overview
• Ignore IF and MEM for simplicity
• 4-stage execution
– Issue
• Check for structural hazards
• Check for WAW hazards
• Stall until all clear
– ReadOp
• Check for RAW hazards
• Wait until all operands ready
• Read Registers
– Execute
• Execute Operations
• Notify scoreboard when complete
– Write
• Check for WAR hazards
• Stall Write until all clear
• A completing instruction cannot write dest if an earlier
instruction has not read dest.
Scoreboarding Optimizations/Tricks
• WAW as in original OOO
• WAR is optimized
– Second Producer is allowed to execute up to
complete
– It is stalled there until preceding consumers
complete
• No Commit
– No precise interrupts
• Window is implemented in the scoreboard
• One entry per Functional Unit
– Recall not pipelined
– Instructions identified by FU id
Scoreboarding Organization
• Three structures
– Instruction Status
– Functional Unit Status
– Register Result Status
• Instruction Status
– Which stage the instruction is currently in
• Functional Unit Status: scheduling
– Busy
– OP: Operation
– Fi: Dest. Reg.
– Fj, Fk: Source Regs
– Qj, Qk: FUs producing sources
– Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
– Which FU will produce a register
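
The structures above, together with the issue, read-operands and write checks, can be sketched as follows. This is a simplified illustration rather than the CDC 6600 logic: the class and method names are hypothetical, the instruction status table and execution latencies are omitted, and an empty string stands for "no register" or "no FU".

```python
from dataclasses import dataclass


@dataclass
class FUStatus:
    busy: bool = False
    op: str = ""
    fi: str = ""        # destination register
    fj: str = ""        # source registers
    fk: str = ""
    qj: str = ""        # FUs producing the sources ("" means none)
    qk: str = ""
    rj: bool = False    # sources ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names):
        self.fu = {name: FUStatus() for name in fu_names}
        self.result = {}     # register result status: register -> producing FU

    def issue(self, fu, op, fi, fj, fk):
        if self.fu[fu].busy or fi in self.result:
            return False     # structural or WAW hazard: stall
        qj, qk = self.result.get(fj, ""), self.result.get(fk, "")
        self.fu[fu] = FUStatus(True, op, fi, fj, fk, qj, qk, qj == "", qk == "")
        self.result[fi] = fu
        return True

    def read_operands(self, fu):
        s = self.fu[fu]
        if not (s.rj and s.rk):
            return False     # RAW hazard: a source is not ready yet
        s.rj = s.rk = False  # operands have been read
        return True

    def write_result(self, fu):
        fi = self.fu[fu].fi
        for other, s in self.fu.items():
            # WAR hazard: another unit is ready to read fi but has not done so yet.
            if other != fu and ((s.fj == fi and s.rj) or (s.fk == fi and s.rk)):
                return False
        for s in self.fu.values():   # wake up consumers waiting on this unit
            if s.qj == fu:
                s.qj, s.rj = "", True
            if s.qk == fu:
                s.qk, s.rk = "", True
        del self.result[fi]
        self.fu[fu] = FUStatus()
        return True


sb = Scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
first_ld = sb.issue("Integer", "LD", "F6", "", "R2")
second_ld = sb.issue("Integer", "LD", "F2", "", "R3")    # Integer unit is busy
sb.read_operands("Integer")
sb.write_result("Integer")
second_ld_retry = sb.issue("Integer", "LD", "F2", "", "R3")
mult = sb.issue("Mult1", "MULTD", "F0", "F2", "F4")
mult_before = sb.read_operands("Mult1")                  # F2 not produced yet
sb.read_operands("Integer")
sb.write_result("Integer")
mult_after = sb.read_operands("Mult1")
print(first_ld, second_ld, second_ld_retry, mult, mult_before, mult_after)
```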
Scoreboarding explained
• Register status reg:
– Which FU produces the register
• Use at decode
– Source reg match is a RAW
– Target reg match is a WAW stall
Functional Unit Status
• Busy:
– resource allocation
• OP:
– what to do once issued (e.g., add, sub)
• Dest. Reg.:
– Where to write result
– To find WAR
• Fj, Fk: Source Regs
– for WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk: FUs producing sources
– To wait for appropriate producer
• Rj, Rk: Ready bits for sources
– To determine when ready: all ready
Scoreboarding Example
Instruction status

Instruction          j     k     Issue   Read operands   Execution complete   Write Result
LD     F6            34+   R2
LD     F2            45+   R3
MULTD  F0            F2    F4
SUBD   F8            F6    F2
DIVD   F10           F0    F6
ADDD   F6            F8    F2

Functional Unit Status

Name      Busy   Op   Fi (dest)   Fj (S1)   Fk (S2)   Qj (FU for j)   Qk (FU for k)   Rj (Fj?)   Rk (Fk?)
Integer   No
Mult1     No
Mult2     No
Add       No
Divide    No

Register result status

Clock   F0   F2   F4   F6   F8   F10   F12   ...   F30
FU
Example: Cycle 0
Instruction status

Instruction          j     k     Issue   Read operands   Execution complete   Write Result
LD     F6            34+   R2    1
LD     F2            45+   R3
MULTD  F0            F2    F4
SUBD   F8            F6    F2
DIVD   F10           F0    F6
ADDD   F6            F8    F2

Functional Unit Status

Name      Busy   Op   Fi (dest)   Fj (S1)   Fk (S2)   Qj (FU for j)   Qk (FU for k)   Rj (Fj?)   Rk (Fk?)
Integer   yes    LD   F6
Mult1     No
Mult2     No
Add       No
Divide    No

Register result status

Clock   F0   F2   F4   F6        F8   F10   F12   ...   F30
FU                     integer
Example, contd.
• The rest you’ll find on the web site
• Go through it
• Source: Patterson
• Summary:
– Execution proceeds in an order dictated by dependences
– RAW, WAR and WAW force ordering
– Tricks may be possible
Beyond Simple OoO
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

[Figure: dependence graph over instructions A, B, C, D and E]
• E will wait for B, C and D.
• WAR w/ C and D
• WAW w/ B
• Can we do better?
What if we had infinite registers
Original:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

With a fresh name for E’s target:
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4

No false dependences anymore
Since we do not reuse a name we can’t have WAW and WAR
Why we can’t have Infinite Registers
• False/Name dependences (WAR and WAW)
– Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
– Well there is (in a sec.)
– Computers execute billions of instructions per second. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several
instances of the same code
– Loops, recursive functions (most frequent part)
Yes, there is “large enough”
• At any given point there will be a finite number of
instructions in the window
• if each instruction has a single register target
• if there are N instructions
• How many registers do we need?
• N?
• N + X?
Register Renaming
• Register Version
– Every Write creates a new version
– Uses read the last version
– Need to keep a version until all uses have read it.
• Register Renaming:
– Architectural vs. Physical Registers
• more phys. than arch.
– Maintain a map of arch. to phys. regs.
– Use in-order decoding to properly identify dependences.
– Instructions wait only for input op. availability.
– Only last version is written to reg. file.
Register Renaming
Register Rename Table (state after each instruction is decoded)

Instruction            Renamed       F0   F1   F2   F3   F4   F5   F6   F7   ...  F30
A: DIVF F3, F1, F0     r1, -, -                     R1
B: SUBF F2, F1, F0     r2, -, -                R2   R1
C: MULF F0, F2, F4     r3, r2, -     R3        R2   R1
D: SUBF F6, F2, F3     r4, r2, r1    R3        R2   R1             R4
E: ADDF F2, F5, F4     r5, -, -      R3        R5   R1             R4
F: ADDF F0, F0, F2     r6, r3, r5    R6        R5   R1             R4

Need more physical registers than architectural
Ignore control flow for the time being.
Register Renaming Process
• Only need to remember last producer of each
architectural register
– Vector
• At decode
– Find the most recent producers for all source
registers
– After: declare self as most recent producer of
target register
• Complication:
– May have to retract
• Speculative Execution, e.g., interrupts
– Need to be able to restore the mapping state
Register Renaming Support Structures
• Register Rename Table
– f(aR) = pR
– one entry per architectural Register
• Free Register List
– Lists not used Physical Registers
• At Decode
– grab a new register from the free list
– Change mapping in rename table
• At Commit
– Release Register? Not… Why?
– Could release previous version
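
The two structures above can be sketched as a dictionary and a queue. `Renamer`, `rename` and `commit` are hypothetical names, and the sketch assumes that a source with no entry in the rename table is read from the architectural register file, shown as "-". Run on instructions A to F from the renaming example, it produces the names r1 to r6.

```python
from collections import deque


class Renamer:
    def __init__(self, num_phys):
        self.table = {}      # rename table: architectural register -> physical register
        self.free = deque(f"r{i}" for i in range(1, num_phys + 1))   # free register list

    def rename(self, dest, srcs):
        # Sources first: find the most recent producer of each source register.
        srcs = [self.table.get(s, "-") for s in srcs]
        if not self.free:
            raise RuntimeError("free list empty: decode would have to stall")
        prev = self.table.get(dest)                 # previous version of dest, if any
        self.table[dest] = self.free.popleft()      # declare self as most recent producer
        return self.table[dest], srcs, prev

    def commit(self, prev):
        # The previous version can be released once its overwriter commits.
        if prev is not None:
            self.free.append(prev)


program = [
    ("F3", ["F1", "F0"]),   # A: DIVF F3, F1, F0
    ("F2", ["F1", "F0"]),   # B: SUBF F2, F1, F0
    ("F0", ["F2", "F4"]),   # C: MULF F0, F2, F4
    ("F6", ["F2", "F3"]),   # D: SUBF F6, F2, F3
    ("F2", ["F5", "F4"]),   # E: ADDF F2, F5, F4
    ("F0", ["F0", "F2"]),   # F: ADDF F0, F0, F2
]
renamer = Renamer(num_phys=8)
renamed = [renamer.rename(dest, srcs) for dest, srcs in program]
for dest, srcs, prev in renamed:
    print(dest, srcs, "previous version:", prev)
```

With in-order commit, all earlier readers of the previous version have committed by the time the overwriting instruction commits, which is why releasing the previous version at that point is safe.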
How Many Physical Registers?
• Correctness:
– At least as many as architectural, plus?
• Performance:
– As many as possible
– Not correctness
– Recall not all instructions produce register results
• stores and branches
Dynamic Scheduling
A: DIVF F3, F1, F0     r1, -, -
B: SUBF F2, F1, F0     r2, -, -
C: MULF F0, F2, F4     r3, r2, -
D: SUBF F6, F2, F3     r4, r2, r1
E: ADDF F2, F5, F4     r5, -, -
F: ADDF F0, F0, F2     r6, r3, r5

[Figure: result bus carrying a Name and a Value]
- Values and Names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available
Physical Registers
• Physical register file is just one option
• What we need is separate storage
– Consumers could keep values in their reservation
station
– Tomasulo’s next
Tomasulo’s Algorithm
• IBM 360/91 - Fast 360 for scientific code
– Completed in 1967
– Dynamic scheduling
– Predates cache memories
• Pipelined FUs
– Adder up to 3 instructions
– Multiplier up to 2 instructions
• Tomasulo vs. Scoreboard
– Distributed hazard detection and control
– Results are bypassed to FUs
– Common Data Bus (CDB) for results
• All results visible to all instead of via a register
DLX w/ Tomasulo
• Tomasulo’s Algorithm
– Use “tags” to identify data values
– Reservation stations: distributed control
– CDB broadcasts all results to all RSs
• Extend DLX as example
– Assume multiple FUs rather than pipelined FUs
– Main difference is Register-Memory Insts.
• I.e., DLX does not have them
• But that’s really a detail :-)
• Physical Registers?
– Not really. What we need is different storage and name for
every version.
– Here it’s the producing reservation station
Dynamic DLX
[Figure: the operation stack and the registers feed reservation stations (RS) in front of the adders and the multipliers; load buffers and store buffers connect to memory; all results are broadcast on the CDB]
Tomasulo’s Algorithm
• 3 major steps
– Dispatch
• Get instruction from fetch queue
• ALU op: check for available RS
• Load: Check for available load buffer
• If available: dispatch and copy read regs to RS or load
buffer
• if not: stall - structural hazard
– Issue
• If all ops are available: issue
• If not monitor CDB for operands
– Complete
• If CDB available, broadcast result
• else stall
Tomasulo’s Algorithm contd.
• Reservation stations
– Handle distributed hazard detection and instruction control
• Everything receiving data gets a tag
– 4-bit tag specifies reservation station or load buffer
– Also which FU will produce result
• Register specifier is used to assign tags
– Then they are discarded
– Input register specifiers are ONLY used in dispatch.
(Rename table)
• Common Data Bus:
– value + “tag” = where this comes from
– vs. typical bus: value + “tag” = where this goes to
Tomasulo’s Algorithm Contd.
• Reservation Stations
– Op: Opcode
– Qj, Qk: Tag fields (source ops)
– Vj, Vk: Operand values (source ops)
– Busy: Currently in use
• Register file and Store Buffer
– Qi: Tag field
– Busy: Currently in use
– Vi: Value
• Load Buffers
– Busy Currently in Use
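
The tag and value fields above can be sketched as follows. This is a simplified illustration, not the IBM 360/91 design: the class `Tomasulo` and its methods are hypothetical names, execution is left out (results are supplied by hand), load and store buffers are omitted, and a tag of `None` stands for "the register file has the value".

```python
class Tomasulo:
    def __init__(self, regs, num_rs):
        self.value = dict(regs)                 # Vi: register file values
        self.tag = {r: None for r in regs}      # Qi: producing station, None = value is here
        self.rs = {f"RS{i}": None for i in range(num_rs)}

    def dispatch(self, op, dest, src1, src2):
        free = [name for name, entry in self.rs.items() if entry is None]
        if not free:
            return None                          # structural hazard: stall
        entry = {"op": op, "q": [None, None], "v": [None, None]}
        for i, src in enumerate((src1, src2)):
            if isinstance(src, int):             # immediate operand
                entry["v"][i] = src
            elif self.tag[src] is None:          # value available: copy it into the station
                entry["v"][i] = self.value[src]
            else:                                # otherwise remember the producer's tag
                entry["q"][i] = self.tag[src]
        self.rs[free[0]] = entry
        self.tag[dest] = free[0]                 # rename dest to this station
        return free[0]

    def ready(self, name):
        return self.rs[name]["q"] == [None, None]

    def broadcast(self, name, result):
        # CDB: value + tag of the producer, observed by all stations and the register file.
        for entry in self.rs.values():
            if entry is None:
                continue
            for i in (0, 1):
                if entry["q"][i] == name:
                    entry["q"][i], entry["v"][i] = None, result
        for reg, tag in self.tag.items():
            if tag == name:
                self.value[reg], self.tag[reg] = result, None
        self.rs[name] = None                     # free the station


cpu = Tomasulo({"r1": 1, "r2": 2, "r3": 3, "r4": 4}, num_rs=3)
a = cpu.dispatch("add", "r1", "r2", 10)    # add r1, r2, 10 -> RS0
b = cpu.dispatch("sub", "r4", "r1", 20)    # sub r4, r1, 20 -> RS1, waits on RS0
c = cpu.dispatch("add", "r1", "r3", 30)    # add r1, r3, 30 -> RS2, r1 renamed again
sub_ready_before = cpu.ready(b)
cpu.broadcast(a, 2 + 10)                   # RS0 result: only the sub wants it
sub_ready_after = cpu.ready(b)
r1_after_first_add = cpu.value["r1"]       # unchanged: r1 now belongs to RS2
cpu.broadcast(c, 3 + 30)
cpu.broadcast(b, 12 - 20)
print(sub_ready_before, sub_ready_after, r1_after_first_add, cpu.value)
```

Because the register file only accepts a result whose tag matches its current Qi, the first add never overwrites r1 after the second add has renamed it, which is how WAW is removed without stalling.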
Tomasulo’s: Understanding Speculative vs. Architectural State
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
Register file (one entry per arch. reg. name)

reg   tag          value
r1    I have it    Value of r1
r2    I have it    Value of r2
r3    I have it    Value of r3
r4    I have it    Value of r4

Reservation Stations (all empty)

tgt (arch. reg. name)   src1 tag   src1 value      src2 tag   src2 value
NA                      NA         Value of Src1   NA         Value of Src2

A tag can be: “I have it” or a “reservation station id”
Renaming 1st Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
• Read sources (r2)
• Rename r1 to RS0

Register file

reg   tag          value
r1    RS0          -----
r2    I have it    Value of r2
r3    I have it    Value of r3
r4    I have it    Value of r4

Reservation Stations

RS    tgt   src1 tag     src1 value      src2 tag     src2 value
RS0   r1    I have it    Value of R2     I have it    10
RS1   NA    NA           Value of Src1   NA           Value of Src2
RS2   NA    NA           Value of Src1   NA           Value of Src2
Renaming 2nd Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
• Sources: r1 in RS0 NYA
• Rename r4 to RS1

Register file

reg   tag          value
r1    RS0          -----
r2    I have it    Value of r2
r3    I have it    Value of r3
r4    RS1          ----

Reservation Stations

RS    tgt   src1 tag     src1 value      src2 tag     src2 value
RS0   r1    I have it    Value of R2     I have it    10
RS1   r4    RS0          ----            I have it    20
RS2   NA    NA           Value of Src1   NA           Value of Src2
Renaming 3rd Instruction
• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30
• Sources: r3 Avail.
• Rename r1 to RS2

Register file

reg   tag          value
r1    RS2          -----
r2    I have it    Value of r2
r3    I have it    Value of r3
r4    RS1          ----

Reservation Stations

RS    tgt   src1 tag     src1 value      src2 tag     src2 value
RS0   r1    I have it    Value of R2     I have it    10
RS1   r4    RS0          ----            I have it    20
RS2   r1    I have it    Value of R3     I have it    30
Example: cycle 0
Instruction status

Instruction          j     k     Issue   Execution complete   Write Result
LD     F6            34+   R2
LD     F2            45+   R3
MULTD  F0            F2    F4
SUBD   F8            F6    F2
DIVD   F10           F0    F6
ADDD   F6            F8    F2

Reservation Stations

Time   Name    Busy   Op   Vj (S1)   Vk (S2)   Qj (RS for j)   Qk (RS for k)
0      Add1    No
0      Add2    No
0      Add3    No
0      Mult1   No
0      Mult2   No

Load buffers

Name    Busy   Address
Load1   No
Load2   No
Load3   No

Register result status

      F0   F2   F4   F6   F8   F10   ...
FU
Example: cycle 1
Instruction status

Instruction          j     k     Issue   Execution complete   Write Result
LD     F6            34+   R2    1
LD     F2            45+   R3
MULTD  F0            F2    F4
SUBD   F8            F6    F2
DIVD   F10           F0    F6
ADDD   F6            F8    F2

Reservation Stations

Time   Name   Busy   Op   Vj (S1)   Vk (S2)   Qj (RS for j)   Qk (RS for k)
0      A1     No
0      A2     No
0      A3     No
0      M1     No
0      M2     No

Load buffers

Name   Busy   Address
L1     yes    34+R2
L2     No
L3     No

Register result status

      F0   F2   F4   F6   F8   F10   ...
FU                   L1
Example: cycle 3
Instruction status

Instruction          j     k     Issue   Execution complete   Write Result
LD     F6            34+   R2    1       3
LD     F2            45+   R3    2
MULTD  F0            F2    F4    3
SUBD   F8            F6    F2
DIVD   F10           F0    F6
ADDD   F6            F8    F2

Reservation Stations

Time   Name   Busy   Op    Vj (S1)   Vk (S2)   Qj (RS for j)   Qk (RS for k)
0      A1     No
0      A2     No
0      A3     No
0      M1     Yes    Mul             R(F4)     L2
0      M2     No

Load buffers

Name   Busy   Address
L1     yes    34+R2
L2     yes    45+R3
L3     No

Register result status

      F0   F2   F4   F6   F8   F10   ...
FU    M1   L2        L1

- Mul is issued vs. scoreboard
- What’s waiting for L1?
Example…
• Check the web site…
• Too much for in-class
• Summary:
– Execution proceeds in any order that does not
violate RAW dependences
– WAR and WAW are removed
Tomasulo’s vs. Scoreboard
Tomasulo’s:

Instruction          j     k     Issue   Execution complete   Write Result
LD     F6            34+   R2    1       3                    4
LD     F2            45+   R3    2       4                    5
MULTD  F0            F2    F4    3       15                   16
SUBD   F8            F6    F2    4       7                    8
DIVD   F10           F0    F6    5       56                   57
ADDD   F6            F8    F2    6       10                   11

Scoreboard:

Instruction          j     k     Issue   Read operands   Execution complete   Write Result
LD     F6            34+   R2    1       2               3                    4
LD     F2            45+   R3    5       6               7                    8
MULTD  F0            F2    F4    6       9               19                   20
SUBD   F8            F6    F2    7       9               11                   12
DIVD   F10           F0    F6    8       21              61                   62
ADDD   F6            F8    F2    13      14              16                   22

- In-order issue
- Out-of-order execution
- Out-of-order completion
Tomasulo’s
• Out-of-order loads and stores?
– What about WAW, RAW and WAR?
– Compare all load addresses against the addresses of all
preceding store buffers
– Stall if they match
• CDB is a bottleneck
– One write per cycle
– Could duplicate
– But, come at a cost
– Datapath + duplicated tags and control
• Complex Implementation
– Scalability?
– All results to all sources
– What if we want 128 instrs?
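
The load/store check described at the top of this slide can be sketched in a few lines. `load_may_issue` is a hypothetical name, and the sketch assumes that the addresses of all preceding stores are already known.

```python
def load_may_issue(load_addr, preceding_store_addrs):
    """Return False (stall) if any preceding store buffer holds the load's address."""
    return load_addr not in preceding_store_addrs


store_buffers = [0x100, 0x140]                       # addresses of preceding stores
no_match = load_may_issue(0x120, store_buffers)      # no conflict: the load may go ahead
match = load_may_issue(0x140, store_buffers)         # same address: the load stalls
print(no_match, match)                               # True False
```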
Tomasulo’s
• Advantages
– Distribution of hazard detection
– Elimination of WAR and WAW stalls
• Common Data Bus
– Broadcasts result to multiple instrs (+)
– Bottleneck
• Register Renaming
– Removes WAR and WAW hazards
– More interesting when same code appears twice
• Think of loops
• More on this later
– BUT: Associative lookups
– RECALL: direct map is faster
In Summary
Feature       Scoreboarding (CDC6600)     Tomasulo’s (IBM 360)
Structural    Stall in Issue for FU       Stall in Dispatch for RS; Stall in RS for FU
RAW           Via Registers               From CDB
WAR           Stall in WB                 Copy Value to RS
WAW           Stall in Issue              Register Renaming
Logic         Centralized                 Distributed
Bottlenecks   No Register Bypass;         One CDB
              Stall in issue block