ppt - Computer Science Division - University of California, Berkeley

advertisement
CS252
Graduate Computer Architecture
Lecture 7
Dynamic Scheduling 2:
Precise Interrupts
February 9th, 2010
John Kubiatowicz
Electrical Engineering and Computer Sciences
University of California, Berkeley
http://www.eecs.berkeley.edu/~kubitron/cs252
Registers
FP Mult
FP Mult
FP Divide
FP Add
Integer
SCOREBOARD
2/9/2011
CS252-S11, lecture 7
Functional Units
Review: Scoreboard (CDC 6600)
Memory
2
Review: Four Stages of Scoreboard Control
• Issue—decode instructions & check for structural hazards
– Instructions issued in program order (for hazard checking)
– Don’t issue if structural hazard
– Don’t issue if instruction is output dependent on any previously issued but
uncompleted instruction (no WAW hazards)
• Read operands—wait operands ready, then read them
– All real dependencies (RAW hazards) resolved in this stage
– No forwarding of data in this model!
• Execution—operate on operands
– The functional unit begins execution upon receiving operands. When the
result is ready, it notifies the scoreboard that it has completed execution.
• Write result—finish execution
– Stall if WAR hazards with previous instructions:
DIVD
F0,F2,F4
ADDD
F10,F0,F8
SUBD
F8,F8,F14
CDC 6600 scoreboard would stall SUBD until ADDD reads operands
Example:
2/9/2011
CS252-S11, lecture 7
3
Review: Tomasulo Organization
FP Registers
From Mem
FP Op
Queue
Load Buffers
Load1
Load2
Load3
Load4
Load5
Load6
Store
Buffers
Add1
Add2
Add3
Mult1
Mult2
FP adders
Reservation
Stations
To Mem
FP multipliers
Common Data Bus (CDB)
2/9/2011
CS252-S11, lecture 7
4
Review: Three Stages of Tomasulo Algorithm
1. Issue—get instruction from FP Op Queue
If reservation station free (no structural hazard),
control issues instr & sends operands (renames registers).
2. Execution—operate on operands (EX)
When both operands ready then execute;
if not ready, watch Common Data Bus for result
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting units;
mark reservation station available
• Normal data bus: data + destination (“go to” bus)
• Common data bus: data + source (“come from” bus)
– 64 bits of data + 4 bits of Functional Unit source address
– Write if matches expected Functional Unit (produces result)
– Does the broadcast
2/9/2011
CS252-S11, lecture 7
5
Review: Compare to Scoreboard Cycle 62
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
Read Exec Write
k Issue Oper Comp Result
R2
R3
F4
F2
F6
F2
1
5
6
7
8
13
2
6
9
9
21
14
3
7
19
11
61
16
4
8
20
12
62
22
Exec Write
Issue ComplResult
1
2
3
4
5
6
3
4
15
7
56
10
4
5
16
8
57
11
• Why take longer on scoreboard/6600?
• Structural Hazards
• Lack of forwarding
2/9/2011
CS252-S11, lecture 7
6
Review: Loop Example Cycle 9
Instruction status:
ITER Instruction
1
1
1
2
2
2
LD
MULTD
SD
LD
MULTD
SD
F0
F4
F4
F0
F4
F4
j
k
0
F0
0
0
F0
0
R1
F2
R1
R1
F2
R1
1
2
3
6
7
8
9
Vj
S1
Vk
S2
Qj
Reservation Stations:
Time
Exec Write
Issue CompResult
Name Busy Op
Add1
No
Add2
No
Add3
No
Mult1 Yes Multd
Mult2 Yes Multd
RS
Qk
R(F2) Load1
R(F2) Load2
Busy Addr
Fu
Load1
Load2
Load3
Store1
Store2
Store3
Yes
Yes
No
Yes
Yes
No
80
72
80
72
Mult1
Mult2
Code:
LD
MULTD
SD
SUBI
BNEZ
F0
F4
F4
R1
R1
0
F0
0
R1
Loop
R1
F2
R1
#8
...
F30
Register result status
Clock
9
R1
72
F0
Fu Load2
F2
F4
F6
F8
F10 F12
Mult2
• Dataflow graph constructed completely in hardware
• Renaming detaches early iterations from registers
2/9/2011
CS252-S11, lecture 7
7
Explicit Register Renaming
• Tomasulo provides Implicit Register Renaming
– User registers renamed to reservation station tags
• Explicit Register Renaming:
– Use physical register file that is larger than number of registers specified by ISA
• Keep a translation table:
– ISA register => physical register mapping
– When register is written, replace table entry with new register from freelist.
– Physical register becomes free when not being used by any instructions in
progress.
• Pipeline can be exactly like “standard” DLX pipeline
– IF, ID, EX, etc….
• Advantages:
–
–
–
–
2/9/2011
Removes all WAR and WAW hazards
Like Tomasulo, good for allowing full out-of-order completion
Allows data to be fetched from a single register file
Makes speculative execution/precise interrupts easier:
» All that needs to be “undone” for precise break point
is to undo the table mappings
CS252-S11, lecture 7
8
Registers
FP Mult
FP Mult
FP Divide
FP Add
Integer
SCOREBOARD
Functional Units
Question: Can we use explicit
register renaming with scoreboard?
Memory
Rename
Table
2/9/2011
CS252-S11, lecture 7
9
Scoreboard Example
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
Functional unit status:
Op
dest
Fi
S1
Fj
S2
Fk
Register Rename and Result
Clock
F0 F2
F4
F6
F8 F10 F12
P4
P6
Time Name
Int1
Int2
Mult1
Add
Divide
FU
Busy
FU
Qj
FU
Qk
Fj?
Rj
Fk?
Rk
...
F30
No
No
No
No
No
P0
P2
P8
P10
P12
P30
• Initialized Rename Table
2/9/2011
CS252-S11, lecture 7
10
Renamed Scoreboard 1
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Busy
Op
dest
Fi
Yes
No
No
No
No
Load
P32
Register Rename and Result
Clock
F0 F2
1
FU
P0
P2
S1
Fj
S2
Fk
FU
Qj
FU
Qk
Fj?
Rj
R2
F4
F6
P4
P32
Yes
F8 F10 F12
P8
Fk?
Rk
P10
P12
...
F30
P30
• Each instruction allocates free register
• Similar to single-assignment
compiler transformation
2/9/2011
CS252-S11, lecture 7
11
Renamed Scoreboard 2
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
2
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Busy
Op
dest
Fi
Yes
Yes
No
No
No
Load
Load
P32
P34
Register Rename and Result
Clock
F0 F2
2
2/9/2011
FU
P0
P34
S1
Fj
S2
Fk
FU
Qj
FU
Qk
Fj?
Rj
R2
R3
F4
F6
P4
P32
CS252-S11, lecture 7
Yes
Yes
F8 F10 F12
P8
Fk?
Rk
P10
P12
...
F30
P30
12
Renamed Scoreboard 3
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
2
3
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Busy
Op
dest
Fi
Yes
Yes
Yes
No
No
Load
Load
Multd
P32
P34
P36
P34
F4
F6
P4
P32
Register Rename and Result
Clock
F0 F2
3
2/9/2011
FU
3
P36
P34
S1
Fj
S2
Fk
R2
R3
P4
CS252-S11, lecture 7
Fj?
Rj
Fk?
Rk
Int2
No
Yes
Yes
Yes
F8 F10 F12
...
F30
P8
FU
Qj
P10
FU
Qk
P12
P30
13
Renamed Scoreboard 4
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
2
3
3
4
4
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
Yes
Yes
Yes
No
Load
Multd
Sub
P34
P36
P38
P34
P32
R3
P4
P34
F4
F6
F8 F10 F12
P4
P32
P38
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
4
2/9/2011
FU
P36
P34
CS252-S11, lecture 7
FU
Qj
FU
Qk
Int2
Int2
P10
P12
Fj?
Rj
Fk?
Rk
No
Yes
Yes
Yes
No
...
F30
P30
14
Renamed Scoreboard 5
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
2
3
3
4
4
5
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
No
Yes
Yes
Yes
Multd
Sub
Divd
P36
P38
P40
P34
P32
P36
P4
P34
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
5
2/9/2011
FU
P36
P34
Fj?
Rj
Fk?
Rk
Mult1
Yes
Yes
No
Yes
Yes
Yes
F6
F8 F10 F12
...
F30
P32
P38
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
P30
15
Renamed Scoreboard 6
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
2
3
6
6
Functional unit status:
Time Name
Int1
Int2
10 Mult1
2 Add
Divide
2/9/2011
FU
4
5
S1
Fj
S2
Fk
Busy
Op
dest
Fi
No
No
Yes
Yes
Yes
Multd
Sub
Divd
P36
P38
P40
P34
P32
P36
P4
P34
P32
Mult1
Yes
Yes
No
F4
F6
F8 F10 F12
...
P4
P32
P38
Register Rename and Result
Clock
F0 F2
6
3
4
P36
P34
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
Yes
F30
P30
16
Renamed Scoreboard 7
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
2
3
6
6
Functional unit status:
Time Name
Int1
Int2
9 Mult1
1 Add
Divide
2/9/2011
FU
4
5
S1
Fj
S2
Fk
Busy
Op
dest
Fi
No
No
Yes
Yes
Yes
Multd
Sub
Divd
P36
P38
P40
P34
P32
P36
P4
P34
P32
Mult1
Yes
Yes
No
F4
F6
F8 F10 F12
...
P4
P32
P38
Register Rename and Result
Clock
F0 F2
7
3
4
P36
P34
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
Yes
F30
P30
17
Renamed Scoreboard 8
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
2
3
6
6
Functional unit status:
Time Name
Int1
Int2
8 Mult1
0 Add
Divide
2/9/2011
FU
4
5
8
Busy
Op
dest
Fi
No
No
Yes
Yes
Yes
Multd
Sub
Divd
P36
P38
P40
P34
P32
P36
P4
P34
P32
Mult1
Yes
Yes
No
F4
F6
F8 F10 F12
...
P4
P32
P38
Register Rename and Result
Clock
F0 F2
8
3
4
P36
P34
S1
Fj
S2
Fk
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
Yes
F30
P30
18
Renamed Scoreboard 9
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
2
3
6
6
Functional unit status:
Time Name
Int1
Int2
7 Mult1
Add
Divide
2/9/2011
FU
4
5
8
9
S1
Fj
S2
Fk
Busy
Op
dest
Fi
No
No
Yes
No
Yes
Multd
P36
P34
P4
Divd
P40
P36
P32
Mult1
No
Yes
F4
F6
F8 F10 F12
...
F30
P4
P32
P38
Register Rename and Result
Clock
F0 F2
9
3
4
P36
P34
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
P30
19
Renamed Scoreboard 10
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
Functional unit status:
Time Name
Int1
Int2
6 Mult1
Add
Divide
Busy
No
No
Yes
Yes
Yes
Op
FU
P36
4
5
8
9
dest
Fi
S1
Fj
S2
Fk
FU
Qj
FU
Qk
Fj?
Rj
Fk?
Rk
WAR Hazard gone!
Multd
Addd
Divd
Register Rename and Result
Clock
F0 F2
10
3
4
P34
P36
P42
P40
P34
P38
P36
P4
P34
P32
Mult1
Yes
Yes
No
Yes
Yes
Yes
F4
F6
F8 F10 F12
...
F30
P4
P42
P38
P40
P12
P30
• Notice that P32 not listed in Rename Table
– Still live. Must not
be reallocated by accident
2/9/2011
CS252-S11, lecture 7
20
Renamed Scoreboard 11
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
2/9/2011
FU
8
9
S1
Fj
S2
Fk
Busy
Op
dest
Fi
No
No
Yes
Yes
Yes
Multd
Addd
Divd
P36
P42
P40
P34
P38
P36
P4
P34
P32
Mult1
Yes
Yes
No
F4
F6
F8 F10 F12
...
P4
P42
P38
Register Rename and Result
Clock
F0 F2
11
4
5
11
Functional unit status:
Time Name
Int1
Int2
5 Mult1
2 Add
Divide
3
4
P36
P34
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
Yes
F30
P30
21
Renamed Scoreboard 12
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
2/9/2011
FU
8
9
S1
Fj
S2
Fk
Busy
Op
dest
Fi
No
No
Yes
Yes
Yes
Multd
Addd
Divd
P36
P42
P40
P34
P38
P36
P4
P34
P32
Mult1
Yes
Yes
No
F4
F6
F8 F10 F12
...
P4
P42
P38
Register Rename and Result
Clock
F0 F2
12
4
5
11
Functional unit status:
Time Name
Int1
Int2
4 Mult1
1 Add
Divide
3
4
P36
P34
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
Yes
Yes
Yes
F30
P30
22
Renamed Scoreboard 13
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
3
4
4
5
8
9
11
13
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
No
Yes
Yes
Yes
Multd
Addd
Divd
P36
P42
P40
P34
P38
P36
P4
P34
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
3 Mult1
0 Add
Divide
Register Rename and Result
Clock
F0 F2
13
2/9/2011
FU
P36
P34
Fj?
Rj
Fk?
Rk
Mult1
Yes
Yes
No
Yes
Yes
Yes
F6
F8 F10 F12
...
F30
P42
P38
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
P30
23
Renamed Scoreboard 14
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
3
4
4
5
8
9
11
13
14
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
No
Yes
No
Yes
Multd
P36
P34
P4
Divd
P40
P36
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
2 Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
14
2/9/2011
FU
P36
P34
Fj?
Rj
Fk?
Rk
Yes
Yes
Mult1
No
Yes
F6
F8 F10 F12
...
F30
P42
P38
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
P30
24
Renamed Scoreboard 15
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
3
4
4
5
8
9
11
13
14
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
No
Yes
No
Yes
Multd
P36
P34
P4
Divd
P40
P36
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
1 Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
15
2/9/2011
FU
P36
P34
Fj?
Rj
Fk?
Rk
Yes
Yes
Mult1
No
Yes
F6
F8 F10 F12
...
F30
P42
P38
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
P30
25
Renamed Scoreboard 16
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
3
4
16
8
4
5
11
13
14
Busy
Op
dest
Fi
S1
Fj
S2
Fk
No
No
Yes
No
Yes
Multd
P36
P34
P4
Divd
P40
P36
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
0 Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
16
2/9/2011
FU
P36
P34
9
Fj?
Rj
Fk?
Rk
Yes
Yes
Mult1
No
Yes
F6
F8 F10 F12
...
F30
P42
P38
CS252-S11, lecture 7
FU
Qj
P40
FU
Qk
P12
P30
26
Renamed Scoreboard 17
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
3
4
16
8
4
5
17
9
11
13
14
Busy
Op
dest
Fi
S1
Fj
S2
Fk
FU
Qj
No
No
No
No
Yes
Divd
P40
P36
P32
F4
P4
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
Divide
Register Rename and Result
Clock
F0 F2
17
2/9/2011
FU
P36
P34
Fj?
Rj
Fk?
Rk
Mult1
Yes
Yes
F6
F8 F10 F12
...
F30
P42
P38
CS252-S11, lecture 7
P40
FU
Qk
P12
P30
27
Renamed Scoreboard 18
Instruction status:
Instruction
LD
F6
LD
F2
MULTD F0
SUBD
F8
DIVD
F10
ADDD
F6
j
34+
45+
F2
F6
F0
F8
k
R2
R3
F4
F2
F6
F2
Read Exec Write
Issue Oper Comp Result
1
2
3
4
5
10
2
3
6
6
18
11
Functional unit status:
Time Name
Int1
Int2
Mult1
Add
40 Divide
2/9/2011
FU
4
5
17
9
13
14
S1
Fj
S2
Fk
FU
Qj
Busy
Op
dest
Fi
No
No
No
No
Yes
Divd
P40
P36
P32
Mult1
Yes
Yes
F4
F6
F8 F10 F12
...
F30
P4
P42
P38
Register Rename and Result
Clock
F0 F2
18
3
4
16
8
P36
P34
CS252-S11, lecture 7
P40
FU
Qk
P12
Fj?
Rj
Fk?
Rk
P30
28
Explicit Renaming Support Includes:
• Rapid access to a table of translations
• A physical register file that has more registers than
specified by the ISA
• Ability to figure out which physical registers are
free.
– No free registers  stall on issue
• Thus, register renaming doesn’t require reservation
stations. However:
– Many modern architectures use explicit register renaming +
Tomasulo-like reservation stations to control execution.
2/9/2011
CS252-S11, lecture 7
29
How many instructions can be in
the pipeline?
Which features of an ISA limit the number of
instructions in the pipeline?
Number of Registers
Which features of a program limit the number of
instructions in the pipeline?
Control transfers
Out-of-order dispatch by itself does not provide
any significant performance improvement !
2/9/2011
CS252-S11, lecture 7
30
How important is renaming?
Consider execution without it
1
LD
F2,
34(R2)
2
LD
F4,
45(R3)
latency
1
2
4
3
long
3
MULTD
F6,
F4,
F2
3
4
SUBD
F8,
F2,
F2
1
5
DIVD
F4,
F2,
F8
4
6
ADDD
F10,
F6,
F4
1
In-order:
Out-of-order:
1
5
6
1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
1 (2,1) 4 4 . . . . 2 3 . . 3 5 . . . 5 6 6
Out-of-order execution did not allow any significant improvement!
2/9/2011
CS252-S11, lecture 7
31
Little’s Law
Throughput (T) = Number in Flight (N) / Latency (L)
Issue
Execution
WB
Example:
• 4 floating point registers
• 8 cycles per floating point operation
 maximum of ½ issue per cycle without renaming!
2/9/2011
CS252-S11, lecture 7
32
Instruction-level Parallelism via Renaming
1
LD
F2,
34(R2)
2
LD
F4,
45(R3)
latency
1
2
4
3
long
3
MULTD
F6,
F4,
F2
3
4
SUBD
F8,
F2,
F2
1
5
DIVD
F4’,
F2,
F8
4
6
ADDD
F10,
F6,
F4’
1
In-order:
Out-of-order:
1
X
5
6
1 (2,1) . . . . . . 2 3 4 4 3 5 . . . 5 6 6
1 (2,1) 4 4 5 . . . 2 (3,5) 3 6 6
Any antidependence can be eliminated by renaming.
(renaming  additional storage)
Can be done either in Software or Hardware
2/9/2011
CS252-S11, lecture 7
33
Administrative
• Midterm I: Wednesday 3/14
Location: 320 Soda Hall
TIME: 2:30-5:30
– Can have 1 sheet of 8½x11 handwritten notes – both sides
– No microfiche of the book!
• This info is on the Lecture page
• Meet at LaVal’s afterwards for Pizza and Beverages
– Great way for me to get to know you better
– I’ll Buy!
• Don’t quite have new set of papers up. Will do that by
tomorrow at the latest….
2/9/2011
CS252-S11, lecture 7
34
Data-Flow Architectures
• Basic Idea: Hardware respresents direct encoding of compiler
dataflow graphs:
Input: a,b
y:= (a+b)/x
x:= (a*(a+b))+b
output: y,x
B
A
+
• Data flows along arcs in
“Tokens”.
• When two tokens arrive at
compute box, box “fires” and
produces new token.
• Split operations produce copies
of tokens
*
/
+
X(0)
Y
2/9/2011
CS252-S11, lecture 7
X
35
Paper by Dennis and Misunas
Operation
Unit 0
Operation
Unit m-1
Instruction
Operand 1
Operand 2
Operation
Packet
Data Packets
Instruction Cell
“Reservation Station?”
2/9/2011
CS252-S11, lecture 7
Instruction
Cell 0
Instruction
Cell 1
Memory
Instruction
Cell n-1
36
What about Precise
Exceptions/Interrupts?
• Both Scoreboard and Tomasulo have:
– In-order issue, out-of-order execution, out-of-order completion
• Recall: An interrupt or exception is precise if there is a
single instruction for which:
– All instructions before that have committed their state
– No following instructions (including the interrupting
instruction) have modified any state.
• Need way to resynchronize execution with instruction
stream (I.e. with issue-order)
– Easiest way is with in-order completion (i.e. reorder buffer)
– Other Techniques (Smith paper): Future File, History Buffer
2/9/2011
CS252-S11, lecture 7
37
Exception Handling
Commit
Point
(In-Order Five-Stage Pipeline)
PC
Select
Handler
PC
Inst.
Mem
PC Address
Exceptions
Kill F
Stage
•
•
•
•
2/9/2011
D
Decode
E
Illegal
Opcode
+
M
Overflow
Data
Mem
Data Addr
Except
W
Kill
Writeback
Exc
D
Exc
E
Exc
M
Cause
PC
D
PC
E
PC
M
EPC
Kill D
Stage
Kill E
Stage
Asynchronous
Interrupts
Hold exception flags in pipeline until commit point (M stage)
Exceptions in earlier pipe stages override later exceptions
Inject external interrupts at commit point (override others)
If exception at commit: update Cause and EPC registers, kill
all stages, inject handler PC into fetch stage
CS252-S11, lecture 7
38
Complex In-Order Pipeline: Precise Exceptions
PC
Inst.
Mem
D
Decode
• Delay writeback so all
operations have same latency
to W stage
GPRs
FPRs
X1
+
X1
– Write ports never oversubscribed
(one inst. in & one inst. out every cycle)
– Instructions commit in order, simplifies
precise exception implementation
X2
Data
Mem
X3
Xn
W
X2
Fadd
X3
Xn
W
X2
• How to prevent increase latency for
single-cycle ops?
– Bypassing
– However: can be very expensive
Fmul
X3
Xn
X3
Xn
Commit
Point
Unpipelined
FDiv X2
divider
• Other downside: no out-of-order
execution
2/9/2011
CS252-S11, lecture 7
39
In-Order Commit for Precise Exceptions
In-order
Fetch
Out-of-order
Reorder Buffer
Decode
Kill
In-order
Commit
Kill
Kill
Execute
Inject handler PC
Exception?
• Instructions fetched and decoded into instruction
reorder buffer in-order
• Execution is out-of-order (  out-of-order completion)
• Commit (write-back to architectural state, i.e., regfile &
memory) is in-order
Temporary storage needed to hold results before commit
(shadow registers and store buffers)
2/9/2011
CS252-S11, lecture 7
40
Program Counter
Valid
Exceptions?
Result
Reorder Table
FP
Op
Queue
Res Stations
Compar network
Dest Reg
What are the hardware complexities
with reorder buffer (ROB)?
Reorder
Buffer
FP Regs
Res Stations
FP Adder
FP Adder
• How do you find the latest version of a register?
– As specified by Smith paper, need associative comparison network
– Could use future file or just use the register result status buffer to track which
specific reorder buffer has received the value
• Need as many ports on ROB as register file
2/9/2011
CS252-S11, lecture 7
41
Four Steps of Speculative Tomasulo
1. Issue—get instruction from FP Op Queue
If reservation station and reorder buffer slot free, issue instr & send
operands & reorder buffer no. for destination (this stage sometimes
called “dispatch”)
2. Execution—operate on operands (EX)
When both operands ready then execute; if not ready, watch CDB for
result; when both in reservation station, execute; checks RAW
(sometimes called “issue”)
3. Write result—finish execution (WB)
Write on Common Data Bus to all awaiting FUs
& reorder buffer; mark reservation station available.
4. Commit—update register with reorder result
When instr. at head of reorder buffer & result present, update
register with result (or store to memory) and remove instr from
reorder buffer. Mispredicted branch flushes reorder buffer
(sometimes called “graduation”)
2/9/2011
CS252-S11, lecture 7
42
Tomasulo With Reorder buffer:
FP Op
Queue
Done?
ROB7
ROB6
Newest
ROB5
Reorder Buffer
ROB4
ROB3
ROB2
F0
LD F0,10(R2)
Registers
Dest
2/9/2011
ROB1
Oldest
To
Memory
from
Memory
Dest
FP adders
N
Reservation
Stations
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
43
Tomasulo With Reorder buffer:
FP Op
Queue
Done?
ROB7
ROB6
Newest
ROB5
Reorder Buffer
ROB4
ROB3
F10
F0
ADDD F10,F4,F0
LD F0,10(R2)
Registers
Dest
2 ADDD R(F4),ROB1
FP adders
2/9/2011
N
N
ROB2
ROB1
Oldest
To
Memory
from
Memory
Dest
Reservation
Stations
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
44
Tomasulo With Reorder buffer:
FP Op
Queue
Done?
ROB7
ROB6
Newest
ROB5
Reorder Buffer
ROB4
F2
F10
F0
DIVD F2,F10,F6
ADDD F10,F4,F0
LD F0,10(R2)
Registers
Dest
2 ADDD R(F4),ROB1
FP adders
2/9/2011
N
N
N
ROB3
ROB2
ROB1
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
from
Memory
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
45
Tomasulo With Reorder buffer:
FP Op
Queue
ROB7
Reorder Buffer
F0
F4
-F2
F10
F0
ADDD F0,F4,F6
LD F4,0(R3)
BNE F2,<…>
DIVD F2,F10,F6
ADDD F10,F4,F0
LD F0,10(R2)
Registers
Dest
2 ADDD R(F4),ROB1
6 ADDD ROB5, R(F6)
FP adders
2/9/2011
Done?
N
N
N
N
N
N
ROB6
Newest
ROB5
ROB4
ROB3
ROB2
ROB1
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
FP multipliers
CS252-S11, lecture 7
from
Memory
Dest
1 10+R2
6 0+R3
46
Tomasulo With Reorder buffer:
FP Op
Queue
Reorder Buffer
-- ROB5
F0
F4
-F2
F10
F0
Done?
ST 0(R3),F4
N ROB7
ADDD F0,F4,F6
N ROB6
LD F4,0(R3)
N ROB5
BNE F2,<…>
N ROB4
DIVD F2,F10,F6 N ROB3
ADDD F10,F4,F0 N ROB2
LD F0,10(R2)
N ROB1
Registers
Dest
2 ADDD R(F4),ROB1
6 ADDD ROB5, R(F6)
FP adders
2/9/2011
Newest
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
FP multipliers
CS252-S11, lecture 7
from
Memory
Dest
1 10+R2
6 0+R3
47
Tomasulo With Reorder buffer:
FP Op
Queue
Reorder Buffer
-- M[10]
F0
F4 M[10]
-F2
F10
F0
Done?
ST 0(R3),F4
Y ROB7
ADDD F0,F4,F6
N ROB6
LD F4,0(R3)
Y ROB5
BNE F2,<…>
N ROB4
DIVD F2,F10,F6 N ROB3
ADDD F10,F4,F0 N ROB2
LD F0,10(R2)
N ROB1
Registers
Dest
2 ADDD R(F4),ROB1
6 ADDD M[10],R(F6)
FP adders
2/9/2011
Newest
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
from
Memory
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
48
Tomasulo With Reorder buffer:
FP Op
Queue
Reorder Buffer
Done?
-- M[10] ST 0(R3),F4
Y ROB7
F0 <val2> ADDD F0,F4,F6 Ex ROB6
F4 M[10] LD F4,0(R3)
Y ROB5
-BNE F2,<…>
N ROB4
F2
DIVD F2,F10,F6 N ROB3
F10
ADDD F10,F4,F0 N ROB2
F0
LD F0,10(R2)
N ROB1
Registers
Dest
2 ADDD R(F4),ROB1
FP adders
2/9/2011
Newest
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
from
Memory
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
49
Tomasulo With Reorder buffer:
FP Op
Queue
Reorder Buffer
What about memory
hazards???
Done?
-- M[10] ST 0(R3),F4
Y ROB7
F0 <val2> ADDD F0,F4,F6 Ex ROB6
F4 M[10] LD F4,0(R3)
Y ROB5
-BNE F2,<…>
N ROB4
F2
DIVD F2,F10,F6 N ROB3
F10
ADDD F10,F4,F0 N ROB2
F0
LD F0,10(R2)
N ROB1
Registers
Dest
2 ADDD R(F4),ROB1
FP adders
2/9/2011
Newest
Oldest
To
Memory
Dest
3 DIVD ROB2,R(F6)
Reservation
Stations
from
Memory
Dest
1 10+R2
FP multipliers
CS252-S11, lecture 7
50
Memory Disambiguation:
Sorting out RAW Hazards in memory
• Question: Given a load that follows a store in program
order, are the two related?
– (Alternatively: is there a RAW hazard between the store and the load)?
Eg:
st
ld
0(R2),R5
R6,0(R3)
• Can we go ahead and start the load early?
– Store address could be delayed for a long time by some calculation that
leads to R2 (divide?).
– We might want to issue/begin execution of both operations in same cycle.
– Today: Answer is that we are not allowed to start load until we know that
address 0(R2)  0(R3)
– Next Week: We might guess at whether or not they are dependent (called
“dependence speculation”) and use reorder buffer to fixup if we are
wrong.
2/9/2011
CS252-S11, lecture 7
51
Hardware Support for Memory
Disambiguation
• Need buffer to keep track of all outstanding stores to
memory, in program order.
– Keep track of address (when becomes available) and value (when
becomes available)
– FIFO ordering: will retire stores from this buffer in program order
• When issuing a load, record current head of store queue
(know which stores are ahead of you).
• When have address for load, check store queue:
– If any store prior to load is waiting for its address, stall load.
– If load address matches earlier store address (associative lookup), then
we have a memory-induced RAW hazard:
» store value available  return value
» store value not available  return ROB number of source
– Otherwise, send out request to memory
• Actual stores commit in order, so no worry about
WAR/WAW hazards through memory.
2/9/2011
CS252-S11, lecture 7
52
Memory Disambiguation:
Done?
FP Op
Queue
ROB7
ROB6
Newest
ROB5
Reorder Buffer
-F2 R[F5]
F0
-- <val 1>
LD
ST
LD
ST
F4, 10(R3)
10(R3), F5
F0,32(R2)
0(R3), F4
Registers
Dest
2/9/2011
ROB4
ROB3
ROB2
ROB1
Oldest
To
Memory
from
Memory
Dest
FP adders
N
N
N
Y
Reservation
Stations
FP multipliers
CS252-S11, lecture 7
Dest
2 32+R2
4 ROB3
53
Relationship between precise
interrupts and speculation:
• Speculation is a form of guessing
– Branch prediction, data prediction
– If we speculate and are wrong, need to back up and restart execution to
point at which we predicted incorrectly
– This is exactly same as precise exceptions!
• Branch prediction is a very important!
– Need to “take our best shot” at predicting branch direction.
– If we issue multiple instructions per cycle, lose lots of potential
instructions otherwise:
» Consider 4 instructions per cycle
» If take single cycle to decide on branch, waste from 4 - 7 instruction
slots!
• Technique for both precise interrupts/exceptions and
speculation: in-order completion or commit
– This is why reorder buffers in all new processors
2/9/2011
CS252-S11, lecture 7
54
Quick Recap:
Explicit Register Renaming
• Make use of a physical register file that is larger than
number of registers specified by ISA
• Keep a translation table:
– ISA register => physical register mapping
– When register is written, replace table entry with new register from
freelist.
– Physical register becomes free when not being used by any
instructions in progress.
Fetch
Decode/
Rename
Execute
Rename
Table
2/9/2011
CS252-S11, lecture 7
55
Explicit register renaming:
R10000 Freelist Management
P0 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Done?
Current Map Table
Newest
P32 P34 P36 P38
 P60 P62
Freelist
Oldest
• Physical register file larger than ISA register file
• On issue, each instruction that modifies a register is
allocated new physical register from freelist
• Used on: R10000, Alpha 21264, HP PA8000
2/9/2011
CS252-S11, lecture 7
56
Explicit register renaming:
R10000 Freelist Management
P32 P2 P4 F6 F8 P10 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Done?
Current Map Table
Newest
P34 P36 P38 P40
Freelist
 P60 P62
F0 P0 LD P32,10(R2)
N
Oldest
• Note that physical register P0 is “dead” (or not “live”)
past the point of this load.
– When we go to commit the load, we free up
2/9/2011
CS252-S11, lecture 7
57
Explicit register renaming:
R10000 Freelist Management
P32 P2 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Done?
Current Map Table
Newest
P36 P38 P40 P42
Freelist
2/9/2011
 P60 P62
F10 P10 ADDD P34,P4,P32 N
F0 P0 LD P32,10(R2)
N
CS252-S11, lecture 7
Oldest
58
Explicit register renaming:
R10000 Freelist Management
P32 P36 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Done?
Current Map Table
--
P38 P40 P44 P48
 P60 P62
Freelist
-F2 P2
F10 P10
F0 P0
Newest
BNE P36,<…>
DIVD P36,P34,P6
ADDD P34,P4,P32
LD P32,10(R2)
N
N
N
N
Oldest
P32 P36 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
P38 P40 P44 P48
2/9/2011
 P60 P62
Checkpoint at BNE instruction
CS252-S11, lecture 7
59
Explicit register renaming:
R10000 Freelist Management
P40 P36 P38 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Current Map Table
P42 P44 P48 P50

P0 P10
Freelist
-F0 P32
F4 P4
-F2 P2
F10 P10
F0 P0
Done?
ST 0(R3),P40
Y
Newest
ADDD P40,P38,P6 Y
LD P38,0(R3)
Y
BNE P36,<…>
N
DIVD P36,P34,P6 N
ADDD P34,P4,P32 y
Oldest
LD P32,10(R2)
y
P32 P36 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
P38 P40 P44 P48
2/9/2011
 P60 P62
Checkpoint at BNE instruction
CS252-S11, lecture 7
60
Explicit register renaming:
R10000 Freelist Management
P32 P36 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
Done?
Current Map Table
Newest
P38 P40 P44 P48

P0 P10
Freelist
F2 P2 DIVD P36,P34,P6 N
F10 P10 ADDD P34,P4,P32 y
F0 P0 LD P32,10(R2)
y
Oldest
Error fixed by restoring map table and merging freelist
P32 P36 P4 F6 F8 P34 P12 P14 P16 P18 P20 P22 P24 p26 P28 P30
P38 P40 P44 P48
2/9/2011
 P60 P62
Checkpoint at BNE instruction
CS252-S11, lecture 7
61
Advantages of Explicit Renaming
• Decouples renaming from scheduling:
– Pipeline can be exactly like “standard” DLX pipeline (perhaps with
multiple operations issued per cycle)
– Or, pipeline could be tomasulo-like or a scoreboard, etc.
– Standard forwarding or bypassing could be used
• Allows data to be fetched from single register file
– No need to bypass values from reorder buffer
– This can be important for balancing pipeline
• Many processors use a variant of this technique:
– R10000, Alpha 21264, HP PA8000
• Another way to get precise interrupt points:
– All that needs to be “undone” for precise break point
is to undo the table mappings
– Provides an interesting mix between reorder buffer and future file
» Results are written immediately back to register file
» Registers names are “freed” in program order (by ROB)
2/9/2011
CS252-S11, lecture 7
62
Superscalar Register Renaming
• During decode, instructions allocated new physical destination register
• Source operands renamed to physical register with newest value
• Execution unit only sees physical register numbers
Inst 1
Op
Write
Ports
Update
Mapping
Dest Src1 Src2
Op
Dest Src1 Src2
Read Addresses
Register
Free List
Rename Table
Read Data
Op
PDest PSrc1 PSrc2
Inst 2
Op
PDest PSrc1 PSrc2
Does this work?
2/9/2011
CS252-S11, lecture 7
63
Superscalar Register Renaming (Try #2)
Update
Mapping
Op Dest Src1 Src2
Write
Ports
Inst 1
Op Dest Src1 Src2 Inst 2
Read Addresses
=?
Rename Table
Read Data
Must check for
RAW hazards
between
instructions
issuing in same
cycle. Can be
done in parallel
with rename
Op PDest PSrc1 PSrc2
lookup.
=?
Register
Free List
Op PDest PSrc1 PSrc2
MIPS R10K renames 4 serially-RAW-dependent insts/cycle
2/9/2011
CS252-S11, lecture 7
64
Summary
• DataFlow view:
– Data triggers execution rather than instructions triggering data
• Dynamic hardware schemes can unroll loops
dynamically in hardware
– Form of limited dataflow
– Register renaming is essential
• Explicit Renaming: more physical registers than needed
by ISA.
– Rename table: tracks current association between architectural
registers and physical registers
– Uses a translation table to perform compiler-like
transformation on the fly
• Precise Interrupts:
– Must commit things back in order
– Reorder buffer: temporarily holds results until commit possible
– Toss out things to achieve precise interrupt point
2/9/2011
CS252-S11, lecture 7
65
Download