Dynamic Instruction Scheduling
(Example)
High Performance Computer Architecture
http://www.dii.unisi.it/~giorgi/teaching/hpca2
Roberto Giorgi, Universita' degli Studi di Siena, C216ES01--SL
Example
loop: r3         <- mem(r4+r2)    # load b(i)
      r7         <- mem(r5+r2)    # load c(i)
      r7         <- r7 * r3       # b(i) * c(i)
      r1         <- r1 - 1        # decr. counter
      mem(r6+r2) <- r7            # store a(i)
      r2         <- r2 + 8        # bump index
      P          <- loop; r1!=0   # close loop
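As a reference for the cycle-by-cycle trace, the loop above computes a(i) = b(i) * c(i). The following is a minimal Python sketch of its sequential semantics, assuming that r4, r5 and r6 hold the base addresses of b, c and a, that r2 is a byte offset, and that r1 is the trip counter. The function name `run_loop` and the dict-based memory are illustrative choices, not part of the slides.

```python
def run_loop(mem, r1, r2, r4, r5, r6):
    """Run the example loop on a dict-based memory (address -> value)."""
    while True:
        r3 = mem[r4 + r2]      # load b(i)
        r7 = mem[r5 + r2]      # load c(i)
        r7 = r7 * r3           # b(i) * c(i)
        r1 = r1 - 1            # decr. counter
        mem[r6 + r2] = r7      # store a(i)
        r2 = r2 + 8            # bump index
        if r1 == 0:            # close loop: branch back only while r1 != 0
            break
    return mem


# Two iterations; 13 and 11 are the values the slides assume for the first loads.
mem = {1000: 13, 1008: 2, 2000: 11, 2008: 5}
run_loop(mem, r1=2, r2=0, r4=1000, r5=2000, r6=3000)
```

With these inputs the first iteration stores 13 * 11 at address 3000, which is the value the trace carries through to the store.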
CYCLE 1
[Figure: pipeline snapshot at cycle 1. Stages: DISPATCH, ISSUE, WRITE-BACK. Blocks: I-cache access (CIP, NIP), Decode, A RS and Integer unit, M RS and multiplier (Mult. 1-4), LS RS with address adder, LQ, SQ, D-Cache, Regs, Common Data Bus. Tables: register status (Reg 0-7 with Q and V) and reservation stations (A1-A3, M1-M2, LS1-LS3, Id 1-8, with Busy, Op, Vj, Vk, Qj, Qk). The first load occupies a load/store station (Busy = 1, Op = load, Vj = 1000).]
CYCLE 2
[Figure: pipeline snapshot at cycle 2, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables. Two load/store stations are busy with loads (Vj = 1000 and Vj = 2000).]
CYCLE 3
[Figure: pipeline snapshot at cycle 3, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables. The mult sits in a multiplier station with tags 6 and 7 in its Q fields; the two loads are in the load/store stations.]
- The first load completes: let's assume that it reads '13'
CYCLE 4
[Figure: pipeline snapshot at cycle 4, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables.]
- The first load writes the value 13 on the CDB
- The sub is dispatched
- The second load is issued
- The mult can't be issued until it gets Qj=Qk=0
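The wake-up rule used here (an entry may be issued only when Qj = Qk = 0, and a CDB broadcast fills every operand whose tag matches) can be sketched as follows. This is a minimal illustration, not the slides' implementation: the class name `RSEntry` and the method names are hypothetical, and the tags 6 and 7 for the two loads are taken from the Id numbering of the reservation-station table (A1..LS3 = 1..8).

```python
from dataclasses import dataclass


@dataclass
class RSEntry:
    op: str
    vj: int = 0
    vk: int = 0
    qj: int = 0   # 0 means operand j is already available in vj
    qk: int = 0   # 0 means operand k is already available in vk

    def ready(self):
        """An entry can be issued only when both tags are 0."""
        return self.qj == 0 and self.qk == 0

    def snoop(self, tag, value):
        """Capture a CDB broadcast if it matches a pending operand tag."""
        if tag == 0:          # 0 is reserved for "no producer"
            return
        if self.qj == tag:
            self.vj, self.qj = value, 0
        if self.qk == tag:
            self.vk, self.qk = value, 0


# r7 <- r7 * r3: r7 comes from the second load (tag 7), r3 from the first (tag 6).
mult = RSEntry("mult", qj=7, qk=6)
mult.snoop(6, 13)                  # first load broadcasts 13
ready_after_first = mult.ready()   # still waiting on tag 7
mult.snoop(7, 11)                  # second load broadcasts 11
```

After the second broadcast both tags are 0, so the mult holds 11 and 13 and can be issued.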
CYCLE 5
[Figure: pipeline snapshot at cycle 5, with the register-status (Q, V), reservation-station (Busy, Op, Vj, Vk, Qj, Qk) and store-queue (A, Q, V) tables. The figure marks a "conflict on the CDB"; the SQ entry has Q = 4.]
- The second load completes; let's assume it reads '11'
- The mult waits and the sub is issued
- The store is dispatched
- At the same time, one entry is allocated in the SQ
- The sub is going to conflict with the load on the CDB, so it will have to wait
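The CDB conflict can be pictured as an arbiter that grants the single bus to one producer per cycle. The sketch below assumes an oldest-first (FIFO) policy with the load listed ahead of the sub; that priority rule is an assumption, since the slides only show that the load gets the bus and the sub waits. The function name `arbitrate_cdb` and the producer labels are hypothetical.

```python
from collections import deque


def arbitrate_cdb(requests_per_cycle):
    """Grant the single CDB to one producer per cycle; the others keep waiting."""
    pending = deque()
    grants = []
    for cycle, requests in requests_per_cycle:
        pending.extend(requests)                      # new results wanting the bus
        winner = pending.popleft() if pending else None
        grants.append((cycle, winner))
    return grants


# Cycle 6: the second load and the sub both want the CDB; the add asks in cycle 8.
grants = arbitrate_cdb([(6, ["load2", "sub"]), (7, []), (8, ["add"])])
```

Under this policy the load writes in cycle 6, the sub in cycle 7 and the add in cycle 8, which matches the order of the write-backs in the trace.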
CYCLE 6
[Figure: pipeline snapshot at cycle 6, with the register-status (Q, V), reservation-station (Busy, Op, Vj, Vk, Qj, Qk) and store-queue (A, Q, V) tables. The SQ entry has A = 3000 and Q = 4.]
- The second load writes the value 11 on the CDB
- The mult is issued, and the sub is waiting for the CDB
- The store is issued: in the SQ it gets the effective address A
- but it can't advance while its Q tag is nonzero (Q != 0)
- The add is dispatched
CYCLE 7
[Figure: pipeline snapshot at cycle 7, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables.]
- The mult proceeds and the store waits
- The sub completes and writes '99' on the CDB, updating r1
- The add is issued
- The branch is dispatched
CYCLE 8
[Figure: pipeline snapshot at cycle 8, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables.]
- The mult proceeds and the store waits
- The add writes on the CDB
- The branch is issued
CYCLE 9
[Figure: pipeline snapshot at cycle 9, with the register-status (Q, V) and reservation-station (Busy, Op, Vj, Vk, Qj, Qk) tables.]
- The mult completes and calculates 13*11=143
- The store waits
- The branch completes
CYCLE 10
[Figure: pipeline snapshot at cycle 10, with the register-status (Q, V), reservation-station (Busy, Op, Vj, Vk, Qj, Qk) and store-queue (A, Q, V) tables. The SQ entry has A = 3000, Q = 0 and V = 143.]
- The mult writes the value '143' on the CDB
- The store gets the value '143' and can finally complete
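The store's behaviour across cycles 5 to 10 can be summarized with a store-queue entry holding the three fields shown in the figures (A, Q, V). The sketch below is a simplified model under stated assumptions: the class name `SQEntry`, its methods and the dict-based memory are hypothetical, and tag 4 for the mult follows the Id numbering of the reservation-station table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SQEntry:
    a: int                     # effective address
    q: int                     # tag of the producer of the data (0 = data present)
    v: Optional[int] = None    # data to store

    def snoop(self, tag, value):
        """Capture the data from the CDB when the awaited tag is broadcast."""
        if self.q != 0 and self.q == tag:
            self.v, self.q = value, 0

    def try_complete(self, mem):
        """Write to memory only once the data has arrived (Q == 0)."""
        if self.q != 0:
            return False
        mem[self.a] = self.v
        return True


mem = {}
entry = SQEntry(a=3000, q=4)        # address known, data still owed by the mult
stalled = entry.try_complete(mem)   # cycles 6-9: the store waits
entry.snoop(4, 143)                 # cycle 10: the mult broadcasts 143
done = entry.try_complete(mem)
```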
CYCLE 11
[Figure: pipeline snapshot at cycle 11, with the register-status (Q, V), reservation-station (Busy, Op, Vj, Vk, Qj, Qk), LQ and SQ tables. Every Busy entry shown in the reservation-station table is 0.]
Tomasulo: Summary
• Reservation Stations
  • Allow "out-of-order issue" based on the availability of data
    (e.g. the sub and the add are issued without waiting for the mult)
• Register Renaming (tags)
  + Avoids the WAR and WAW hazards
    Especially important when few registers are available
    (as originally in the IBM 360)
  + Realizes a dynamic "loop unrolling"
  - Requires relatively complex logic
• Common Data Bus
  + Broadcasts each result simultaneously to all the instructions waiting for it
  - It's a "bottleneck", but it can be replicated
    (at the cost of more hardware, of course)
• The scheme does not handle "precise exceptions"
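How the tags remove a WAW hazard can be seen on r7, which is written by both the second load and the mult. The sketch below is a simplified register-status model, not the slides' hardware: the class `RegFile` and its method names are hypothetical, and the tags (7 for the load in LS2, 4 for the mult in M1) follow the Id numbering of the reservation-station table.

```python
class RegFile:
    def __init__(self, values):
        self.v = dict(values)                # register values
        self.q = {r: 0 for r in values}      # 0 = value valid, else producer tag

    def read(self, reg):
        """Return (value, tag); tag 0 means the value can be used right away."""
        return (self.v[reg], 0) if self.q[reg] == 0 else (None, self.q[reg])

    def rename(self, reg, tag):
        """At dispatch, record the newest producer of reg."""
        self.q[reg] = tag

    def write_back(self, tag, value):
        """A result updates a register only if it is still the newest producer."""
        assert tag != 0, "0 is reserved for 'no producer'"
        for reg, q in self.q.items():
            if q == tag:
                self.v[reg], self.q[reg] = value, 0


regs = RegFile({"r3": 0, "r7": 0})
regs.rename("r7", 7)           # r7 <- mem(r5+r2), load with tag 7
operand = regs.read("r7")      # the mult captures tag 7 as its source
regs.rename("r7", 4)           # r7 <- r7 * r3, mult with tag 4
regs.write_back(7, 11)         # the load result no longer targets r7
after_load = regs.read("r7")
regs.write_back(4, 143)
final = regs.read("r7")
```

The load's value reaches the mult through the tag it captured, while the register itself only ever receives the result of the last writer.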
Tomasulo: hazard management summary
Hazard                             Management method
Structural on RS (RS finite)       Stall in the Dispatch stage (*1)
Structural on CDB (CDB occupied)   Stall in the Issue stage (*2)
Structural on FU (FU occupied)     Stall in the Issue stage (*3)
RAW                                Resolved by using the tags (the instruction waits in the RS for its operands)
WAR                                Avoided by copying operands into the RS at dispatch time
WAW                                Avoided by register renaming (tags)
(*1) avoidable with a larger number of RSs
(*2) avoidable with a larger number of CDBs
(*3) avoidable with multiple FUs (or reducible with pipelined FUs)
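The first row of the table (a stall in Dispatch when no RS is free) can be sketched as a simple allocation check. This is only an illustration: the function name `dispatch` and the dict of Busy flags are hypothetical, and the two multiplier stations M1 and M2 are taken from the reservation-station table of the example.

```python
def dispatch(rs_busy, kind):
    """Allocate a free station of the given kind; return None to signal a stall."""
    for name, busy in rs_busy.items():
        if name.startswith(kind) and not busy:
            rs_busy[name] = True     # mark the station as Busy
            return name
    return None                      # structural hazard on RS: Dispatch stalls


rs_busy = {"M1": False, "M2": False}
first = dispatch(rs_busy, "M")
second = dispatch(rs_busy, "M")
third = dispatch(rs_busy, "M")       # no multiplier station left
```

With more stations of that kind the third request would succeed, which is the point of note (*1).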
Reservation Station -- implementation
dispatch: move to res. station
issue: move to functional unit
[Figure: reservation-station logic. A register entry (Qi, value, Busy) and a reservation-station entry (RS No., Op, Vj, Qj, Vk, Qk, Busy), with the k operand handled like the j operand. Each operand has a comparator on the CDB tag and a MUX that loads Vj/Vk from the CDB data; "=0?" checks on Qj and Qk are ANDed into a "ready" signal to the issue logic, and a Busy flip-flop (set/clr) sends "busy" to the dispatch logic. Op, Vj and Vk go to the functional unit. Note on the figure: the CDB tag arrives 1 cycle before the data.]
General organization of IBM 360/91 pipeline
• “In-order” pipeline with the following stages:
• I-fetch, decode, address generation
• Floating point decoupled from the Integer (Fixed Point)
through memory buffers
• Effective-address generation done in the integer unit
• A memory pipeline for loading the data
IBM 360/91 -- Floating Point Unit
From: R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM Journal of Research and Development, Jan. 1967, pp. 25-33