1) Consider the following MIPS assembly code.
    LD    R1, 45(R2)
    DADD  R7, R1, R5
    DSUB  R8, R1, R6
    DADD  R9, R5, R1
    BNEZ  R7, target
    DADD  R10, R8, R5
    DSUB  R2, R3, R4
a) Identify each dependence by type; list the two instructions
involved; identify which instruction is dependent; and, if there is
one, name the storage location involved.
b) Use information about the MIPS five-stage pipeline from Appendix A, and
assume a register file that writes in the first half of the clock cycle and
reads in the second half, with no forwarding other than through the register
file. Which of the dependences that you found in part (a) become hazards and
which do not? Why?
Solution:
a) Dependences:

Instruction          Type of Dependence                             Storage Location
LD   R1, 45(R2)      (no earlier instruction to depend on)          --
DADD R7, R1, R5      RAW (true) dependence: DADD depends on LD      R1
DSUB R8, R1, R6      RAW (true) dependence: DSUB depends on LD      R1
DADD R9, R5, R1      RAW (true) dependence: DADD depends on LD      R1
BNEZ R7, target      RAW (true) dependence: BNEZ depends on DADD    R7
DADD R10, R8, R5     RAW (true) dependence and control dependence:  R8, R7
                     DADD depends on DSUB and on the branch
DSUB R2, R3, R4      Control dependence: DSUB depends on the        R7
                     branch
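The RAW entries in the table can be found mechanically by tracking the last writer of each register. The sketch below is a hypothetical helper (not part of the assignment): it encodes the seven instructions as (destinations, sources) pairs and reports every RAW pair. Control dependences are not covered by this scan.

```python
# Hypothetical helper: find RAW dependences by remembering each register's
# last writer. The encoding mirrors the instruction sequence in part (a).
instrs = [
    ("LD   R1, 45(R2)",  ["R1"],  ["R2"]),
    ("DADD R7, R1, R5",  ["R7"],  ["R1", "R5"]),
    ("DSUB R8, R1, R6",  ["R8"],  ["R1", "R6"]),
    ("DADD R9, R5, R1",  ["R9"],  ["R5", "R1"]),
    ("BNEZ R7, target",  [],      ["R7"]),
    ("DADD R10, R8, R5", ["R10"], ["R8", "R5"]),
    ("DSUB R2, R3, R4",  ["R2"],  ["R3", "R4"]),
]

last_writer = {}   # register -> index of the instruction that last wrote it
raw = []           # (producer index, consumer index, register)
for idx, (_text, dsts, srcs) in enumerate(instrs):
    for reg in srcs:
        if reg in last_writer:
            raw.append((last_writer[reg], idx, reg))
    for reg in dsts:
        last_writer[reg] = idx

for prod, cons, reg in raw:
    print(f"{instrs[cons][0]}  RAW-depends on  {instrs[prod][0]}  via {reg}")
```

This prints the same five RAW pairs listed in the table: three consumers of R1 produced by LD, BNEZ on DADD via R7, and DADD on DSUB via R8.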
b) Using the MIPS five-stage pipeline, we get the following schedule for our
instructions. Assumption: the only "forwarding" is through the register file,
which writes in the first half of a cycle and reads in the second half, so a
consumer can decode in the same cycle its producer writes back.

Instruction         1   2   3   4   5   6   7   8   9   10  11  12  13  14
LD   R1, 45(R2)     IF  ID  EX  MEM WB
DADD R7, R1, R5         IF  S   S   ID  EX  MEM WB
DSUB R8, R1, R6                     IF  ID  EX  MEM WB
DADD R9, R5, R1                         IF  ID  EX  MEM WB
BNEZ R7, target                             IF  ID  EX  MEM WB
DADD R10, R8, R5                                IF  S   ID  EX  MEM WB
DSUB R2, R3, R4                                         IF  ID  EX  MEM WB
Data dependences that became hazards:
1) LD and DADD, RAW dependence -- see the two stall cycles in the table.
2) BNEZ and DADD, a control dependence that becomes a hazard because the
branch must be resolved before the fetch address of the next instruction to
be executed is known.
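The 14-cycle total can be cross-checked with a small issue-cycle model. This is a sketch under the stated assumptions only: no forwarding except through the split-cycle register file (a consumer's ID may coincide with its producer's WB) and a one-cycle control stall after the branch; WB falls three cycles after ID.

```python
# Minimal issue-cycle model for the five-stage schedule above.
# Each entry: (destination registers, source registers, is_branch).
instrs = [
    (["R1"],  ["R2"],       False),  # LD   R1, 45(R2)
    (["R7"],  ["R1", "R5"], False),  # DADD R7, R1, R5
    (["R8"],  ["R1", "R6"], False),  # DSUB R8, R1, R6
    (["R9"],  ["R5", "R1"], False),  # DADD R9, R5, R1
    ([],      ["R7"],       True),   # BNEZ R7, target
    (["R10"], ["R8", "R5"], False),  # DADD R10, R8, R5
    (["R2"],  ["R3", "R4"], False),  # DSUB R2, R3, R4
]

wb = {}        # register -> WB cycle of its last writer
next_id = 2    # instruction 0 is fetched in cycle 1, decoded in cycle 2
total = 0
for dsts, srcs, is_branch in instrs:
    t = next_id
    for reg in srcs:
        if reg in wb:
            t = max(t, wb[reg])     # may read in the producer's WB cycle
    for reg in dsts:
        wb[reg] = t + 3             # ID -> EX -> MEM -> WB
    total = max(total, t + 3)
    next_id = t + 1 + (1 if is_branch else 0)  # control stall after branch

print(total)   # 14, matching the schedule above
```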
2) Construct a version of the table that we have in class for a 1/1
predictor, assuming the 1-bit predictors are initialized to NT, the
correlation bit is initialized to T, and the value of d (leftmost column of
the table) alternates 0, 1, 2, 0, 1, 2. Also, note and count the number of
instances of misprediction.
Solution:
Consider the sequence where d alternates 0, 1, 2, 0, 1, 2, with the 1-bit
predictors initialized to NT and the correlation bit initialized to T. Each
prediction entry is written as prediction-if-last-branch-NT /
prediction-if-last-branch-T; an asterisk (*) marks a misprediction.

d=?                  0       1       2       0       1       2
B1 prediction        NT/NT   NT/NT   T/NT    T/NT    T/NT    T/NT
B1 action            NT      T       T       NT      T       T
New B1 prediction    NT/NT   T/NT*   T/NT    T/NT    T/NT    T/NT
B2 prediction        NT/NT   NT/NT   NT/NT   NT/T    NT/T    NT/NT
B2 action            NT      NT      T       NT      NT      T
New B2 prediction    NT/NT   NT/NT   NT/T*   NT/T    NT/NN*  NT/T*

Total mispredictions: 4
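The table can be reproduced by simulating the predictor directly. The sketch below assumes the branches come from the classic classroom example `if (d == 0) d = 1; if (d == 1) ...`, so B1 is taken when d != 0 and B2 is taken when d != 1 (this assumption is consistent with the action rows above). Each branch keeps two 1-bit predictors, selected by the outcome of the most recently executed branch.

```python
# (1,1) correlating predictor sketch, under the stated assumptions:
# 1-bit predictors start at NT (False), the correlation bit starts at T (True).
# Assumed branch semantics: B1 taken iff d != 0; if d == 0 then d becomes 1;
# B2 taken iff d != 1.
def mispredictions(d_values):
    b1 = [False, False]   # [prediction if last branch NT, if last branch T]
    b2 = [False, False]
    last = True           # correlation bit, initialized to T
    misses = 0
    for d in d_values:
        for bits, actual in ((b1, d != 0), (b2, (1 if d == 0 else d) != 1)):
            slot = 1 if last else 0
            if bits[slot] != actual:  # 1-bit predictor: flip on a miss
                misses += 1
                bits[slot] = actual
            last = actual
    return misses

print(mispredictions([0, 1, 2, 0, 1, 2]))   # 4, matching the table
```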
3) Increasing the size of a branch-prediction buffer makes it less likely
that two branches in a program will share the same predictor. A single
predictor predicting a single branch instruction is generally more accurate
than the same predictor serving more than one branch instruction.
a) List a sequence of branch taken and not taken actions to show a
simple example of 1-bit predictor sharing that reduces misprediction
rate.
b) List a sequence of branch taken and not taken actions to show a
simple example of 1-bit predictor sharing that increases misprediction
rate.
c) Discuss why the sharing of branch predictors can be expected to
increase mispredictions for the long instruction execution sequences of
actual programs.
Solution:
a) Consider two branches B1 and B2, executed alternately, each of which
alternates between TAKEN and NOT TAKEN. The next table shows the shared
predictor value and the mispredictions. Each branch on its own 1-bit
predictor would mispredict every time, so sharing the predictor improves
accuracy from 0% to roughly 50%.

Branch   Action   P (after)   Correct Prediction?
(init)            NT
B1       T        T           No
B2       NT       NT          No
B1       NT       NT          Yes
B2       T        T           No
B1       T        T           Yes
B2       NT       NT          No
B1       NT       NT          Yes
B2       T        T           No
b) For this part, consider two branches B1 and B2 where B1 is always TAKEN
and B2 is always NOT TAKEN, executed in the same alternating pattern as in
part (a). If each branch had its own 1-bit predictor, each would be predicted
correctly every time. Because they share a predictor, the accuracy of the
predictions drops to 0%, as the table below shows.

Branch   Action   P (after)   Correct Prediction?
(init)            NT
B1       T        T           No
B2       NT       NT          No
B1       T        T           No
B2       NT       NT          No
B1       T        T           No
B2       NT       NT          No
B1       T        T           No
B2       NT       NT          No
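Both tables follow from a few lines of simulation. The sketch below runs a single shared 1-bit predictor (initialized to NT) over the two branch sequences from parts (a) and (b) and counts the correct predictions.

```python
def shared_1bit(seq):
    """Run one shared 1-bit predictor (init NT = False) over a list of
    (branch, taken) pairs; return the number of correct predictions."""
    p = False
    correct = 0
    for _branch, taken in seq:
        if p == taken:
            correct += 1
        p = taken   # 1-bit predictor: remember the most recent outcome
    return correct

# Part (a): B1 and B2 alternate, and each branch alternates taken/not-taken.
a = [("B1", True), ("B2", False), ("B1", False), ("B2", True)] * 2
# Part (b): B1 is always taken, B2 is always not taken.
b = [("B1", True), ("B2", False)] * 4

print(shared_1bit(a), "of", len(a))   # 3 of 8 correct (~50% in steady state)
print(shared_1bit(b), "of", len(b))   # 0 of 8 correct
```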
c) In general terms, if a predictor is shared by a set of branch
instructions, then over the course of program execution the membership of
that set is very likely to change. When a new branch enters the set or an old
one leaves it, the branch-action history represented by the state of the
predictor is unlikely to predict the new set's behavior as well as it did
before. Thus, the transient intervals that follow set changes will tend to
reduce the long-term prediction accuracy of shared predictors.
4) Consider the following loop.

bar:
    L.D    F2, 0(R1)
    MUL.D  F4, F2, F0
    L.D    F6, 0(R2)
    ADD.D  F6, F4, F6
    S.D    F6, 0(R2)
    ADDI   R1, R1, #8
    ADDI   R2, R2, #8
    ADDI   R3, R3, #-8
    BNEZ   R3, bar
a) Assume a single-issue pipeline. Show how the loop would look both
unscheduled by the compiler and after compiler scheduling for both
floating-point operation and branch delays, including any stall or idle clock
cycles. What is the execution time per iteration of the result, unscheduled
and scheduled? How much faster must the clock be for processor hardware alone
to match the performance improvement achieved by the scheduling compiler?
(Neglect any effect of the higher processor clock speed on memory system
performance.)
Unscheduled:

Instruction            Clock Cycle
L.D    F2, 0(R1)       1
(stall)                2
MUL.D  F4, F2, F0      3
L.D    F6, 0(R2)       4
(stall)                5
(stall)                6
ADD.D  F6, F4, F6      7
(stall)                8
(stall)                9
S.D    F6, 0(R2)       10
ADDI   R1, R1, #8      11
ADDI   R2, R2, #8      12
ADDI   R3, R3, #-8     13
(stall)                14
BNEZ   R3, bar         15
(stall)                16

Total execution time: 16 clock cycles.
Scheduled:

Instruction            Clock Cycle
L.D    F2, 0(R1)       1
L.D    F6, 0(R2)       2
MUL.D  F4, F2, F0      3
ADDI   R1, R1, #8      4
ADDI   R3, R3, #-8     5
ADDI   R2, R2, #8      6
ADD.D  F6, F4, F6      7
(stall)                8
BNEZ   R3, bar         9
S.D    F6, -8(R2)      10

Total execution time: 10 clock cycles.
How much faster must the clock be to get this improvement using hardware
alone? 16 cycles / 10 cycles = 1.6, so the clock must be 60% faster to match
the performance of the scheduled code on the original hardware.
b) Assume a single-issue pipeline. Unroll the loop as many times as
necessary to schedule it without any stalls, collapsing the loop
overhead instructions. How many times must the loop be unrolled?
Show the instruction schedule. What is the execution time per element of the
result? What is the major contribution to the reduction in time per
iteration?
Solution:
For this problem, we can unroll the loop twice and schedule it to avoid all
stalls. Unrolling more than twice could improve performance further, but the
goal here is to find the smallest unrolling that yields a stall-free
schedule.
Instruction              Clock Cycle
L.D    F2, 0(R1)         1
L.D    F8, 8(R1)         2
L.D    F6, 0(R2)         3
L.D    F12, 8(R2)        4
MUL.D  F4, F2, F0        5
MUL.D  F10, F8, F0       6
*ADDI  R1, R1, #16       7
*ADDI  R3, R3, #-16      8
ADD.D  F6, F4, F6        9
ADD.D  F12, F10, F12     10
*ADDI  R2, R2, #16       11
S.D    F6, -16(R2)       12
*BNEZ  R3, bar           13
S.D    F12, -8(R2)       14

(Note that because the pointer updates are scheduled early, R3 decrements by
16 for the two elements, and the stores use negative offsets from the already
incremented R2.)
In this exercise we have produced 2 results in 14 cycles, which works out to
7 clock cycles per element. The major advantage of unrolling is that 4
instructions are eliminated compared with not unrolling: the loop-overhead
instructions of one of the two iterations (the instructions marked with * in
the table above now appear only once). Additionally, the unrolled loop body
is more amenable to scheduling, which allows the stall cycle present in the
scheduled original loop to be eliminated.
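The three versions can be summarized numerically; the arithmetic below just restates the cycle counts worked out above.

```python
# Cycles per element for the three versions of the loop worked out above.
unscheduled = 16 / 1   # 16 cycles, 1 element per iteration
scheduled = 10 / 1     # 10 cycles, 1 element per iteration
unrolled = 14 / 2      # 14 cycles, 2 elements per unrolled iteration

print(unscheduled, scheduled, unrolled)                       # 16.0 10.0 7.0
print(f"scheduling speedup: {unscheduled / scheduled:.2f}x")  # 1.60x
print(f"unrolling speedup:  {unscheduled / unrolled:.2f}x")   # 2.29x
```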