A sample final with solutions

CIS 662 Final
Name:_________________________
Points:___________/100
1. (30 points)
For a given loop and a standard MIPS pipeline (EX stage for ADD.D takes 4 cycles, EX
stage for any other instruction takes 1 cycle, branches are resolved in ID stage):
Loop: L.D F2, 100(R1)
L.D F4, 500(R1)
ADD.D F6, F2, F4
S.D F6, 100(R1)
DADDUI R1, R1, #4
DADDUI R2, R2,#-1
BNEZ R2, Loop
a) (10 points) How many cycles per iteration does this loop take? How many stalls are
there?
L.D F2, 100(R1)
L.D F4, 500(R1)
stall
ADD.D F6, F2, F4
stall
stall
S.D F6, 100(R1)
DADDUI R1, R1, #4
DADDUI R2, R2,#-1
stall
BNEZ R2, Loop
stall
12 cycles, 5 stalls
b) (10 points) Reorder the instructions to minimize stalls. You may change offsets if
necessary. How many cycles per iteration does the modified loop take, and how many
stalls are left?
L.D F2, 100(R1)
L.D F4, 500(R1)
DADDUI R2, R2,#-1
ADD.D F6, F2, F4
DADDUI R1, R1, #4
BNEZ R2, Loop
S.D F6, 96(R1)
7 cycles, no stalls
c) (10 points) Unroll the loop twice (so that there is a total of two iterations in the
unrolled code) and rearrange the code so that there are no remaining stalls. How
many cycles does one iteration of the original loop take now?
L.D F2, 100(R1)
L.D F4, 500(R1)
L.D F8, 104(R1)
L.D F10, 504(R1)
ADD.D F6, F2, F4
ADD.D F12, F8, F10
DADDUI R2, R2,#-2
DADDUI R1, R1, #8
S.D F6, 92(R1)
BNEZ R2, Loop
S.D F12, 96(R1)
11 cycles for two original iterations, no stalls: 11/2 = 5.5 cycles per iteration
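The cycle counts in parts a) to c) can be cross-checked with a small Python sketch. The stall table below is inferred from the solutions above (one stall between a load and a dependent ADD.D, two between ADD.D and a dependent S.D, one between DADDUI and a dependent BNEZ, one after a branch whose delay slot is empty). It is an assumption of this sketch, not a general pipeline model; it ignores structural hazards and treats the FP adder as pipelined. The name count_cycles and the tuple encoding of instructions are hypothetical.

```python
# Stalls required between a producer and a dependent consumer when they are
# adjacent. Inferred from the worked solution; an assumption of this sketch.
GAP = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("DADDUI", "BNEZ"): 1,
}

def count_cycles(program):
    """program: list of (opcode, dest or None, [source registers]).

    Returns (cycles per pass through the loop body, number of stalls).
    """
    issue = []        # issue cycle of each instruction
    last_writer = {}  # register -> index of its most recent writer
    for idx, (op, dest, srcs) in enumerate(program):
        cycle = issue[-1] + 1 if issue else 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                gap = GAP.get((program[p][0], op), 0)
                cycle = max(cycle, issue[p] + 1 + gap)
        issue.append(cycle)
        if dest is not None:
            last_writer[dest] = idx
    total = issue[-1]
    if program[-1][0] == "BNEZ":
        total += 1    # branch delay slot left empty
    return total, total - len(program)

part_a = [
    ("L.D", "F2", ["R1"]), ("L.D", "F4", ["R1"]),
    ("ADD.D", "F6", ["F2", "F4"]), ("S.D", None, ["F6", "R1"]),
    ("DADDUI", "R1", ["R1"]), ("DADDUI", "R2", ["R2"]),
    ("BNEZ", None, ["R2"]),
]
part_b = [
    ("L.D", "F2", ["R1"]), ("L.D", "F4", ["R1"]),
    ("DADDUI", "R2", ["R2"]), ("ADD.D", "F6", ["F2", "F4"]),
    ("DADDUI", "R1", ["R1"]), ("BNEZ", None, ["R2"]),
    ("S.D", None, ["F6", "R1"]),   # fills the branch delay slot
]
part_c = [
    ("L.D", "F2", ["R1"]), ("L.D", "F4", ["R1"]),
    ("L.D", "F8", ["R1"]), ("L.D", "F10", ["R1"]),
    ("ADD.D", "F6", ["F2", "F4"]), ("ADD.D", "F12", ["F8", "F10"]),
    ("DADDUI", "R2", ["R2"]), ("DADDUI", "R1", ["R1"]),
    ("S.D", None, ["F6", "R1"]), ("BNEZ", None, ["R2"]),
    ("S.D", None, ["F12", "R1"]),  # fills the branch delay slot
]

print(count_cycles(part_a))  # (12, 5)
print(count_cycles(part_b))  # (7, 0)
print(count_cycles(part_c))  # (11, 0)
```

Dividing the 11 cycles of the unrolled body by its two original iterations gives the 5.5 cycles stated in part c).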
2. (20 points)
Explain what speculation is and how it is implemented in Tomasulo’s algorithm. You
must specify the resources used by Tomasulo’s algorithm to keep track of speculative
instructions, how those resources are used, and what happens if the speculative decision
was wrong.
Speculation is the execution of the code following a branch before the outcome of the
branch is known. The code is fetched either sequentially or from the branch target, using
branch prediction and the branch target buffer. To execute speculatively in a safe way, we
must be able to undo executed instructions if the branch was mispredicted. This means that
no instruction following a branch may write to memory or registers until the outcome of
the branch is known. In Tomasulo’s algorithm this is implemented with a reorder buffer. The
results of all instructions are written to the reorder buffer, and a new stage, commit, is
added to all instructions. During the commit stage, results are transferred from the reorder
buffer to memory or registers. Instructions must commit in order. When a branch commits,
if the prediction was correct, all instructions following the branch will be able to commit
in later cycles. If the prediction was incorrect, the reorder buffer is flushed and thus,
effectively, the instructions are undone.
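The commit and flush behavior described above can be illustrated with a minimal Python sketch. It models only the reorder buffer as an in-order queue; reservation stations, memory, and timing are left out, and the names RobEntry and commit_all are hypothetical, introduced just for this illustration.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class RobEntry:
    dest: str = ""             # destination register ("" for a branch)
    value: int = 0             # result held in the reorder buffer
    is_branch: bool = False
    mispredicted: bool = False

def commit_all(rob, regs):
    """Commit entries in program order; flush on a mispredicted branch."""
    while rob:
        entry = rob.popleft()
        if entry.is_branch:
            if entry.mispredicted:
                rob.clear()    # discard every speculative result behind it
                break
        else:
            regs[entry.dest] = entry.value  # result becomes architecturally visible
    return regs

regs = {"R1": 0, "R2": 0}
rob = deque([
    RobEntry(dest="R1", value=5),                 # before the branch
    RobEntry(is_branch=True, mispredicted=True),  # the branch itself
    RobEntry(dest="R2", value=9),                 # speculative, must be undone
])
print(commit_all(rob, regs))  # {'R1': 5, 'R2': 0}
```

Because the speculative write to R2 lived only in the reorder buffer, flushing the buffer is enough to undo it.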
3. (10 points)
Explain what a (3,2) correlating predictor looks like: how many bits it has per
branch and how it can be used to predict a branch outcome. How does this predictor
change states?
A (3,2) predictor uses 2-bit entries, and the entry consulted depends on the outcomes of the
3 previous branches. The predictor thus has 2^3 = 8 slots per branch, with 2 bits in each
slot (16 bits per branch). It would look like this in its initial state:
00/00/00/00/00/00/00/00
The leftmost slot would be checked for prediction if the three previous branches
were all not taken, the slot next to it would be checked if the first two branches were not
taken and the last one was taken, etc. The two bits in the chosen slot change their state
based on the state diagram of a two-bit predictor. The diagram looks like this:
[Figure: state diagram of a two-bit predictor. States: 11 (predict taken), 10 (predict
taken), 01 (predict not taken), 00 (predict not taken). Each state has one outgoing edge
labeled Taken and one labeled Not taken.]
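The slot selection and state update can be sketched in Python as follows. This is a sketch under stated assumptions: each slot is modeled as a saturating two-bit counter (one common form of the two-bit state diagram; the exact transitions in the figure may differ), the newest branch outcome is kept in the lowest history bit so that slot 0 corresponds to three not-taken branches, and the class name Correlating32 is hypothetical.

```python
class Correlating32:
    """(3,2) correlating predictor for a single branch."""

    def __init__(self):
        self.slots = [0] * 8   # eight 2-bit counters, all initially 00
        self.history = 0       # last three outcomes, newest in bit 0

    def predict(self):
        # States 10 and 11 predict taken; 00 and 01 predict not taken.
        return self.slots[self.history] >= 2

    def update(self, taken):
        # Assumed saturating-counter transitions for the selected slot.
        c = self.slots[self.history]
        self.slots[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the outcome into the 3-bit history.
        self.history = ((self.history << 1) | int(taken)) & 0b111

p = Correlating32()
print(p.predict())   # False: slot 0 (history not taken x3) holds 00
for _ in range(5):
    p.update(True)
print(p.history)     # 7: the last three branches were all taken
print(p.predict())   # True: slot 7 has been incremented twice, to 10
```

Only the slot selected by the current history is updated, so each of the eight history patterns learns its own prediction.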
4. (40 points)
You have a two-level cache with the following specifications:
- L1 is 64KB direct-mapped, write-through, no-write-allocate with 32B blocks,
  hit time of 1ns and miss rate of 5%
- L2 is 1MB 4-way set-associative, write-back, write-allocate with 128B blocks,
  and on average 30% of blocks are dirty. Hit time is 20ns and miss rate is 40%
- Penalty to go from L2 to memory is 60ns (this stays the same regardless of
  how large a chunk of data we are reading from or writing to memory)
- It costs 1ns to transfer one CPU word between L1 and L2. One CPU word is 4B.
We are considering making L2 fully associative. This would increase the hit time to
21ns but would reduce the L2 miss rate. How large must the miss rate reduction be
(relative to the original miss rate) for this to pay off for writes? Hint:
compare AMATwrite of the original and the modified processor.
AMATread = hitL1 + mrL1*(hitL2+mrL2*penaltyL2)
hitL1 = 1ns
mrL1=0.05
hitL2 = 20ns + 32/4*1ns = 28ns
mrL2 = 0.4
penaltyL2 = (1+0.3)*60ns = 78ns
AMATread = 1 + 0.05*(28+0.4*78) = 3.96ns
AMATwrite = hitL1 + wthrough + mrL1*mrL2*penaltyL2
wthrough = 20ns + 1ns = 21ns (L2 hit time plus the transfer of one word from L1 to L2)
AMATwrite = 1 + 21 + 0.05*0.4*78 = 23.56ns
With the change (x is the new L2 miss rate; the write-through cost becomes 21ns + 1ns = 22ns):
AMATwrite = 1 + 22 + 0.05*x*78 = 23 + 3.9x
23 + 3.9x < 23.56
x < 0.56/3.9 = 14.36%
The relative reduction must be
(0.4-x)/0.4 > 64.1%
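The arithmetic above can be reproduced with a short Python sketch. The variable names are hypothetical; the numbers and formulas are the ones used in the solution.

```python
hit_l1 = 1.0                 # ns
mr_l1 = 0.05
mr_l2 = 0.4
word_transfer = 1.0          # ns per 4B word between L1 and L2
words_per_l1_block = 32 // 4

# Read path: an L1 miss fetches a whole 32B L1 block from L2.
hit_l2_read = 20.0 + words_per_l1_block * word_transfer
# 30% of replaced L2 blocks are dirty and cost an extra memory access.
penalty_l2 = (1 + 0.3) * 60.0
amat_read = hit_l1 + mr_l1 * (hit_l2_read + mr_l2 * penalty_l2)

# Write path: every write goes through to L2, one word at a time.
write_through = 20.0 + word_transfer
amat_write = hit_l1 + write_through + mr_l1 * mr_l2 * penalty_l2

# Fully associative L2: hit time 21ns, unknown miss rate x.
write_through_new = 21.0 + word_transfer
x_max = (amat_write - hit_l1 - write_through_new) / (mr_l1 * penalty_l2)
relative_reduction = (mr_l2 - x_max) / mr_l2

print(f"{amat_read:.2f}")           # 3.96
print(f"{amat_write:.2f}")          # 23.56
print(f"{x_max:.4f}")               # 0.1436
print(f"{relative_reduction:.3f}")  # 0.641
```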