CIS 662 Final

Name:_________________________ Points:___________/100

1. (30 points) For a given loop and a standard MIPS pipeline (EX stage for ADD.D takes 4 cycles, EX stage for any other instruction takes 1 cycle, branches are resolved in ID stage):

Loop: L.D    F2, 100(R1)
      L.D    F4, 500(R1)
      ADD.D  F6, F2, F4
      S.D    F6, 100(R1)
      DADDUI R1, R1, #4
      DADDUI R2, R2, #-1
      BNEZ   R2, Loop

a) (10 points) How many cycles per iteration does this loop take? How many stalls are there?

      L.D    F2, 100(R1)
      L.D    F4, 500(R1)
      stall
      ADD.D  F6, F2, F4
      stall
      stall
      S.D    F6, 100(R1)
      DADDUI R1, R1, #4
      DADDUI R2, R2, #-1
      stall
      BNEZ   R2, Loop
      stall

12 cycles, 5 stalls

b) (10 points) Shuffle instructions around to minimize stalls. You may change offsets if necessary. How many cycles per iteration does the modified loop take and how many stalls are left?

      L.D    F2, 100(R1)
      L.D    F4, 500(R1)
      DADDUI R2, R2, #-1
      ADD.D  F6, F2, F4
      DADDUI R1, R1, #4
      BNEZ   R2, Loop
      S.D    F6, 96(R1)

7 cycles, no stalls

c) (10 points) Unroll the loop twice (so that there are a total of two iterations in the unrolled code) and rearrange the code so that there are no remaining stalls. How many cycles does one iteration of the original loop take now?

      L.D    F2, 100(R1)
      L.D    F4, 500(R1)
      L.D    F8, 104(R1)
      L.D    F10, 504(R1)
      ADD.D  F6, F2, F4
      ADD.D  F12, F8, F10
      DADDUI R2, R2, #-2
      DADDUI R1, R1, #8
      S.D    F6, 92(R1)
      BNEZ   R2, Loop
      S.D    F12, 96(R1)

11/2 = 5.5 cycles

2. (20 points) Explain what speculation is and how it is implemented in Tomasulo’s algorithm. You must specify the resources used by Tomasulo’s algorithm to keep track of speculative instructions, how those resources are used, and what happens if the speculative decision was wrong.

Speculation is the execution of the code following a branch before the outcome of the branch is known. The code is fetched either sequentially or from a target, using branch prediction and the branch target buffer. To perform execution safely we must have a way to undo executed instructions if the branch was mispredicted.
This means that no instruction following a branch is allowed to write to memory or registers until we know the outcome of the branch. In Tomasulo’s algorithm this is implemented via a reorder buffer. The results of all instructions are written to the reorder buffer, and a new stage, commit, is added to all instructions. During the commit stage results are transferred from the reorder buffer to memory or registers. Instructions must commit in order. When a branch commits, if the prediction was correct, all instructions following the branch will be able to commit in later cycles. If the prediction was incorrect, the reorder buffer is flushed and thus, effectively, the instructions are undone.

3. (10 points) Explain what a 3,2 correlating predictor looks like: how many bits it has per branch and how it can be used to predict a branch outcome. How does this predictor change states?

A 3,2 predictor uses 2-bit predictors that are selected by the outcomes of the 3 previous branches. The predictor thus has 2^3 = 8 slots per branch, with 2 bits in each slot (16 bits per branch). It would look like this in its initial state:

00/00/00/00/00/00/00/00

The leftmost slot would be checked for prediction if the three previous branches were all not taken, the slot next to it would be checked if the first two branches were not taken and the last one was taken, etc. The two bits in the chosen slot change their state based on the state diagram of a two-bit predictor.

[State diagram of a two-bit predictor with four states: 11 (predict taken), 10 (predict taken), 01 (predict not taken), 00 (predict not taken); transitions between states are labeled Taken and Not taken.]

4. (40 points) You have a two-level cache with the following specifications:

o L1 is 64KB direct-mapped, write-through, no-write-allocate with 32B blocks, hit time of 1ns and miss rate of 5%
o L2 is 1MB 4-way set-associative, write-back, write-allocate with 128B blocks and on average 30% of blocks are dirty.
  Hit time is 20ns and miss rate is 40%
o Penalty to go from L2 to memory is 60ns (this stays the same regardless of how big a data chunk we are reading from/writing to memory)
o It costs 1ns to transfer one CPU word between L1 and L2. One CPU word is 4B.

We are considering making L2 fully associative. This would increase hit time to 21ns but would reduce the L2 miss rate. How large must the miss rate reduction be (relative to the original miss rate) so that this pays off in the case of writes? Hint: compare AMATwrite of the original and the modified processor.

AMATread = hitL1 + mrL1*(hitL2 + mrL2*penaltyL2)
hitL1 = 1ns
mrL1 = 0.05
hitL2 = 20ns + 32/4*1ns = 28ns
mrL2 = 0.4
penaltyL2 = (1+0.3)*60ns = 78ns
AMATread = 1 + 0.05*(28 + 0.4*78) = 3.96ns

AMATwrite = hitL1 + wthrough + mrL1*mrL2*penaltyL2
wthrough = 21ns (20ns L2 hit time + 1ns to transfer one word)
AMATwrite = 1 + 21 + 0.05*0.4*78 = 23.56ns

With the change (wthrough = 22ns, new L2 miss rate x):
AMATwrite = 1 + 22 + 0.05*x*78 = 23 + 3.9x
23 + 3.9x < 23.56
x < 0.56/3.9 ≈ 0.1436, i.e. the new miss rate must be below about 14.36%
Relative reduction is (0.4 - x)/0.4 > 64.1%
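
The arithmetic in the solution to question 4 can be checked with a short script. This is only a sketch under the same assumptions as the solution (a write-through of one word costs the L2 hit time plus 1ns, and the L2 miss penalty is scaled by the 30% dirty fraction). The function and constant names are illustrative and not part of the exam.

```python
# Sketch: reproduce the AMATwrite comparison from question 4.
# All names below are hypothetical helpers chosen for this example.

HIT_L1 = 1.0           # ns, L1 hit time
MR_L1 = 0.05           # L1 miss rate
MEM_PENALTY = 60.0     # ns, L2-to-memory penalty
DIRTY_FRACTION = 0.3   # average fraction of dirty L2 blocks
WORD_TRANSFER = 1.0    # ns to move one 4B word between L1 and L2


def l2_miss_penalty():
    """Memory penalty, plus a write-back for the dirty fraction of evictions."""
    return (1 + DIRTY_FRACTION) * MEM_PENALTY


def amat_write(hit_l2, mr_l2):
    """AMATwrite = hitL1 + wthrough + mrL1 * mrL2 * penaltyL2."""
    wthrough = hit_l2 + WORD_TRANSFER
    return HIT_L1 + wthrough + MR_L1 * mr_l2 * l2_miss_penalty()


def break_even_miss_rate(old_hit_l2, old_mr_l2, new_hit_l2):
    """L2 miss rate at which the modified cache matches the original AMATwrite."""
    target = amat_write(old_hit_l2, old_mr_l2)
    fixed_cost = HIT_L1 + new_hit_l2 + WORD_TRANSFER
    return (target - fixed_cost) / (MR_L1 * l2_miss_penalty())


original = amat_write(20.0, 0.4)
x = break_even_miss_rate(20.0, 0.4, 21.0)
relative_reduction = (0.4 - x) / 0.4

print(f"original AMATwrite: {original:.2f} ns")
print(f"break-even L2 miss rate: {x:.4f}")
print(f"required relative reduction: {relative_reduction:.1%}")
```

Any new L2 miss rate below the break-even value makes the fully associative L2 the better choice for writes; the script should reproduce the 23.56ns and roughly 64.1% figures derived by hand.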