(15 pts) Exercise 4-81

A.) Draw a pipeline stage diagram for the following sequence of instructions. You don’t need fancy pictures, just text for each stage: ID, MEM, etc. SHOW the cycle number for each column, starting at cycle #1.

                    1    2    3    4    5    6    7    8    9    10
lw  $v0, 0($a0)     IF   ID   EX   MEM  WB
lw  $v1, 0($v0)          IF   ID   **   EX   MEM  WB
add $a0, $a0, $v1             IF   **   ID   **   EX   MEM  WB
sub $t0, $t0, $a0                       IF   **   ID   EX   MEM  WB

(** = stall cycle)
lw $v1: wait for $v0, needed in EX
add: wait for $v1, needed in EX
sub: no stall of its own b/c forwarding (the ** in cycle 6 is the add's stall ahead of it)

B.) Now assume that the above sequence of instructions is repeated 100 times (so the processor does a load, another load, an add, a sub, then back to a load, another load, an add, …). There are no branches, just these 4 instructions repeated 100 times. Do NOT reorder the code to improve efficiency.
HINT: this is more than a “30 second” question. Think carefully about when each instruction will execute.

i.) What is the total number of cycles needed to execute those 400 instructions?
No additional stalls are needed when going from the sub of one iteration to the first lw of the next. The sequence shown (one iteration) takes a total of 10 cycles. Each additional iteration adds 6 cycles onto this (4 cycles to start the 4 instructions, plus 2 stalls). So the total is 10 + 99 * 6 = 10 + 594 = 604

ii.) What is the average CPI for executing that sequence?
CPI = cycles / inst = 604 / 400 = 1.51, approx 1.5

(15 pts) Exercise 4-82

A.) Do the same thing as with Exercise 4-81, except assume that there is NO forwarding.

                    1    2    3    4    5    6    7    8    9    10   11   12   13   14
lw  $v0, 0($a0)     IF   ID   EX   MEM  WB
lw  $v1, 0($v0)          IF   **   **   ID   EX   MEM  WB
add $a0, $a0, $v1                       IF   **   **   ID   EX   MEM  WB
sub $t0, $t0, $a0                                      IF   **   **   ID   EX   MEM  WB

(** = stall cycle)
lw $v1: wait for $v0 to reach WB, need in ID
add: wait for $v1 to reach WB, need in ID
sub: wait for $a0 to reach WB, need in ID

You might also say that for instruction #2, ID starts in cycle 3, and at that point we realize that it needs to stall. We write ID in cycle 5 because ID needs to repeat at that point to read $v0 from the register file.
(This is UNLIKE the forwarding case, where after a stall the instruction normally just proceeds and gets the needed value via forwarding rather than from the register file.)

B.) Again assume that the above sequence of instructions is repeated 100 times (so the processor does a load, another load, an add, a sub, then back to a load, another load, an add, …). There are no branches, just these 4 instructions repeated 100 times. Do NOT reorder the code to improve efficiency. As with immediately above, assume there is NO forwarding.
HINT: again this is more than a “30 second” question. Think carefully about when each instruction will execute.

i.) What is the total number of cycles needed to execute those 400 instructions?
No additional stalls are needed when going from the sub of one iteration to the first lw of the next (the first lw, which needs $a0, won’t start until $a0 has already been written back by the add instruction). The sequence shown (one iteration) takes a total of 14 cycles. Each additional iteration adds 10 cycles onto this (4 cycles to start the 4 instructions, plus 6 stalls). So the total is 14 + 99 * 10 = 14 + 990 = 1004

ii.) What is the average CPI for executing that sequence?
CPI = cycles / inst = 1004 / 400 = 2.51, approx 2.5

(5 pts) Exercise 4-86

Can you rewrite this code to eliminate stalls? Circle YES or NO (as usual, assume we do have forwarding)
YES

1. lw  $a0, 0($t1)
2. lw  $v0, 0($a0)
3. sub $t1, $s2, $t3
4. sw  $v0, 4($s1)

If any improvement is possible (to reduce stalls), show the new version here:
Move #3 to #2, i.e. put the sub between the two lw instructions:

1. lw  $a0, 0($t1)
2. sub $t1, $s2, $t3
3. lw  $v0, 0($a0)
4. sw  $v0, 4($s1)

The sw won’t stall b/c it doesn’t need $v0 until the MEM stage, so all stalls are removed.
NOTE: when re-ordering, you must maintain program correctness! For instance, the load of $v0 must execute before the store of $v0, because that is the order of the original program. The subtract shares no registers with instruction #2, so it can move ahead of it (it must still come after #1, which reads $t1).

(10 pts) Exercise 4-87: True or False? (circle for each)

1. A pipelined implementation will have a faster clock rate than a comparable single cycle implementation ( TRUE or FALSE )
   True
2. Pipelining increases performance by splitting up each instruction into stages, thereby decreasing the time needed to execute each instruction ( TRUE or FALSE )
   False: individual instructions take longer to finish!
3. “Backwards” branches are likely to be taken ( TRUE or FALSE )
   True: backwards is most likely a loop
4. Dynamic branch prediction is done by the compiler ( TRUE or FALSE )
   False: done at runtime
5. Most modern processors use pipelining ( TRUE or FALSE )
   True
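
Cross-check for Exercises 4-81 and 4-82: the cycle totals (604 with forwarding, 1004 without) can be reproduced with a small timing model. This is only a sketch and not part of the original solution; the names `Instr`, `SEQUENCE`, and `total_cycles` are made up for illustration. It assumes the classic 5-stage pipeline, that every source operand in this particular sequence is needed in EX (with forwarding) or in ID (without), and that the register file can be written and read in the same cycle. It does not model the sw case from Exercise 4-86, where the value is not needed until MEM.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                # "lw", "add" or "sub"
    dest: str              # register written
    srcs: Tuple[str, ...]  # registers read


# One iteration of the sequence from Exercises 4-81 / 4-82.
SEQUENCE = [
    Instr("lw", "$v0", ("$a0",)),
    Instr("lw", "$v1", ("$v0",)),
    Instr("add", "$a0", ("$a0", "$v1")),
    Instr("sub", "$t0", ("$t0", "$a0")),
]


def total_cycles(iterations: int, forwarding: bool) -> int:
    """Cycle in which the last instruction finishes WB (assumed model)."""
    ready = {}    # register -> earliest cycle a consumer may use its value
    prev = None   # cycle in which the previous instruction consumed operands
    wb = 0
    for ins in SEQUENCE * iterations:
        if forwarding:
            # Operands are consumed in EX (stage 3); in-order, one per cycle.
            earliest = 3 if prev is None else prev + 1
            ex = max([earliest] + [ready.get(r, 0) for r in ins.srcs])
            # ALU result can feed the next cycle's EX; a load result is
            # only available after MEM, i.e. one cycle later.
            ready[ins.dest] = ex + (2 if ins.op == "lw" else 1)
            prev, wb = ex, ex + 2
        else:
            # Operands are read in ID (stage 2), no earlier than the
            # producer's WB cycle (write first half, read second half).
            earliest = 2 if prev is None else prev + 1
            id_cycle = max([earliest] + [ready.get(r, 0) for r in ins.srcs])
            ready[ins.dest] = id_cycle + 3  # the producer's WB cycle
            prev, wb = id_cycle, id_cycle + 3
    return wb


if __name__ == "__main__":
    for fwd in (True, False):
        cycles = total_cycles(100, fwd)
        cpi = cycles / (100 * len(SEQUENCE))
        print(f"forwarding={fwd}: {cycles} cycles, CPI = {cpi:.2f}")
```

With one iteration the model gives 10 and 14 cycles, matching the two stage diagrams; with 100 iterations it gives 604 and 1004, i.e. a CPI of about 1.5 and 2.5.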