Solutions

CSE 7381/5381
Computer Architecture
Solution set for HW #1
2.2 (a) Keep in mind that because the new design has a clock cycle time equal to 1.10
times the old clock cycle time, the new design must execute fewer instructions to
achieve the same execution time.
CPU Time = CPI x Clock Cycle Time x Instruction Count
= CPI x Clk x IC
For the original configuration, this equation yields:
CPU Time old = CPI old x Clk old x IC old
and for the modified configuration:
CPU Time new = CPI new x Clk new x IC new
= CPI old x 1.10 x Clk old x IC old
To find out how many loads must be removed to achieve the same performance, set the
above equations equal and solve for the ratio of instruction counts:
CPI old x Clk old x IC old = CPI old x 1.10 x Clk old x (IC old – R)
which yields R = (1 – 1/1.10) x IC old ≈ 0.091 x IC old. Thus, about 39.9% of the loads
(i.e., 9.1% of all instructions, out of the given 22.8% that are loads) must be replaced
with the new load/operation combined instruction for the performance of the old and new
configurations to be the same.
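The arithmetic can be checked with a few lines of Python. The 1.10 cycle-time factor and
the 22.8% load frequency are the exercise's givens; everything else is derived:

# Same execution time: IC_old = 1.10 * (IC_old - R)  =>  R/IC_old = 1 - 1/1.10
clock_factor = 1.10     # new clock cycle time relative to the old one
load_fraction = 0.228   # fraction of the old instruction count that are loads
removed = 1 - 1 / clock_factor            # instructions removed, as a fraction of IC old
print(round(removed, 3))                  # 0.091
print(round(removed / load_fraction, 3))  # 0.399, i.e., about 39.9% of the loads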
(b) In this exercise, we are asked for a sequence of instructions for which the
register-memory addressing mode described in the exercise cannot be used.
Consider a slightly modified version of the code presented in the exercise description:
ld r1, 0(r1)
add r1, r1, r1
In this code we have simply replaced registers r2 and rb in the original
sequence with register r1. If we assume that r1 has the value 47 and the data stored at
location 47 in memory has the value 4 before the code sequence, then r1 will contain 8
after this code sequence executes. However, if we use the register-memory version of this
sequence:
add r1, 0(r1)
then, assuming the same register and memory contents as above, r1 has the value 51
(47 + 4) after execution. Thus, the register-register sequence cannot be replaced with the
register-memory instruction in this case.
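A toy simulation makes the aliasing problem concrete. This is a sketch in Python, not
DLX semantics: a dict stands in for memory and a plain variable for r1, with the initial
values from the text (r1 = 47, memory location 47 holds 4):

mem = {47: 4}

# Register-register sequence: ld r1, 0(r1) ; add r1, r1, r1
r1 = 47
r1 = mem[r1]       # ld r1, 0(r1)   -> r1 = 4
r1 = r1 + r1       # add r1, r1, r1 -> r1 = 8
print(r1)          # 8

# Register-memory sequence: add r1, 0(r1)
r1 = 47
r1 = r1 + mem[r1]  # add r1, 0(r1)  -> 47 + 4 = 51
print(r1)          # 51: the two sequences disagree, so the substitution is illegal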
3.1 (a) Answering this question only requires going through one complete iteration of
the loop and the first instruction in the next iteration. There are several cycles lost to
stalls:
• Cycles 3-4: addi stalls ID to wait for lw to write back r1.
• Cycles 6-7: sw stalls ID to wait for addi to write back r1.
• Cycles 10-11: sub stalls ID to wait for addi to write back r2.
• Cycles 13-14: bnz stalls ID to wait for sub to write back r4.
• Cycles 16-17: bnz computes the next PC in MEM, implying that the lw of the next
iteration cannot be fetched until after cycle 17 (note the fetch in cycle 15 is also wasted).
We have assumed the version of DLX described in Figure 3.21 in the text, which resolves
branches in MEM. The second iteration begins 17 clocks after the first iteration and the
last iteration takes 18 cycles to complete. This implies that iteration i (where iterations
are numbered from 0 to 98) begins on clock cycle 1 + (i x 17). Since the loop executes
99 times, it runs for a total of (98 x 17) + 18 = 1684 clocks.
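The 17-cycle iteration interval is just the sum of the pieces listed above. A short Python
check (a sketch using the stall counts from the bullets) reproduces the 1684-clock total:

instructions = 6     # lw, addi, addi, sub, bnz, sw
data_stalls = 4 * 2  # four 2-cycle stalls (cycles 3-4, 6-7, 10-11, 13-14)
branch_penalty = 3   # branch resolved in MEM (cycles 15-17 lost)
per_iter = instructions + data_stalls + branch_penalty  # 17
last_iter = instructions + data_stalls + 4  # final branch falls through; +4 to drain the pipeline
print(98 * per_iter + last_iter)  # 1684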
3.1 (b) Very similar to 3.1(a) except that branches are resolved by predicting them as
not taken and that normal forwarding is implemented for all the functional units
needed for this code sequence.
There are two stalls:
• Cycle 4: addi stalls EX to wait for lw to forward r1.
• Cycles 8-10: bnz computes the next PC in MEM, implying that the lw cannot be
fetched until after cycle 10 (note the fetch in cycle 7 is also wasted).
Since the bnz is predicted as not taken, the instruction following bnz in memory is
fetched and goes to decode before the pipeline realizes that it has mispredicted the branch
and the correct instruction (lw) is fetched and started. The cycles lost to the misprediction
assume that the branch is resolved in MEM, as shown in Figure 3.21 of the text. One
could also use the version of DLX specified in Figure 3.23 of the text,
which resolves branches in ID. Doing so would change the outcome of this exercise
slightly.
Now, the second iteration begins 10 clocks after the first iteration and the last iteration
takes 11 cycles to finish. This implies that iteration i (where iterations are numbered from
0 to 98) begins on clock cycle 1 + (i x 10). Since the loop executes 99 times, it runs for
a total of (98 x 10) + 11 = 991 clocks.
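The same accounting works here; a sketch of the breakdown (again assuming the
Figure 3.21 pipeline) reproduces the 991-clock total:

instructions = 6    # lw, addi, addi, sub, bnz, sw
load_use_stall = 1  # cycle 4: addi waits one cycle for lw's forwarded result
branch_penalty = 3  # mispredicted taken branch, resolved in MEM (cycles 8-10)
per_iter = instructions + load_use_stall + branch_penalty  # 10
last_iter = instructions + load_use_stall + 4  # final branch falls through; +4 to drain
print(98 * per_iter + last_iter)  # 991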
3.1 (c) There are two simple optimizations that can be used to remove the stalls
from the loop. First, we fill the load delay slot by moving the increment of r2 before the
increment of r1 (which depends on the lw instruction). This allows the addi instructions
to execute without stalling, since the result of the lw is not needed until a cycle later.
Second, we want to fill the branch delay slot, and we can use the sw instruction if we
adjust the offset value to reflect that r2 has already been incremented before the sw
instruction is executed.
One possible schedule of the code is shown in the following:
loop:
lw   r1,0(r2)   ; load A[I]
addi r2,r2,#4   ; bump A[I] pointer
addi r1,r1,#1   ; bump A[I]
sub  r4,r3,r2   ; loop term. test...
bnz  r4,loop    ; go back if not done
sw   r1,-4(r2)  ; store A[I] (fills the branch delay slot)
In general, to schedule pipelined code, you should attempt to ensure that any instruction
i using a result of an instruction j does not issue before the latency of instruction j has
been satisfied. Also, any code movement must not disturb the function of the program.
A calculation similar to those in the previous two parts yields a total of (98 x 6) + 10 =
598 cycles: each iteration now issues its six instructions without stalling, and the final
iteration takes four extra cycles to drain the pipeline.
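A minimal check of that arithmetic:

instructions = 6  # the scheduled loop has no stalls and a filled branch delay slot
per_iter = instructions       # 6 cycles between iteration starts
last_iter = instructions + 4  # +4 cycles to drain the five-stage pipeline
print(98 * per_iter + last_iter)  # 598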
7.9 This exercise considers a fat tree like that shown in Figure 7.14 in the text but with
half as many processors. However, the exercise does not specify whether you are to use
the same size switches as Figure 7.14. The results of the comparison may differ if you
elect to implement the fat tree with smaller switches in order to make the switch size
consistent with the other interconnects being examined.
In each of the systems we consider, there are eight processing elements connected
through one of three types of networks.
• Crossbar: Processing elements (PEs) are at most one switch away from any other PE.
• Omega: For eight PEs, this network places each PE exactly three switches away from
all other PEs.
• Fat tree: The number of switches traversed varies as the distance between the PEs
changes.
The table presents various communication times in cycles assuming the formats shown in
the text. We assume that a message takes one cycle to pass through a switch and that the
message faces no competition for switches during its transit through the network. Also,
for the fat tree we assume it is built out of four-input switches. This serves to skew the
comparison with the omega network, which uses only two-input switches. Reducing the
size of the fat tree's switches would add additional levels of switching, which would
only impact the worst-case time.
Network     Best case (cycles)   Worst case (cycles)   P0-P6   P1-P7
Cross-bar            1                    1               1       1
Omega                3                    3               3       3
Fat-tree             1                    3               3       3
While the crossbar and omega networks both provide consistent latencies regardless of
the routing, there can be differences in the fat tree latencies based on the source and
destination of the routing. This should be obvious from Figure 7.14. One of the nice
features of fat trees is that messages between local processing elements do not have to
travel all the way up into the network to be routed (e.g., compare the paths between P0
and P1 with the paths between P0 and P15 in Figure 7.14).
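The hop counts in the table can be reproduced with a small sketch. The topology
parameters here are our assumptions from the discussion above (eight PEs; an omega
network of log2(8) = 3 stages; a two-level fat tree of four-input switches, four PEs per
leaf switch), not a general routing model:

import math

P = 8  # number of processing elements

def crossbar_hops(src, dst):
    return 1  # every pair of PEs is separated by the single crossbar switch

def omega_hops(src, dst):
    return int(math.log2(P))  # every route crosses all log2(P) stages

def fat_tree_hops(src, dst):
    # Two-level tree of 4-input switches: PEs 0-3 share one leaf switch,
    # PEs 4-7 the other; cross-leaf traffic goes leaf -> root -> leaf.
    return 1 if src // 4 == dst // 4 else 3

for src, dst in [(0, 1), (0, 6), (1, 7)]:
    print(src, dst, crossbar_hops(src, dst), omega_hops(src, dst), fat_tree_hops(src, dst))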
Problem on the Cube Network:
Refer to class notes