CSE 7381/5381 Computer Architecture
Solution set for HW #1

2.2 (a)
Keep in mind that because the new design has a clock cycle time equal to 1.10 times the old clock cycle time, the new design must execute fewer instructions to achieve the same execution time.

CPU Time = CPI x Clock Cycle Time x Instruction Count = CPI x Clk x IC

For the original configuration, this equation yields:

CPU Time old = CPI old x Clk old x IC old

and for the modified configuration:

CPU Time new = CPI new x Clk new x IC new = CPI old x (1.10 x Clk old) x IC new

To find out how many loads must be removed to achieve the same performance, set the two CPU times equal and solve for the number of removed instructions R:

CPI old x Clk old x IC old = CPI old x (1.10 x Clk old) x (IC old - R)

which yields (IC old - R) / IC old = 1/1.10 = 0.91, so R = 0.09 x IC old. Thus about 9% of all instructions must be eliminated and, since loads make up 22.8% of the instruction mix, 9% / 22.8% = 39.5% of the loads must be replaced with the new load/operation combined instruction for the performance of the old and new configurations to be the same.
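The following Python sketch is a quick check of this arithmetic; its only inputs are the 1.10 clock-cycle ratio and the 22.8% load frequency from the exercise. (Carrying the unrounded 9.09% through the division gives 39.9%; the 39.5% above comes from rounding to 9% first.)

    # Check of the 2.2(a) arithmetic (plain Python, no dependencies).
    clock_ratio = 1.10   # new clock cycle time / old clock cycle time
    load_frac = 0.228    # loads as a fraction of the instruction mix

    # Equal CPU time with equal CPI requires IC_new = IC_old / 1.10, so
    # the fraction of all instructions that must be removed is:
    removed_frac = 1 - 1 / clock_ratio         # ~0.091, i.e. about 9%

    # Each replaced load folds a load into an ALU operation, removing
    # exactly one instruction, so the fraction of loads replaced is:
    loads_replaced = removed_frac / load_frac  # ~0.399

    print(f"instructions removed: {removed_frac:.1%}")    # 9.1%
    print(f"loads replaced:       {loads_replaced:.1%}")  # 39.9%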
(b)
In this exercise, we are asked for a sequence of instructions for which the register-memory addressing mode described in the exercise cannot be used. Consider a slightly modified version of the code presented in the exercise description:

ld  r1, 0(r1)
add r1, r1, r1

In this code we have simply replaced the registers of the original sequence with register r1. If we assume that r1 has the value 47 and the data stored at location 47 in memory has the value 4 before the code sequence, then r1 will contain 8 after this code sequence executes. However, if we use the register-memory version of this sequence:

add r1, 0(r1)

then, assuming the same register and memory contents as above, r1 has the value 51 (47 + 4) after execution. Thus, this case cannot be replaced with the new instruction sequence.

3.1 (a)
Answering this question only requires that we go through one complete iteration of the loop and the first instruction of the next iteration. Several cycles are lost to stalls:

• Cycles 3-4: addi stalls in ID waiting for lw to write back r1.
• Cycles 6-7: sw stalls in ID waiting for addi to write back r1.
• Cycles 10-11: sub stalls in ID waiting for addi to write back r2.
• Cycles 13-14: bnz stalls in ID waiting for sub to write back r4.
• Cycles 16-17: bnz computes the next PC in MEM, implying that the next iteration's lw cannot be fetched until after cycle 17 (note that the fetch in cycle 15 is also wasted).

We have assumed the version of DLX described in Figure 3.21 in the text, which resolves branches in MEM. The second iteration begins 17 clocks after the first, and the last iteration takes 18 cycles to complete. This implies that iteration i (where iterations are numbered from 0 to 98) begins on clock cycle 1 + (i x 17). As the loop executes 99 times, it takes a total of (98 x 17) + 18 = 1684 clocks.

3.1 (b)
This part is very similar to 3.1(a), except that branches are resolved by predicting them not taken and normal forwarding is implemented for all the functional units needed by this code sequence. There are two stalls:

• Cycle 4: addi stalls in EX waiting for lw to forward r1.
• Cycles 8-10: bnz computes the next PC in MEM, implying that the next iteration's lw cannot be fetched until after cycle 10 (note that the fetch in cycle 7 is also wasted).

Since the bnz is predicted not taken, the instruction following bnz in memory is fetched and enters decode before the pipeline realizes that it has mispredicted the branch, at which point the correct instruction (lw) is fetched and started. The cycles lost to the misprediction assume that the branch is resolved in MEM, as shown in Figure 3.21 of the text. One could also use the version of DLX specified in Figure 3.23 of the text, which resolves branches in ID; doing so would change the outcome of this exercise slightly. Now the second iteration begins 10 clocks after the first, and the last iteration takes 11 cycles to finish. This implies that iteration i (where iterations are numbered from 0 to 98) begins on clock cycle 1 + (i x 10). As the loop executes 99 times, it takes a total of (98 x 10) + 11 = 991 clocks.

3.1 (c)
Two simple optimizations remove the stalls from the loop. First, we fill the load delay slot by moving the increment of r2 ahead of the increment of r1 (which depends on the lw instruction); this allows the addi instructions to execute without stalling, since the result of the lw is not needed until an extra cycle later. Second, we want to fill the branch delay slot, and we can use the sw instruction if we adjust its offset to reflect that r2 has already been incremented by the time the sw executes. One possible schedule of the code is:

loop: lw   r1, 0(r2)   ; load A[I]
      addi r2, r2, #4  ; bump A[I] pointer
      addi r1, r1, #1  ; increment A[I]
      sub  r4, r3, r2  ; loop termination test
      bnz  r4, loop    ; go back if not done
      sw   r1, -4(r2)  ; store A[I] (branch delay slot)

In general, to schedule pipelined code, you should attempt to ensure that any instruction i using the result of an instruction j does not issue before the latency of instruction j has been satisfied. Also, any code movement must not disturb the function of the program. A calculation similar to the previous two parts yields a total of 598 cycles; a sketch of all three cycle-count calculations follows.
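As a cross-check on the totals for 3.1(a)-(c), here is a small Python sketch that recomputes them from the spacing between iterations and the length of the final iteration. The 6-cycle spacing and 10-cycle final iteration for part (c) are our assumption (six instructions with no stalls, plus four cycles for the last instruction to drain the pipeline); the other figures are stated in the solutions above.

    # Recompute the loop totals for 3.1(a)-(c).
    ITERATIONS = 99  # the loop executes 99 times (iterations 0..98)

    def total_cycles(spacing, last_iter):
        # Iterations begin 'spacing' clocks apart; the final iteration
        # runs 'last_iter' cycles to completion.
        return (ITERATIONS - 1) * spacing + last_iter

    print(total_cycles(17, 18))  # 3.1(a): 1684 clocks
    print(total_cycles(10, 11))  # 3.1(b):  991 clocks
    print(total_cycles(6, 10))   # 3.1(c):  598 clocks (assumed 6/10 split)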
7.9
This exercise considers a fat tree like that shown in Figure 7.14 in the text but with half as many processors. However, the exercise does not specify whether you are to use the same size switches as Figure 7.14, and the results of the comparison may differ if you elect to implement the fat tree with smaller switches in order to make the switch size consistent with the other interconnects being examined. In each of the systems we consider, there are eight processing elements (PEs) connected through one of three types of networks:

• Crossbar: PEs are at most one switch away from any other PE.
• Omega: For eight PEs, this network places each PE exactly three switches away from every other PE.
• Fat tree: The number of switches traversed varies with the distance between the PEs.

The table below presents various communication times in cycles, assuming the formats shown in the text. We assume that a message takes one cycle to pass through a switch and that the message faces no competition for the switches during its transit through the network. Also, for the fat tree we assume it is built out of four-input switches. This skews the comparison with the omega network, which uses only two-input switches; reducing the switch size would add additional levels of switching, which would only impact the worst-case time.

Network    Best case (cycles)  Worst case  P0-P6  P1-P7
Crossbar   1                   1           1      1
Omega      3                   3           3      3
Fat tree   1                   3           3      3

While the crossbar and omega networks both provide consistent latencies regardless of the routing, the fat tree latencies can differ depending on the source and destination of the route; this should be obvious from Figure 7.14. One of the nice features of fat trees is that messages between nearby processing elements do not have to travel all the way into the network to be handled (e.g., compare the paths between P0 and P1 with the paths between P0 and P15 in Figure 7.14). A small sketch of these hop counts appears after the final note below.

Problem on the Cube Network: Refer to class notes.
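The following Python sketch tabulates the switch counts in the 7.9 table above. The fat-tree model here is our assumption, chosen to match the stated best and worst cases: with four-input switches and eight PEs, PEs 0-3 and PEs 4-7 each share a leaf switch, so same-leaf routes cross one switch while cross-leaf routes climb to the upper level and cross three.

    # Hop counts (switches traversed) for the three 8-PE networks of 7.9.
    def crossbar_hops(src, dst):
        return 1  # every pair of PEs is exactly one switch apart

    def omega_hops(src, dst):
        return 3  # log2(8) = 3 stages, traversed on every route

    def fat_tree_hops(src, dst):
        # Assumed topology: two 4-input leaf switches (PEs 0-3, PEs 4-7).
        return 1 if src // 4 == dst // 4 else 3

    for name, hops in [("Crossbar", crossbar_hops),
                       ("Omega", omega_hops),
                       ("Fat tree", fat_tree_hops)]:
        pairs = [(s, d) for s in range(8) for d in range(8) if s != d]
        best = min(hops(s, d) for s, d in pairs)
        worst = max(hops(s, d) for s, d in pairs)
        print(f"{name:9} best={best} worst={worst} "
              f"P0-P6={hops(0, 6)} P1-P7={hops(1, 7)}")

Under these assumptions the sketch reproduces the table row for row: the crossbar is uniform at one switch, the omega network is uniform at three, and the fat tree ranges from one to three.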