outline of solutions

High Performance Computer Architecture (CS60002)
Mid-Spring Semester 2011-12
OUTLINE OF SOLUTIONS
1. Answer the following.
[6+8+4=18]
a. Consider a k-stage synchronous pipeline with stages S1, S2, …, Sk, and the corresponding
stage delays τ1, τ2, …, τk respectively. Let tmax and tmin denote the time delays of the
longest and shortest logic paths within a stage respectively. If d denotes the width of the
pipeline clock pulse, and s the maximum clock skew, derive an expression for the
maximum speedup of the pipelined processor over an equivalent non-pipelined processor
when processing n sets of data. Clearly state any assumptions you make.
For the non-pipelined processor, the latch delays between stages do not come into the
picture. The total processing time for n instructions will be:
n * (τ1 + τ2 + … + τk)
For the pipelined version, the time will be:
(k – 1 + n) * τ
Here τ is the pipeline clock period, which can be calculated as discussed in the class
(from the maximum stage delay, tmax and tmin).
The speedup is the ratio of the non-pipelined time to the pipelined time.
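The ratio described above can be checked numerically with a minimal Python sketch. The function name, the parameter names and the sample delays are illustrative assumptions, not part of the question; the clock period is passed in as a number because its derivation is left to the class discussion.

```python
def pipeline_speedup(n, stage_delays, clock_period):
    """Speedup of a k-stage pipeline over the non-pipelined version.

    n            -- number of data sets (or instructions) processed
    stage_delays -- list of the k individual stage delays
    clock_period -- pipeline clock period (assumed to be given)
    """
    k = len(stage_delays)
    t_nonpipelined = n * sum(stage_delays)    # n * (sum of stage delays)
    t_pipelined = (k - 1 + n) * clock_period  # (k - 1 + n) * clock period
    return t_nonpipelined / t_pipelined


# Hypothetical 4-stage example: delays in ns, clock period of 13 ns.
print(pipeline_speedup(100, [10, 12, 8, 10], 13))
```

As n grows, the ratio approaches (sum of stage delays) / (clock period), which is the maximum speedup under these assumptions.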
b. Consider a non-linear multifunction pipeline processor consisting of five stages S1, S2, S3,
S4 and S5, and two functions F1 and F2 with the corresponding stage utilizations for one
complete computation as follows:
F1 : S1 S2 S3 (S2 S4) S4 S1 S5   (requires 7 clock cycles)
F2 : S1 (S2 S4) (S3 S5) S2 S1   (requires 5 clock cycles)
where (Si Sj) indicates that the stages Si and Sj are being used simultaneously during the
same clock cycle. For both the functions, do the following.
i) Draw the reservation tables, and show the corresponding collision vectors.
Drawing the reservation tables is trivial. The collision vectors will be:
F1: (1 0 0 1 1)
F2: (1 0 1 0)
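The collision vectors can be cross-checked with a short Python sketch. It assumes each function is written as a list of sets, one set of busy stages per clock cycle; the helper names are illustrative. A latency is forbidden when some stage is used in two cycles that far apart, and the vector is written with the largest forbidden latency as the leftmost bit.

```python
def forbidden_latencies(schedule):
    """Return the set of forbidden latencies for one function.

    schedule -- list of sets; schedule[t] holds the stages busy in cycle t.
    """
    usage = {}
    for cycle, stages in enumerate(schedule):
        for stage in stages:
            usage.setdefault(stage, []).append(cycle)
    forbidden = set()
    for cycles in usage.values():
        for i in range(len(cycles)):
            for j in range(i + 1, len(cycles)):
                forbidden.add(cycles[j] - cycles[i])
    return forbidden


def collision_vector(schedule):
    """Collision vector as a bit string, leftmost bit = largest latency."""
    forbidden = forbidden_latencies(schedule)
    m = max(forbidden)  # assumes at least one stage is used more than once
    return "".join("1" if lat in forbidden else "0" for lat in range(m, 0, -1))


F1 = [{"S1"}, {"S2"}, {"S3"}, {"S2", "S4"}, {"S4"}, {"S1"}, {"S5"}]
F2 = [{"S1"}, {"S2", "S4"}, {"S3", "S5"}, {"S2"}, {"S1"}]

print(collision_vector(F1))  # 10011
print(collision_vector(F2))  # 1010
```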
ii) Draw the state diagrams showing the permissible state transitions among
successive initiations, and compute the corresponding values of Minimum Average
Latency (MAL) and Minimum Constant Latency (MCL).
[State diagrams (figures not reproduced). F1: state 10011, with arcs labelled 3, 4 and 6+.
F2: states 1010, 1111 and 1011, with arcs labelled 1, 3 and 5+.]
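The state transitions can be regenerated from the collision vectors with the usual shift-and-OR rule. The Python sketch below is illustrative (function names are assumptions): a latency p is permissible in a state when bit p of that state is 0, and the next state is the state shifted right by p, ORed with the initial collision vector. Latencies longer than the vector always return to the initial state, so they are not listed explicitly.

```python
def state_diagram(cv):
    """Map each reachable state to {permissible latency: next state}.

    cv -- collision vector as a bit string, leftmost bit = largest latency.
    """
    m = len(cv)
    initial = int(cv, 2)
    transitions = {}
    pending = [initial]
    while pending:
        state = pending.pop()
        if state in transitions:
            continue
        arcs = {}
        for latency in range(1, m + 1):
            if not (state >> (latency - 1)) & 1:  # bit for this latency is 0
                arcs[latency] = (state >> latency) | initial
        transitions[state] = arcs
        pending.extend(arcs.values())
    return {
        format(s, f"0{m}b"): {lat: format(t, f"0{m}b") for lat, t in arcs.items()}
        for s, arcs in transitions.items()
    }


def min_constant_latency(cv):
    """Smallest latency c such that no multiple of c is forbidden."""
    m = len(cv)
    forbidden = {m - i for i, bit in enumerate(cv) if bit == "1"}
    c = 1
    while any(mult in forbidden for mult in range(c, m + 1, c)):
        c += 1
    return c


print(state_diagram("10011"))
print(state_diagram("1010"))
print(min_constant_latency("10011"), min_constant_latency("1010"))  # 3 3
```

For both collision vectors the constant latency 3 is permissible, and the cycles (3) for F1 and (3) or (1, 5) for F2 each average 3 clock cycles, which is consistent with MAL = MCL = 3 for both functions.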
c. A non-pipelined processor X has a clock frequency of 600 MHz and an average CPI
(cycles per instruction) of 4. Processor Y, an improved successor of X, is designed with a
5-stage linear instruction pipeline. However, due to latch delay and clock skew, the clock
frequency of Y is only 450 MHz.
i)
If a program containing 100 instructions is executed on both processors, what is the
speedup of processor Y compared with that of processor X?
Time for non-pipelined processor T1 = 4 x 100 / (600 x 10^6)
Time for pipelined processor T2 = (100 + 5 – 1) / (450 x 10^6)
Speedup = T1 / T2 = 2.88
ii)
Calculate the MIPS rating of each processor during the execution of this particular
program.
MIPS for non-pipelined processor = 600 / 4 = 150
MIPS for pipelined processor = 100 x 450 / (100 + 5 – 1) ≈ 432.7
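The arithmetic for both parts can be reproduced with a small Python sketch. The function and parameter names are illustrative; the sketch assumes, as the calculation above does, that processor Y completes one instruction per cycle once the pipeline is full, with no stalls.

```python
def run_times(n, cpi_x, freq_x_hz, stages, freq_y_hz):
    """Execution times in seconds for X (non-pipelined) and Y (pipelined)."""
    t_x = n * cpi_x / freq_x_hz       # n instructions, cpi_x cycles each
    t_y = (n + stages - 1) / freq_y_hz  # pipeline fill + one per cycle
    return t_x, t_y


def mips(n, seconds):
    """Million instructions per second for n instructions in the given time."""
    return n / seconds / 1e6


t_x, t_y = run_times(100, 4, 600e6, 5, 450e6)
print(round(t_x / t_y, 2))       # 2.88
print(round(mips(100, t_x), 1))  # 150.0
print(round(mips(100, t_y), 1))  # 432.7
```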
2. Answer the following.
[4+8=12]
a. Three enhancements with the following speedups are proposed for a new architecture:
Speedup1 = 30, Speedup2 = 20, Speedup3 = 10. Only one enhancement is usable at a time.
If enhancements 1 and 2 are each usable for 30% of the time, what fraction of the time
must enhancement 3 be used to achieve an overall speedup of 10?
Speedup = 1 / [ (1 – (F1 + F2 + F3)) + F1/S1 + F2/S2 + F3/S3]
10 = 1 / [(1 – (0.3 + 0.3 + f)) + 0.3/30 + 0.3/20 + f/10]
So, f = 0.3611 = 36.11 %
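The value of f can be verified by solving the same equation in code. This Python sketch is illustrative (names are assumptions); it rearranges 1/Speedup = (1 – ΣF – f) + Σ(F/S) + f/S3 for the unknown fraction f.

```python
def required_fraction(target_speedup, known, s_unknown):
    """Fraction of time the last enhancement must be used.

    known     -- list of (fraction, speedup) pairs for the other enhancements
    s_unknown -- speedup of the enhancement whose fraction is sought
    """
    # Everything in the denominator that does not depend on f.
    fixed = 1 - sum(frac for frac, _ in known) + sum(frac / s for frac, s in known)
    # 1/target = fixed - f + f/s_unknown  =>  f = (fixed - 1/target) / (1 - 1/s_unknown)
    return (fixed - 1 / target_speedup) / (1 - 1 / s_unknown)


f = required_fraction(10, [(0.3, 30), (0.3, 20)], 10)
print(round(f, 4))  # 0.3611
```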
b. Consider the following fragment of C code:
for (i = 0; i <= 100; i++)
{   A[i] = B[i] + C;
}
Assume that A and B are arrays of 64-bit integers, and C and i are 64-bit integers.
Assume that all data values and their addresses are initially kept in memory (at addresses
0, 5000, 8000 and 8500 for A, B, C and i respectively). For efficiency of the code
generated, we decide to keep the values of C and i, and the addresses of the array
variables, in registers.
Write the corresponding code for MIPS64 processor, and compute the following.
        LD     R1,R0(0)       // POINT TO A
        LD     R2,R0(5000)    // POINT TO B
        LD     R3,R0(8000)    // POINT TO C
        LD     R4,R0(8500)    // POINT TO i
        DADDI  R5,R0,#101     // LOOP COUNTER
LOOP:   DADD   R7,R2,R4
        LD     R6,R7(0)
        DADD   R8,R6,R3
        DADD   R7,R1,R4
        SD     R8,R7(0)
        DADDI  R4,R4,#1
        DSUBI  R5,R5,#1
        BNEZ   R5,LOOP
THERE ARE ERRORS IN THIS CODE. THE ARRAY POINTERS ARE NOT UPDATED CORRECTLY. TRY TO FIND OUT.
i)
The number of instructions executed
5 + 8 x 101 = 813
ii)
The number of memory data references
4 + 2 x 101 = 206
iii) The code size in bytes.
13 x 4 = 52 bytes
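The three counts follow from the shape of the code as written: 5 setup instructions (4 of them loads), an 8-instruction loop body with one load and one store, and 101 iterations. A minimal Python sketch of the bookkeeping, assuming fixed 4-byte instruction encodings (the variable names are illustrative):

```python
SETUP_INSTRUCTIONS = 5   # four LD plus one DADDI before the loop
SETUP_LOADS = 4          # the four LD instructions
BODY_INSTRUCTIONS = 8    # instructions from LOOP to BNEZ inclusive
BODY_MEMORY_REFS = 2     # one LD and one SD per iteration
ITERATIONS = 101         # i = 0 .. 100
BYTES_PER_INSTRUCTION = 4  # assumption: fixed 32-bit encoding

instructions_executed = SETUP_INSTRUCTIONS + BODY_INSTRUCTIONS * ITERATIONS
memory_references = SETUP_LOADS + BODY_MEMORY_REFS * ITERATIONS
code_size_bytes = (SETUP_INSTRUCTIONS + BODY_INSTRUCTIONS) * BYTES_PER_INSTRUCTION

print(instructions_executed, memory_references, code_size_bytes)  # 813 206 52
```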
3. Answer the following.
[(12+3)+(5+5+5)=30]
a. Draw a schematic diagram of the 5-stage integer pipeline for the MIPS64 instruction set,
and hence show the micro-operations that are carried out in the five stages. What
modifications are required to implement data forwarding?
b. Consider the following code fragment of MIPS64:
loop:   LD     R1, 0(R2)      // R1 = M[0 + R2]
        DADDI  R1, R1, #1     // R1 = R1 + 1
        SD     R1, 0(R2)      // M[0+R2] = R1
        DADDI  R2, R2, #8     // R2 = R2 + 8
        DSUB   R4, R3, R2     // R4 = R3 – R2
        BNEZ   R4, loop       // branch if R4 not zero
i)
Show the partial timing diagram (for one loop) of this instruction sequence for the
MIPS64 pipeline without any forwarding or bypassing hardware. Assume two
register reads and one register write can be carried out in one clock cycle. Also
assume that the branch is handled by flushing the pipeline. If there are no structural
hazards while accessing memory, how many clock cycles does this loop take to
execute?
ii)
Show the partial timing diagram (for one loop) and calculate the number of clock
cycles required to execute the entire loop, assuming that the normal forwarding/
bypassing hardware has been implemented, and the branch is handled by predicting
it as not taken.
iii) Assume now that branch is handled with a single-cycle delayed branch (that is,
there is one branch delay slot). Try to fill up the branch delay slot by reordering
(scheduling) instructions. Again show the partial timing diagram (for one loop) and
calculate the number of clock cycles required to execute the entire loop.
THE SOLUTIONS TO THIS QUESTION ARE NOT SHOWN. THEY
WERE COVERED IN THE CLASS.