Comp326 Practice Exercises
Quiz #4
Consider the following program segment computing the dot product of two floating-point
vectors whose addresses are stored in registers R1 and R2.
loop:  LD     F0, 0(R1)     ; load A[i]
       LD     F4, 0(R2)     ; load B[i]
       MULTD  F0, F0, F4    ; compute A[i] * B[i]
       ADDD   F2, F0, F2    ; accumulate sum in F2
       SUBI   R1, R1, #8    ; address of A[i-1]
       SUBI   R2, R2, #8    ; address of B[i-1]
       BNEZ   R1, loop      ; loop if necessary
       SD     0(R3), F2     ; store the dot product
Assume that register F2 is initialized to 0 and register R3 contains the address of the
memory location at which the dot product is to be stored. Also assume that the initial
value of register R1 is 400.
(a) [2 marks] Suppose this program is run on a machine whose cache has 8-word blocks.
Assuming that there are only compulsory misses (that is, capacity misses and conflict
misses are both zero), compute the miss rate of the given program on this machine.
Note that the miss rate is the fraction of memory accesses that are cache misses.
Include memory accesses for both instructions and data in your computation.
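
One way to check a hand calculation for (a) is to replay the program's memory accesses and count the distinct cache blocks touched, since with only compulsory misses every block misses exactly once. The sketch below is not an official solution. It assumes 4-byte words and 4-byte instructions, counts each instruction fetch and each LD/SD as one access, and uses hypothetical values for R2, R3 and the code address (only R1 = 400 comes from the exercise).

```python
WORD_BYTES = 4                 # assumption: 32-bit words
BLOCK_BYTES = 8 * WORD_BYTES   # 8-word blocks, from part (a)
INSTR_BYTES = 4                # assumption: fixed 4-byte instructions


def trace(r1=400, r2=1400, r3=4000, code_base=8192):
    """Yield the byte address of every memory access in program order.

    r1 = 400 is given; r2, r3 and code_base are hypothetical placeholders.
    """
    while True:
        for slot in range(7):                      # seven instructions per iteration
            yield code_base + slot * INSTR_BYTES   # instruction fetch
            if slot == 0:
                yield r1                           # LD F0, 0(R1)
            elif slot == 1:
                yield r2                           # LD F4, 0(R2)
        r1 -= 8                                    # SUBI R1, R1, #8
        r2 -= 8                                    # SUBI R2, R2, #8
        if r1 == 0:                                # BNEZ R1, loop falls through
            break
    yield code_base + 7 * INSTR_BYTES              # fetch of the final SD
    yield r3                                       # SD 0(R3), F2


def compulsory_misses(accesses):
    """With only compulsory misses, each distinct block misses exactly once."""
    return len({addr // BLOCK_BYTES for addr in accesses})


accesses = list(trace())
misses = compulsory_misses(accesses)
print(len(accesses), misses, misses / len(accesses))
```

The counts depend on how the vectors and the code happen to align with block boundaries, so different placement assumptions can shift the result by a block or two.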
(b) [2 marks] Suppose the machine in (a) has a 4-way interleaved memory system with a
memory access time of 10 cycles and a transfer time of one word per cycle. Compute
the miss penalty for this memory system.
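
The miss penalty in (b) can be sketched with a simple model of interleaved memory: the banks are accessed in parallel, then their words are transferred one per cycle, and the process repeats until the whole block has arrived. The function below is a minimal sketch under two assumptions that the exercise does not state: there is no separate address-send cycle, and successive rounds of bank accesses do not overlap with transfers. The function name and parameter names are illustrative.

```python
import math


def miss_penalty(block_words=8, banks=4, access_cycles=10, transfer_cycles_per_word=1):
    """Cycles to fetch one block from an interleaved memory (non-overlapped model)."""
    rounds = math.ceil(block_words / banks)      # parallel bank accesses needed
    words_per_round = min(banks, block_words)    # words delivered per round
    return rounds * (access_cycles + words_per_round * transfer_cycles_per_word)


print(miss_penalty())
```

Some textbooks add a cycle for sending the address or let the second bank access overlap the first transfer; either variation changes the result, so the model used in the course should take precedence.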
(c) [2 marks] Assume that the load and store instructions take 4 cycles, the floating
point instructions take 8 cycles, the branch instructions take 3 cycles, and all other
integer instructions take 5 cycles. Using the results from (a) and (b), determine how
much faster the given program will run on the described machine if all memory
accesses are cache hits.
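
The comparison in (c) can be set up as a ratio of two cycle counts: execution cycles plus miss stalls, divided by execution cycles alone. The sketch below assumes the loop runs 50 times (R1 goes from 400 down to 0 in steps of 8) and that every miss stalls the processor for the full miss penalty. The miss count and penalty are left as parameters, to be filled in from (a) and (b).

```python
# Instruction latencies in cycles, taken from part (c).
LOAD_STORE, FP, BRANCH, INT = 4, 8, 3, 5


def base_cycles(iterations=50):
    """Cycles with a perfect cache: the loop body plus the final SD."""
    per_iteration = 2 * LOAD_STORE + 2 * FP + 2 * INT + BRANCH  # LD,LD,MULTD,ADDD,SUBI,SUBI,BNEZ
    return iterations * per_iteration + LOAD_STORE


def speedup_with_perfect_cache(misses, penalty, iterations=50):
    """Ratio of run time with miss stalls to run time when every access hits."""
    ideal = base_cycles(iterations)
    return (ideal + misses * penalty) / ideal


print(base_cycles())
```

For example, calling `speedup_with_perfect_cache(misses=28, penalty=28)` evaluates the ratio for one illustrative pair of values; substitute your own results from (a) and (b).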
(d) [2 marks] Write a vector program which performs the same computation as the
given program. Your vector program should use vector instructions as much as
possible. Identify the vectorized and non-vectorized code in your program.
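
Before writing the vector assembly for (d), it may help to see where the computation splits. The element-wise products are independent and can be vectorized, while the running sum carries a dependence from one element to the next. The Python model below only illustrates that split and is not the requested vector program. The maximum vector length of 64 is an assumption, and the comments name the vector instructions that each step would loosely correspond to.

```python
MVL = 64  # assumption: maximum vector length of the vector registers


def dot_product_vector_style(a, b):
    """Model a strip-mined dot product: vector multiply, scalar accumulation."""
    total = 0.0
    for start in range(0, len(a), MVL):           # one strip of at most MVL elements
        va = a[start:start + MVL]                 # vectorized: vector load of A
        vb = b[start:start + MVL]                 # vectorized: vector load of B
        vp = [x * y for x, y in zip(va, vb)]      # vectorized: element-wise multiply
        for p in vp:                              # non-vectorized: serial sum into F2
            total += p
    return total


print(dot_product_vector_style([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

With 50 elements and the assumed vector length of 64, a single strip covers the whole computation, so the outer loop runs once.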
(e) [2 marks] Analyse your vector program in (d) and determine its performance in
clocks per result. Include both the vectorized part and the non-vectorized part of
your vector program in your analysis. Assume that you have one vector load-store
unit (12-cycle startup delay), one vector add unit (6-cycle startup delay), one vector
multiply unit (7-cycle startup delay), and one fully pipelined integer functional unit.