Comp326 Practice Exercises 2

Quiz #4

Consider the following program segment computing the dot product of two floating-point vectors whose addresses are stored in registers R1 and R2.

    loop:   LD      F0, 0(R1)       ; load A[i]
            LD      F4, 0(R2)       ; load B[i]
            MULTD   F0, F0, F4      ; compute A[i] * B[i]
            ADDD    F2, F0, F2      ; accumulate sum in F2
            SUBI    R1, R1, #8      ; address of A[i-1]
            SUBI    R2, R2, #8      ; address of B[i-1]
            BNEZ    R1, loop        ; loop if necessary
            SD      0(R3), F2       ; store the dot product

Assume that register F2 is initialized to 0 and register R3 contains the address of the memory location at which the dot product is to be stored. Also assume that the initial value of register R1 is 400.

(a) [2 marks] Suppose this program is run on a machine whose cache has 8-word blocks. Assuming that there are only compulsory misses (that is, capacity misses and conflict misses are both zero), compute the miss rate of the given program on this machine. Note that the miss rate is the fraction of memory accesses that are cache misses. Include memory accesses for both instructions and data in your computation.

(b) [2 marks] Suppose the machine in (a) has a 4-way interleaved memory system with a memory access time of 10 cycles and a transfer time of one word per cycle. Compute the miss penalty for this memory system.

(c) [2 marks] Assume that the load and store instructions take 4 cycles, the floating-point instructions take 8 cycles, the branch instructions take 3 cycles, and all other integer instructions take 5 cycles. Using the results from (a) and (b), determine how much faster the given program will run on the described machine if all memory accesses are cache hits.

(d) [2 marks] Write a vector program which performs the same computation as the given program. Your vector program should use vector instructions as much as possible. Identify the vectorized and non-vectorized code in your program.

(e) [2 marks] Analyse your vector program in (d) and determine its performance in clocks per result. Include both the vectorized part and the non-vectorized part of your vector program in your analysis. Assume that you have one vector load-store unit (12-cycle startup delay), one vector add unit (6-cycle startup delay), one vector multiply unit (7-cycle startup delay), and one fully pipelined integer functional unit.
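
As a starting point for part (a), the control flow of the scalar loop can be traced to count how many times it runs and how many memory accesses it makes. The Python sketch below does only that. It assumes each instruction fetch counts as one memory access and each LD or SD as one data access, and it does not model the cache, so the number of misses is left to the reader. The function name trace_scalar_loop and its parameters are illustrative and not part of the quiz.

```python
def trace_scalar_loop(r1_start=400, stride=8):
    """Count iterations and memory accesses of the quiz's scalar loop.

    Assumption: one memory access per instruction fetch and one per
    LD/SD data reference. No cache behaviour is modelled here.
    """
    if r1_start <= 0 or r1_start % stride != 0:
        # BNEZ R1 only exits when R1 reaches exactly 0.
        raise ValueError("r1_start must be a positive multiple of stride")

    r1 = r1_start
    iterations = 0
    instr_fetches = 0
    data_accesses = 0

    while True:
        iterations += 1
        instr_fetches += 7   # LD, LD, MULTD, ADDD, SUBI, SUBI, BNEZ
        data_accesses += 2   # the two LD instructions
        r1 -= stride         # SUBI R1, R1, #8
        if r1 == 0:          # BNEZ R1, loop falls through
            break

    instr_fetches += 1       # the final SD instruction
    data_accesses += 1       # SD writes the dot product
    return iterations, instr_fetches, data_accesses


if __name__ == "__main__":
    n, fetches, data = trace_scalar_loop()
    print(n, fetches, data, fetches + data)
```

With the quiz's initial value R1 = 400, trace_scalar_loop() returns (50, 351, 101), for 452 memory accesses in total under the stated assumptions.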