Homework 4 - CPSC 614 Computer Architecture
Due: May 4, 10:00 AM

1. (15 pts) Briefly explain the rationale behind each of the following cache performance optimizations:
a) (3 pts) Giving priority to read misses over writes
b) (3 pts) Merging write buffer
c) (3 pts) Victim caches
d) (3 pts) Nonblocking caches
e) (3 pts) Prefetching

2. (20 pts) Suppose that a 32-bit byte-addressable CPU accesses memory at the following addresses, in order: 47332, 48256, 47350, 48201, 47400, 48209, 47377, and 48211. Assume the cache has 4 blocks and is initially empty.
a) (5 pts) If the cache is direct-mapped with 4-word blocks, show the final contents of the cache using memory block numbers. (Ignore the tag field and other bits.) What is the total number of misses?
b) (5 pts) If the cache is 2-way set associative with a FIFO replacement policy and 8-word blocks, show the final contents of the cache using memory block numbers. (Ignore the tag field and other bits.) What is the total number of misses?
c) (5 pts) If the cache is fully associative with an LRU replacement policy and 2-word blocks, show the final contents of the cache using memory block numbers. (Ignore the tag field and other bits.) What is the total number of misses?
d) (5 pts) For each cache (direct-mapped, 2-way set associative, and fully associative), calculate the total size of the tag fields.

3. (20 pts) Consider a 1 GHz processor with single-level, split instruction and data caches. Both caches are write-through with write buffers (assume the write buffer never stalls), and have a single-cycle hit time and 32-byte blocks. There is no L2 cache. Main memory is interleaved, with 4 independent 8-byte memory banks and a 50 ns latency. The memory bus is 8 bytes wide and has a clock frequency of 1/4 of the processor frequency. The instruction cache miss rate is 0.1%; the data cache miss rate is 5% for loads and 5% for stores. Assume 20% of the instructions are loads and 10% are stores.
Determine:
a) (6 pts) The miss penalty.
b) (7 pts) The average memory access time.
c) (7 pts) The average number of bytes transferred on the bus (between memory and cache) per 100 instructions executed. (Assume every store writes 32 bits of data.)

4. The transpose of a matrix interchanges its rows and columns, as in this simple C loop:

for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
        output[j][i] = input[i][j];
    }
}

Assume both the input and output matrices are stored in row-major order (row-major order means the column index varies fastest, so each row is contiguous in memory). Assume you are executing a 256 × 256 double-precision transpose on a processor with a 16 KB fully associative (so you don't have to worry about cache conflicts) level 1 data cache with LRU replacement and 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles and always hit in the level 2 cache, and that the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate, fetch-on-write policy for write misses. Unrealistically, assume writing back dirty cache blocks requires 0 cycles.

For the simple implementation given above, this execution order is nonideal for the input matrix; however, applying a loop interchange optimization would create a nonideal order for the output matrix. Because loop interchange alone is not sufficient to improve performance, the loop must be blocked instead.
a. (5 pts) What block size should be used to completely fill the data cache with one input block and one output block if the level 1 cache is fully associative and 64 KB?
b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance, independent of both arrays' positions in memory?
c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above.
The simplest type of hardware prefetcher prefetches only sequential cache blocks after a miss. More complicated "nonunit stride" hardware prefetchers can analyze a miss reference stream and detect and prefetch nonunit strides. Assume prefetches write directly into the cache and cause no pollution (overwriting data that must be used before the prefetched data). For best performance with a nonunit stride prefetcher, in the steady state of the inner loop, how many prefetches must be outstanding at a given time?

5. For this problem, we will use a simplistic model to estimate the service time of a disk. In this basic model, the average positioning and transfer time for a small random request is a linear function of the seek distance. For the 40 GB disk in this problem, assume that the service time is 5 ms × space utilization. Thus, if the entire 40 GB disk is used, the average positioning and transfer time for a random request is 5 ms; if only the first 20 GB of the disk is used, the average positioning and transfer time is 2.5 ms. Throughout this problem, you can assume that the processor sends 167 small random disk requests per second and that these requests are exponentially distributed. You can assume that the size of each request equals the block size of 8 KB. Each disk in the system has a capacity of 40 GB. Regardless of the storage system configuration, the workload accesses a total of 40 GB of data; you should allocate the 40 GB of data across the disks in the system in the most efficient manner.

Suppose that the storage system is configured to contain N 40 GB disks in a RAID 0 array; that is, the data is striped in blocks of 8 KB across the N disks with no redundancy.
a. (10 pts) Describe how you can use queuing theory to model this system.
b. (16 pts) Investigate how the response time changes as the number of disks in the RAID 0 array is increased.
Report the average service time, utilization, wait time, and average response time for RAID 0 arrays with N = 1, 2, 4, and 8 disks. Continue to assume that the request workload spans a total of 40 GB.

6. (14 pts) Show how a RAID system can tolerate up to two disk failures using row-diagonal parity. Your six-disk RAID system is based on RAID 4, with p = 5, as shown in the following table. If data disk 2 and data disk 3 fail, how can those disks be reconstructed? Show the sequence of steps required to compute the missing blocks in the first four stripes.

7. The following loop is the DAXPY loop, which implements the operation Y = aX + Y for a vector of length 100. Initially, R1 = 0 and F0 contains a.

foo: L.D     F2, 0(R1)
     MUL.D   F4, F2, F0
     L.D     F6, 0(R2)
     ADD.D   F8, F4, F6
     S.D     F8, 0(R2)
     DADDIU  R1, R1, #8
     DADDIU  R2, R2, #8
     DADDIU  R3, R1, #-800
     BNEQZ   R3, foo

The latencies between a producing instruction and a dependent consumer are:

Instruction producing result    Instruction using result    Latency in clock cycles
FP ALU op                       Another FP ALU op           3
FP ALU op                       Store double                2
Load double                     FP ALU op                   1
Load double                     Store double                0
Int ALU op                      Branch                      1 (one-cycle stall)

a. (10 pts) Show a software-pipelined version of this loop. You may omit the start-up and clean-up code.