Homework 4 - CPSC 614 Computer Architecture (TAMU)
Due: May 4, 10:00 AM
1. (15 pts) Briefly explain the rationale behind each of the following cache performance optimizations:
a) (3 pts) Giving priority to read misses over writes
b) (3 pts) Merging write buffer
c) (3 pts) Victim caches
d) (3 pts) Nonblocking caches
e) (3 pts) Prefetching
2. (20 pts) Suppose that a 32-bit byte-addressable CPU accesses the memory in the following order: 47332,
48256, 47350, 48201, 47400, 48209, 47377, and 48211. Assume that we have 4 cache blocks. Initially the
cache is empty.
a) (5 pts) If the cache is direct-mapped, show the final contents of the cache using memory block number.
Assume 1 block is 4 words. (Ignore the tag field and other bits.) What is the total number of misses?
b) (5 pts) If the cache is 2-way set associative with FIFO replacement policy, show the final contents of the
cache using memory block number. Assume 1 block is 8 words. (Ignore the tag field and other bits.) What is
the total number of misses?
c) (5 pts) If the cache is fully associative with LRU replacement policy, show the final contents of the cache
using memory block number. Assume 1 block is 2 words. (Ignore the tag field and other bits.) What is the
total number of misses?
d) (5 pts) For each cache (direct-mapped, 2-way set associative, and fully associative), calculate the total size of
the tag fields.
3. (20 pts) Consider a 1 GHz processor with single-level, split instruction and data caches. Both caches are
write-through with write buffers (assume the write buffer never stalls the processor), have single-cycle hit
times, and use 32-byte blocks. There is no L2 cache. Main memory is interleaved, with 4 independent 8-byte
memory banks and a 50 ns latency. The memory bus is 8 bytes wide and has a clock frequency of ¼ of the
processor frequency. The instruction cache miss rate is 0.1%, and the data cache miss rate is 5% for loads and
5% for stores. Assume 20% of the instructions are loads and 10% are stores. Determine:
a) (6 pts) The miss penalty.
b) (7 pts) The average memory access time.
c) (7 pts) The average number of bytes transferred on the bus (between memory and cache) for every 100
instructions executed. (Assume each store writes 32 bits of data.)
4. The transpose of a matrix interchanges its rows and columns. Here is a simple C loop to show the transpose:
for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
        output[j][i] = input[i][j];
    }
}
Assume both the input and output matrices are stored in row-major order (in row-major order, the column
index changes fastest). Assume you are executing a 256 × 256 double-precision transpose on a processor with a
16 KB fully associative (so you don't have to worry about cache conflicts) LRU-replacement level 1 data cache
with 64-byte blocks. Assume level 1 cache misses or prefetches require 16 cycles, always hit in the level 2
cache, and the level 2 cache can process a request every 2 processor cycles. Assume each iteration of the inner
loop above requires 4 cycles if the data is present in the level 1 cache. Assume the cache has a write-allocate
fetch-on-write policy for write misses. Unrealistically assume writing back dirty cache blocks requires 0 cycles.
For the simple implementation given above, this execution order is nonideal for the output matrix, whose
writes stride by a full row. Applying a loop interchange optimization would instead create a nonideal order for
the input matrix. Because loop interchange alone is not sufficient to improve performance, the loop must be
blocked instead.
a. (5 pts) What block size should be used to completely fill the data cache with one input and output block if the
level 1 cache is fully associative 64 KB?
b. (5 pts) What is the minimum associativity required of the level 1 cache for consistent performance
independent of both arrays’ position in memory?
c. (10 pts) Assume you are designing a hardware prefetcher for the unblocked matrix transposition code above.
The simplest type of hardware prefetcher only prefetches sequential cache blocks after a miss. More
complicated “nonunit stride” hardware prefetchers can analyze a miss reference stream, and detect and prefetch
nonunit strides. Assume prefetches write directly into the cache and cause no pollution (overwriting data that is
still needed before the prefetched data is used). For best performance given a nonunit-stride prefetcher, in the
steady state of the inner loop, how many prefetches must be outstanding at a given time?
5. For this problem, we will use a simplistic model to estimate the service time of a disk. In this basic model, the
average positioning and transfer time for a small random request is a linear function of the seek distance. For
the 40 GB disk in this problem, assume that the service time is 5 ms * space utilization. Thus, if the entire 40
GB disk is used, then the average positioning and transfer time for a random request is 5 ms; if only the first 20
GB of the disk is used, then the average positioning and transfer time is 2.5 ms.
Throughout this case study, you can assume that the processor sends 167 small random disk requests per second
and that the request interarrival times are exponentially distributed. You can assume that the size of the requests
is equal to the
block size of 8 KB. Each disk in the system has a capacity of 40 GB. Regardless of the storage system
configuration, the workload accesses a total of 40 GB of data; you should allocate the 40 GB of data across the
disks in the system in the most efficient manner.
Suppose that the storage system is configured to contain N 40 GB disks in a RAID 0 array; that is, the data is
striped in blocks of 8 KB across the N disks with no redundancy.
a. (10 pts) Describe how you can use queuing theory to model this system.
b. (16 pts) Investigate how response time changes as the number of disks in the RAID 0 array is increased.
Report the average service time, utilization, wait time, and the average response time for RAID 0 arrays with N
= 1, 2, 4, and 8 disks. You should continue to assume that the request workload is over a total of 40 GB.
6. (14 pts) Show how a RAID system can tolerate up to two disk failures using row-diagonal parity. Your
six-disk RAID system is based on RAID 4, with p = 5, as shown in the following table. If data disk 2 and data
disk 3 fail, how can those disks be reconstructed? Show the sequence of steps required to compute the missing
blocks in the first four stripes.
7. The following loop is the DAXPY loop, which implements the operation Y = aX + Y for a vector of length
100. Initially, R1 = 0 and F0 contains the scalar a.
foo:  L.D     F2, 0(R1)
      MUL.D   F4, F2, F0
      L.D     F6, 0(R2)
      ADD.D   F8, F4, F6
      S.D     F8, 0(R2)
      DADDIU  R1, R1, #8
      DADDIU  R2, R2, #8
      DADDIU  R3, R1, #-800
      BNEQZ   R3, foo
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP op            | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
(Branch latency: 1 cycle; integer ALU op to branch: 1-cycle stall.)
a. (10 pts) Show a software pipelined version of this loop. You may omit the start-up and clean-up code.