CSC/ECE 506: Architecture of Parallel Computers
Summer 2006
Problem Set 2
Due Wednesday, June 28, 2006

Problems 2, 3, and 4 will be graded. There are 60 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.

Problem 1. (20 points) [CS&G 7.2] A radix-2 FFT over n complex numbers is implemented as a sequence of log n completely parallel steps, requiring 5n log n floating-point operations while reading and writing each element of data log n times. If this data is spread over the memories in a NUMA design using either a cyclic or block distribution, then log(n/p) of the steps will access data in the local memory, assuming both n and p are powers of two, and in the remaining log p steps half of the reads and half of the writes will be local. (The choice of layout determines which steps are local and which are remote, but the ratio stays the same.) Calculate the communication-to-computation ratio on a distributed-memory design where each processor has a local memory, as in Figure 7.2. What communication bandwidth would the network need to sustain for the machine to deliver 250p MFLOPS on p processors?

Problem 2. (15 points) The following instruction sequence is for two processors, P1 and P2.

   1  P1: READ 4
   2  P2: READ 1
   3  P1: READ 1
   4  P1: WRITE 4
   5  P2: READ 1
   6  P1: READ 5
   7  P2: WRITE 5
   8  P1: READ 5
   9  P1: READ 6
  10  P2: READ 1
  11  P1: WRITE 6
  12  P2: READ 2
  13  P1: READ 7

Here, READ 4 is a read of location 4 in memory, given the following configuration. Block A and Block B are memory blocks that map to the same cache line. Each block contains 4 words. The values listed are memory addresses, not integer values.

  A: 1 2 3 4
  B: 5 6 7 8

Each processor has a direct-mapped cache with 2 cache lines that have 4 words per cache line.
Block transfers between memory and cache cost 8 clock cycles, while cache-to-cache transfers cost 2 clock cycles. Invalidations, as well as both read and write hits, cost 1 cycle. All caches are initially empty. Compute the total time needed to execute the given sequence, using the MESI protocol for a bus-based shared-memory multiprocessor.

When a miss happens, categorize it as one of the following:

  Cold misses occur the first time that a block is referenced.
  Conflict misses are misses that would not occur if the cache were (the same size, but) fully associative with LRU replacement.
  Capacity misses occur when the cache size is not sufficient to hold data between references.
  Coherence misses are misses caused by invalidations to preserve cache coherence.

For each coherence miss, indicate whether it is due to false sharing or true sharing. True sharing occurs when one processor writes some words in a cache block, invalidating that block in another processor's cache, after which the second processor reads one of the modified words.

Problem 3. (20 points) The write-back strategy and dirty bits are related to the write strategy of a cache. Assume that we have a memory in which the largest address referenced is FFFF. We also have a cache that has 16 blocks (lines). Consider the following references to this cache:

  1. read 04CF
  2. read F3C7
  3. write 0423
  4. write 0433
  5. write 2BC4
  6. read 04BF

This sequence is repeated 20 times, for a total of 120 accesses. Assume that

  the cache is empty initially and the entire block is fetched on a miss;
  the block size is 16 words;
  the cache allocates storage on write misses and uses the write-back policy.

(a) What is the total number of misses (read misses + write misses) for the cache, if the cache is direct mapped?
(b) What is the total number of write-backs of dirty blocks for this direct-mapped cache?
(c) Find the smallest direct-mapped cache size such that there are no misses besides compulsory misses.
(d) If the cache is 2-way set-associative, find the smallest cache size such that there are no misses other than compulsory misses.

Problem 4. (25 points) For each of the four different kinds of cache organizations, display one sequence of block references for which that organization outperforms the other three (i.e., one sequence for which direct mapping performs the best of all organizations, one sequence for which fully associative is the best of all, etc.; four sequences altogether). To make your answers short, assume that

  there are four block frames in the cache;
  the set or sector size (where applicable) is 2 block frames;
  in all cases, LRU replacement is used.

None of your examples should require more than about ten block references.

Problem 5. (20 points) Consider a parallel program using a client-server model. This program has one process acting as a server. The server continuously reads the status of the clients, then computes and stores results in memory. Each client process continuously reads the server's result and computes its local variables. The client then updates its status in memory.

(a) If this program were to run on a bus-based system, what kind of cache coherence protocol (invalidation or update) would provide the best result? Explain your answer.
(b) When using invalidation protocols, which type of coherence miss will most likely be encountered in the above program? How can we reduce such misses?
(c) Instead of using the client-server model, we can have all the processors compute and exchange data with each other. What are the advantages and disadvantages of this approach compared with the client-server model?
(d) Consider protocols that flush data out to the bus when one processor tries to read data that is already stored in another processor's cache. Instead of having to go to memory to fetch the data block, we can pick the data block up from the bus while it is being flushed to memory. This is called cache-to-cache sharing.
Is this an advantage over always fetching from memory? Under what conditions would fetching from memory be preferred?
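
Note on Problem 3: it can help to split each hexadecimal address into tag, index, and offset fields before tracing the references. The Python sketch below does this for a direct-mapped cache. It assumes word addressing and power-of-two sizes, which the problem does not state explicitly. The function name split_address and the example address 1A2B are illustrative only and are not taken from the problem.

```python
def split_address(addr_hex, block_words=16, num_lines=16):
    """Split a hex word address into (tag, index, offset).

    Sketch for a direct-mapped cache. Assumes word addressing and that
    block_words and num_lines are powers of two (assumptions of this
    example, not statements from the problem set).
    """
    addr = int(addr_hex, 16)
    offset_bits = block_words.bit_length() - 1  # log2(block_words)
    index_bits = num_lines.bit_length() - 1     # log2(num_lines)
    offset = addr & (block_words - 1)           # word within the block
    index = (addr >> offset_bits) & (num_lines - 1)  # cache line number
    tag = addr >> (offset_bits + index_bits)    # remaining high-order bits
    return tag, index, offset


# Hypothetical address, not one of the references in Problem 3.
tag, index, offset = split_address("1A2B")
print(f"1A2B: tag={tag:02X} index={index:X} offset={offset:X}")
```

Two addresses compete for the same line of a direct-mapped cache exactly when their index fields match and their tags differ. Changing num_lines shows how the fields shift as the cache size changes.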