Parallel & Concurrent Programming
CS 5334

Problem 2.14
Given a processor array containing 8 processing elements, each capable of performing 10 million integer operations per second, determine the performance, in millions of operations per second, of this processor array when adding two integer vectors of sizes 10, 20, 30, 40, and 50.

Solution: Each processing element performs 10^7 operations per second, so a single integer operation takes 10^-7 seconds. With 8 processing elements, adding two vectors of length n requires ceil(n/8) clock cycles.

For vectors of size 10, the processor array requires two clock cycles, so the performance is 10 / (2 * 10^-7) = 5 * 10^7 operations per second.

For vectors of size 20, the processor array requires three clock cycles, so the performance is 20 / (3 * 10^-7) ≈ 6.67 * 10^7 operations per second.

For vectors of size 30, four clock cycles are needed, so the performance is 30 / (4 * 10^-7) = 7.5 * 10^7 operations per second.

For vectors of size 40, five clock cycles are needed, so the performance is 40 / (5 * 10^-7) = 8 * 10^7 operations per second.

For vectors of size 50, seven clock cycles are needed (six cycles cover only 48 elements), so the performance is 50 / (7 * 10^-7) ≈ 7.14 * 10^7 operations per second.
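These figures can be reproduced with a short calculation. The sketch below is illustrative only: the 8 processing elements and the 10^-7-second operation time come from the problem statement, while the function and variable names are my own.

```python
import math

PE_COUNT = 8      # processing elements in the array
OP_TIME = 1e-7    # seconds per integer operation (10 million ops/sec per element)

def performance(n):
    """Operations per second when the array adds two vectors of length n."""
    cycles = math.ceil(n / PE_COUNT)   # cycles needed to cover all n element-wise additions
    return n / (cycles * OP_TIME)

for n in (10, 20, 30, 40, 50):
    print(f"n = {n:2d}: {performance(n) / 1e6:6.2f} million operations per second")
```

Running it prints 50.00, 66.67, 75.00, 80.00, and 71.43 million operations per second for the five vector lengths.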
Problem 2.17
Why is the number of processors in a centralized multiprocessor limited to a few dozen?

Answer: In a centralized multiprocessor, all the processors share a common memory bus. As the number of processors increases, the volume of data transmitted over the bus increases as well. Because the bus has limited bandwidth, it can support only a few dozen processors before it becomes a bottleneck.

Problem 2.18
A directory-based protocol is a popular way to implement cache coherence on a distributed multiprocessor.

a. Why should the directory be distributed among the multiprocessor's local memories?

Answer: If the directory resided in only one local memory, the single processor attached to that memory would have to handle all cache coherence operations, which could become a bottleneck. Distributing the directory among the local memories spreads the cache coherence work across all processors and avoids concentrating it on any one of them.

b. Why are the contents of the directory not replicated?

Answer: In a distributed multiprocessor, each processor has its own local memory. As blocks of a local memory are loaded into the caches of other processors, coherence must be maintained between those cached copies and the original block in primary memory. Since memory is distributed rather than shared, the directory of each local memory must keep track of which processors hold cached copies of its own blocks, so that when one processor modifies its copy, all other cached copies can be invalidated. The contents of the primary memories differ, so each directory tracks cached copies of a different set of memory blocks. Replicating the directory would amount to assuming that every processor's memory contains exactly the same blocks, cached in exactly the same places, which is not the case.

Problem 2.19
Continue the illustration of the directory-based cache coherence protocol begun in Figure 2.16. Assume the following five operations now occur in the order listed:
1. CPU 2 reads X.
2. CPU 2 writes 5 to X.
3. CPU 1 reads X.
4. CPU 0 reads X.
5. CPU 1 writes 9 to X.
Show the states of the directories, caches, and memories after each of these operations.

Answer: See attachment.

Problem 2.22
Explain why contemporary supercomputers are invariably multicomputers.

Answer: The need to support both data and functional parallelism drives supercomputers to a multiple instruction, multiple data (MIMD) design. Multiprocessors and multicomputers both fall into this category, but multicomputers avoid the cache coherence problems of multiprocessors, which allows them to scale to the very large processor counts supercomputers require.
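As a supplement to Problems 2.18 and 2.19, the sketch below simulates a simple directory-based write-invalidate protocol for a single block X. It is a minimal illustration rather than the textbook's exact protocol: the initial condition from Figure 2.16 is not reproduced here (the block is assumed to start uncached, with an arbitrary value in its home memory), and the class, state, and variable names are all hypothetical.

```python
# Minimal sketch of a directory-based write-invalidate protocol for one block X.
# Assumption: X starts uncached with an arbitrary value (7) in its home memory;
# the actual starting state comes from Figure 2.16, which is not shown here.

UNCACHED, SHARED, EXCLUSIVE = "uncached", "shared", "exclusive"

class DirectoryBlock:
    def __init__(self, initial_value):
        self.state = UNCACHED      # directory state for the block
        self.sharers = set()       # CPUs currently holding a cached copy
        self.memory = initial_value
        self.caches = {}           # cpu -> value of the block in that cpu's cache

    def read(self, cpu):
        if self.state == EXCLUSIVE:
            # Write the dirty copy back to memory before sharing it.
            owner = next(iter(self.sharers))
            self.memory = self.caches[owner]
        self.caches[cpu] = self.memory
        self.sharers.add(cpu)
        self.state = SHARED

    def write(self, cpu, value):
        # Invalidate every other cached copy, then grant exclusive ownership.
        for other in list(self.sharers):
            if other != cpu:
                self.caches.pop(other, None)
        self.sharers = {cpu}
        self.caches[cpu] = value
        self.state = EXCLUSIVE

    def show(self, label):
        print(f"{label}: state={self.state}, sharers={sorted(self.sharers)}, "
              f"memory={self.memory}, caches={self.caches}")

x = DirectoryBlock(initial_value=7)
x.read(2);     x.show("1. CPU 2 reads X  ")
x.write(2, 5); x.show("2. CPU 2 writes 5 ")
x.read(1);     x.show("3. CPU 1 reads X  ")
x.read(0);     x.show("4. CPU 0 reads X  ")
x.write(1, 9); x.show("5. CPU 1 writes 9 ")
```

After each operation the script prints the directory state, the sharer set, the value in memory, and the cache contents, which is the information Problem 2.19 asks to be shown.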