CSC/ECE 506: Architecture of Parallel Computers
Problem Set 2
Due Wednesday, February 8, 2012

Problems 2, 3, and 5 will be graded. There are 60 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on graded problems.

Problem 1. (16 points; 4 points each) Assume we have a shared-address-space system built on top of a message-passing multiprocessor, and whenever a page is referenced that is not in local memory, it is transferred across the network. Assume:
(1) A page consists of 512 thirty-two-bit words.
(2) The bus is 64 bits wide and runs at 200 MHz.
(3) Before anything can be transferred across the bus, we must wait 2 cycles to arbitrate for the bus, plus the memory-access time of 30 ns.

(a) When a nonlocal page is referenced, how long does it take before the first word reaches the requesting processor?
(b) How much additional time does it take for the page to be transferred?
(c) How long does it take to transfer the page?
(d) What is the effective bandwidth of the transfer in megabytes per second?

Problem 2. (15 points) [Solihin 3.3] Code analysis. For the code shown below, …

    for (i=1; i<=N; i++) {
        for (j=2; j<=N; j++) {   // note the index range
            S1: a[i][j] = a[i][j-1] + a[i][j-2];
            S2: a[i+1][j] = a[i][j] * b[i-1][j];
            S3: b[i][j] = a[i][j];
        }
    }

(a) Draw the iteration-space traversal graph (ITG).
(b) List all the dependences, and clearly indicate which dependences are loop-independent and which are loop-carried.
(c) Draw the loop-carried dependence graph (LDG).

Problem 3. (20 points) A radix-2 FFT over n complex numbers is implemented as a sequence of log n completely parallel steps, requiring 5n log n floating-point operations while reading and writing each element of data log n times.
If this data is spread over the memories in a NUMA design using either a cyclic or block distribution, then log(n/p) of the steps will access data in the local memory, assuming both n and p are powers of two, and in the remaining log p steps half of the reads and half of the writes will be local. (The choice of layout determines which steps are local and which are remote, but the ratio stays the same.) Calculate the communication-to-computation ratio on a distributed-memory design where each processor has a local memory, as in Figure 7.2. What communication bandwidth would the network need to sustain for the machine to deliver 250p MFLOPS on p processors?

Problem 4. (24 points) Consider the simple problem of “triangulating” a polygon. To study the properties of a polygonal surface that depend on each point on that surface, the surface is divided into n small triangular regions. Each triangular region is an aggregation of points represented by a single point in the triangle. The complex equations are solved on each such representative point, and the resulting values at each point are summed to determine the current state of the surface at that instant. Assume that it takes 1 unit of time per point to solve the equations and 1 unit of time per point to calculate the sum.

(a) What would be the total cost of computation in a sequential program for this application?
(b) If this application is decomposed into a 2-phase program, describe each phase. What would be the cost of computation for the serial phase and for the parallel phase, the total cost of computation, and the maximum achievable speedup, both taking the number of processors into consideration and regardless of the number of processors?
(c) If the application is instead decomposed into a 3-phase program, what would be the answer to the above question? Instead of determining the maximum achievable speedup regardless of the number of processors, please explain what happens to the speedup as the number of processors increases, assuming n >> p.
(d) Is the “concurrency profile” regular or irregular for the above problem?
(e) Is this a case of “owner computes”, and why?
(f) Which of the 4 steps in parallelization is used to ensure consistency and coherence? Is the partitioning architecture-dependent?
(g) For maximum efficiency, how should the n tasks be divided among the p processors, i.e., how many tasks should each processor be responsible for? Which phase is responsible for this?

Problem 5. Ocean parallelization. (25 points) [Solihin 3.10] Given the code for a single sweep in Ocean (without the diff computation):

    for (i=1; i<=n; i++) {
        for (j=1; j<=n; j++) {
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i][j+1]
                             + A[i-1][j] + A[i+1][j]);
        }
    }

(a) Rewrite the code so that in the new loop nest, each iteration of one of the loops is a parallel task. Exploit the parallelism that exists on each anti-diagonal. Then, highlight the loop that is parallel in your new code.
(b) Rewrite the code to exploit parallelism using red-black partitioning. Identify the parallel loops.
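
For checking a rewrite in Problem 5, it may help to have the original sweep in compilable form. The following is only a sketch of a serial baseline, not a solution to either part. The grid size N, the array dimension N+2 (so that the boundary entries read by the stencil exist), and the helper names sweep_serial, init_grid, and grids_equal are all assumptions made for illustration; the problem itself leaves n symbolic and says nothing about boundaries or test data.

```c
#include <stddef.h>

/* Hypothetical grid size for illustration; the problem leaves n symbolic. */
#define N 6

/* Serial baseline: one sweep in the original row-major order.
   Assumption: A has N+2 rows and columns, so the boundary entries
   A[0][*], A[N+1][*], A[*][0], A[*][N+1] exist and are only read. */
void sweep_serial(double A[N + 2][N + 2])
{
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i][j + 1]
                             + A[i - 1][j] + A[i + 1][j]);
        }
    }
}

/* Fill the grid with a deterministic, non-linear pattern (assumed test
   data). A linear pattern would be left unchanged by the averaging. */
void init_grid(double A[N + 2][N + 2])
{
    for (int i = 0; i < N + 2; i++) {
        for (int j = 0; j < N + 2; j++) {
            A[i][j] = (double)(i * i + j);
        }
    }
}

/* Return 1 if the two grids are element-wise identical, 0 otherwise. */
int grids_equal(double A[N + 2][N + 2], double B[N + 2][N + 2])
{
    for (int i = 0; i < N + 2; i++) {
        for (int j = 0; j < N + 2; j++) {
            if (A[i][j] != B[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

One possible use: initialize two grids with init_grid, run sweep_serial on one and your rewritten sweep on the other, and compare them with grids_equal. A rewrite that only reorders iterations while respecting every dependence should match this baseline exactly. A red-black ordering updates points in a different order, so its result need not match element for element.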