CSC/ECE 506: Architecture of Parallel Computers

Problem Set 1
Due Wednesday, January 25, 2012

Problems 1, 2, and 4 will be graded. There are 55 points on these problems.

Note: You must do all the problems, even the non-graded ones. If you do not do some of them, half as many points as they are worth will be subtracted from your score on the graded problems.

Problem 1. (15 points) In message-passing models, each process is provided with a special variable or function that gives its unique number or rank among the set of processes executing the program. Most shared-memory programming systems provide a fetch&inc (fetch-and-increment) operation, which reads the value of a location and atomically increments the value at the location.

(a) Write a little pseudocode to show how this fetch&inc instruction can be used to assign each process a unique number. Assume that the number of processes is known and is stored in the variable number_procs, and that each process waits until all processes have been assigned a rank before proceeding. Comment your pseudocode to clearly indicate the variable that holds the rank and the variable on which the fetch&inc is performed.

(b) If the number of processes is not known, can you determine it in a manner similar to part (a) above?

Problem 2. (25 points) A parallel computation on an n-processor system can be characterized by a pair P(n), T(n), where P(n) is the total number of instructions executed by all the processors and T(n) is the elapsed execution time for the entire system (measured in number of instructions). (You may assume that all instructions take the same amount of time.) P(n), for n > 1, may be greater than P(1) because some of the processors have to do extra "redundant" work to synchronize or avoid excessive memory contention. However, assume that P(n) is never less than P(1). In a serial computation, all instructions are performed by a single processor, so P(1) = T(1).
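(Returning briefly to Problem 1: the semantics of the fetch&inc primitive can be illustrated with the following minimal Python sketch. This is only a model of the operation, not a solution to the problem; the lock used to emulate hardware atomicity and all variable names are illustrative assumptions.)

```python
import threading

class SharedCounter:
    """Models a shared-memory location that supports fetch&inc."""
    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_inc(self):
        # Atomically read the current value and increment the location,
        # returning the value read.
        with self._lock:
            old = self._value
            self._value += 1
            return old

number_procs = 4                     # number of "processes" in this sketch
counter = SharedCounter()            # location on which fetch&inc is performed
ranks = [None] * number_procs        # rank obtained by each process

def worker(i):
    # Each process receives a distinct value, because no two calls to
    # fetch_and_inc can observe the same old value.
    ranks[i] = counter.fetch_and_inc()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(number_procs)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# At this point every process holds a distinct rank in 0 .. number_procs-1.
```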
Usually, for n > 1, T(n) < P(n) because the computation will finish faster on a multiprocessor. Lee (1980) suggested five performance indices for comparing a parallel computation with a serial computation:

    S(n) = T(1) / T(n)              (the speedup)
    E(n) = T(1) / (n T(n))          (the efficiency)
    R(n) = P(n) / P(1)              (the redundancy)
    U(n) = P(n) / (n T(n))          (the utilization)
    Q(n) = T³(1) / (n T²(n) P(n))   (the quality)

Note: T²(n) = T(n) T(n).

(a) Prove that the following relationships hold in all possible comparisons of parallel to serial computations:

    (1) 1 ≤ S(n) ≤ n
    (2) E(n) = S(n) / n
    (3) U(n) = R(n) · E(n)
    (4) Q(n) = S(n) · E(n) / R(n)

(b) Based on the above definitions and relationships, give physical meanings of these performance indices.

Problem 3. (15 points) Below is a section of code that subtracts two N × N matrices and sets any negative values in the result to zero. The pseudocode for this is as follows:

    for (i = 1 to n) do
        C[i] = A[i] – B[i]      // (entire row)
        for_all C[i] < 0
            set C[i] = 0
    end for

Using the array-processor instructions from Lecture 3 to set appropriate mask bits, implement the pseudocode above. You may assume that there is a row of size n containing all –1's loaded into location P in memory, and a row of all 0's in row Q in memory.

Problem 4. (15 points) Suppose a program that was being run on one processor is now run on a 100-processor machine. If a speedup of 80 is desired (on the 100-processor machine, as compared to the single processor), what fraction of the program can be serial? Use Amdahl's law.

Problem 5. (25 points) We have studied Amdahl's law, which gives us a limit on the amount of speedup we can expect in our parallel algorithms, relative to the fraction of the algorithm that contains non-parallelizable (serial) code. For the purposes of this problem, we will envision a parallel architecture with a large (effectively infinite) number of processors and a very large shared memory.
Every processor will be capable of accessing any part of the shared memory. Instead of considering the communication time separately, for simplicity's sake, assume that communication time is included with the serial portion of the algorithm.

The problem is composed of a portion p that can be performed in parallel, which is equal to (1 – s). To solve the problem on n processors, one can split the parallelizable portion of the problem in half and have more processors take the work. So,

With one processor:

    T = s + (1 – s)

On two processors we have the serial work, a split, then another serial section to reassemble the answer to the problem:

    T = s + (1 – s)/2 + s

(Assume that the time to split and reassemble the two halves of the problem does not depend on the number of elements in each half.)

We can halve the size of the data set on each iteration, doubling the number of processors at our disposal. Each time we do this, we incur an extra step of reassembly that must be performed in serial.

(a) Is there a maximum or a limit to the achievable speedup? Explain how you have reached your conclusion.

(b) Now assume that we can divide the data set into more than two fragments with each step. Will this affect the maximum achievable speedup?

(c) Comment on the efficiency of this approach.
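(As an aid to exploring Problem 5, the timing model above can be evaluated numerically. The sketch below assumes, following the two-processor example, that each split/reassembly step adds a serial term equal to s; the sample value s = 0.05 is an arbitrary illustration, and the sketch only tabulates the model without answering parts (a)–(c).)

```python
# Evaluate the Problem 5 execution-time model for a sample serial fraction.
s = 0.05  # illustrative value only

def T(k, s):
    # After k halving steps: the original serial portion, the parallel
    # portion split across 2**k processors, plus k serial reassembly
    # steps, each assumed (as in the two-processor case) to cost s.
    return s + (1 - s) / 2**k + k * s

for k in range(8):
    # T(0, s) = s + (1 - s) = 1 is the one-processor time.
    speedup = T(0, s) / T(k, s)
    print(f"processors = {2**k:4d}   T = {T(k, s):.4f}   speedup = {speedup:.2f}")
```

Tabulating T and the speedup for increasing k in this way shows how the shrinking parallel term trades off against the accumulating serial reassembly terms, which is the trade-off parts (a)–(c) ask about.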