CSC/ECE 506: Architecture of Parallel Computers

Problem Set 1
Due Wednesday, January 25, 2012
Problems 1, 2, and 4 will be graded. There are 55 points on these problems. Note: You must do
all the problems, even the non-graded ones. If you do not do some of them, half as many points
as they are worth will be subtracted from your score on the graded problems.
Problem 1. (15 points) In message-passing models, each process is provided with a special
variable or function that gives its unique number or rank among the set of processes executing
the program. Most shared memory programming systems provide a fetch&inc (fetch and
increment) operation, which reads the value of a location and atomically increments the value at
the location.
(a) Write a little pseudocode to show how this fetch&inc instruction can be used to assign each
process a unique number. Assume that the number of processes is known and is stored in the
variable number_procs, and that each process waits until all processes have been assigned a
rank before proceeding. Comment your pseudocode to clearly indicate the variable that
holds the rank and the variable on which the fetch&inc is performed.
(b) If the number of processes is not known, can you determine it in a manner similar to part (a) above?
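As an illustration of the mechanism only (not a solution to the problem), the sketch below simulates a hardware fetch&inc in Python with a lock-based class and uses it to hand out ranks to threads; the class name and variable names are invented for this example:

```python
import threading

class FetchAndInc:
    """Simulated fetch&inc: atomically return the old value, then increment."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_inc(self):
        with self._lock:
            old = self._value
            self._value += 1
            return old

number_procs = 4
counter = FetchAndInc()              # the variable fetch&inc is performed on
ranks = [None] * number_procs
barrier = threading.Barrier(number_procs)

def worker(i):
    my_rank = counter.fetch_and_inc()  # my_rank holds this process's rank
    ranks[i] = my_rank
    barrier.wait()                     # wait until every process has a rank

threads = [threading.Thread(target=worker, args=(i,)) for i in range(number_procs)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(ranks))  # → [0, 1, 2, 3]
```

Because fetch&inc returns the old value atomically, no two threads can receive the same rank, and the barrier plays the role of "wait until all processes have been assigned a rank."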
Problem 2. (25 points) A parallel computation on an n-processor system can be characterized
by a pair P(n), T(n), where P(n) is the total number of instructions executed by all the
processors and T(n) is the elapsed execution time for the entire system (measured in number of
instructions). (You may assume that all instructions take the same amount of time.) P(n), for
n > 1, may be greater than P(1) because some of the processors have to do extra “redundant”
work to synchronize or avoid excessive memory contention. However, assume that P(n) is never
less than P(1).
In a serial computation, all instructions are performed by a single processor, so P(1) = T(1).
Usually, for n > 1, T(n) < P(n) because the computation will finish faster on a multiprocessor.
Lee (1980) suggested five performance indices for comparing a parallel computation with a serial
computation.
S(n) = T(1) / T(n)                (the speedup)
E(n) = T(1) / (n T(n))            (the efficiency)
R(n) = P(n) / P(1)                (the redundancy)
U(n) = P(n) / (n T(n))            (the utilization)
Q(n) = T³(1) / (n T²(n) P(n))     (the quality)
Note: T²(n) = T(n) T(n)
(a) Prove that the following relationships hold in all possible comparisons of parallel to serial
computations:
(1) 1 ≤ S(n) ≤ n
(2) E(n) = S(n) / n
(3) U(n) = R(n) · E(n)
(4) Q(n) = S(n) · E(n) / R(n)
(b) Based on the above definitions and relationships, give physical meanings of these
performance indices.
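As a numeric sanity check of the four relationships (illustrative only; the problem asks for a general proof, not a spot check), one can plug in a hypothetical run with made-up values for T and P:

```python
# Hypothetical values: serial run with P(1) = T(1), and a 4-processor run.
T1, P1 = 100, 100
Tn, Pn, n = 30, 120, 4

S = T1 / Tn                      # speedup
E = T1 / (n * Tn)                # efficiency
R = Pn / P1                      # redundancy
U = Pn / (n * Tn)                # utilization
Q = T1**3 / (n * Tn**2 * Pn)     # quality

assert 1 <= S <= n               # (1)
assert abs(E - S / n) < 1e-9     # (2)
assert abs(U - R * E) < 1e-9     # (3)
assert abs(Q - S * E / R) < 1e-9 # (4)
print("all four relationships hold for this sample")
```

The assertions pass for any admissible choice of values (T(n) ≤ T(1), P(n) ≥ P(1)); the proof amounts to showing this algebraically from the definitions.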
Problem 3. (15 points) Below is a section of code that subtracts two N x N matrices and sets
any negative values in the result to zero. The pseudocode for this is as follows:
for (i = 1 to n) do
    C[i] = A[i] – B[i]    // (entire row)
    for_all C[i] < 0 set C[i] = 0
end for
Using the array-processor instructions from Lecture 3 to set appropriate mask bits, implement the
pseudocode above. You may assume that there is a row of size n containing all –1's loaded into
location P in memory, and a row of all 0's loaded into location Q in memory.
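For reference, here is the intended semantics of the pseudocode, i.e., what the mask-based array-processor version must compute, sketched in plain Python (the function name is invented for this illustration; it is not the required array-processor implementation):

```python
def subtract_and_clamp(A, B):
    """Reference semantics: C = A - B elementwise, negatives clamped to 0."""
    n = len(A)
    C = [[A[i][j] - B[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if C[i][j] < 0:   # where the array processor would set a mask bit
                C[i][j] = 0   # masked elements are overwritten with 0
    return C

print(subtract_and_clamp([[5, 2], [1, 9]],
                         [[3, 4], [0, 9]]))  # → [[2, 0], [1, 0]]
```

The array-processor version replaces the inner conditional with mask-bit manipulation so that the clamp is applied to an entire row at once.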
Problem 4. (15 points) Suppose a program that was being run on one processor is now run on
a 100-processor machine. If a speedup of 80 is desired (on the 100-processor machine, as
compared to the single processor), what fraction of the program can be serial? Use Amdahl’s
law.
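For reference, Amdahl's law states that speedup = 1 / (s + (1–s)/n), where s is the serial fraction. A short sketch of the law itself (the 1% figure below is just an illustrative value, not the answer to this problem):

```python
def amdahl_speedup(s, n):
    """Amdahl's law: speedup on n processors with serial fraction s."""
    return 1.0 / (s + (1.0 - s) / n)

# Even a 1% serial fraction caps 100 processors well below 100x:
print(amdahl_speedup(0.01, 100))   # ≈ 50.25
```

To solve the problem, set the expression equal to the desired speedup and solve for s.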
Problem 5. (25 points) We have studied Amdahl's Law, which gives us a limit on the amount of
speedup we can expect in our parallel algorithms, relative to the fraction of the algorithm that
contains non-parallelizable (serial) code. For the purposes of this problem, we will envision a
parallel architecture with a large (effectively infinite) number of processors and a very large
shared memory. Every processor will be capable of accessing any part of the shared memory.
Instead of considering the communication time separately, for simplicity’s sake, assume that
communication time is included with the serial portion of the algorithm.
The problem is composed of a portion p that can be performed in parallel, which is equal to (1–s).
To solve the problem on n processors, one can split the parallelizable portion of the problem in
half and have more processors take the work. So,
With one processor:
T = s + (1–s)
On two processors we have the serial work, a split, then another serial section to reassemble the
answer to the problem:
T = s + (1–s)/2 + s
(Assume that the time to split and reassemble the two halves of the problem does not depend on
the number of elements in each half.)
We can halve the size of the data set on each iteration, doubling the number of processors at our
disposal. Each time we do this, we incur an extra step of re-assembly that must be performed in
serial.
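One way to explore this model numerically (a sketch under the assumption, consistent with the one- and two-processor cases above, that each halving adds one serial reassembly step of cost s):

```python
def elapsed_time(s, k):
    """T after k halvings (2**k processors): s + (1-s)/2**k + k*s."""
    return s + (1.0 - s) / 2**k + k * s

s = 0.05                          # illustrative serial fraction
T1 = elapsed_time(s, 0)           # = s + (1-s) = 1
for k in range(8):
    T = elapsed_time(s, k)
    print(k, 2**k, round(T, 4), round(T1 / T, 2))  # halvings, procs, time, speedup
```

Plotting or tabulating T1/T as k grows is a useful starting point for part (a): the (1–s)/2**k term shrinks, while the k·s term keeps growing.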
(a) Is there a maximum or a limit to the achievable speedup? Explain how you have reached
your conclusion.
(b) Now assume that we can divide the data set into more than two fragments with each step. Will
this affect the maximum achievable speedup?
(c) Comment on the efficiency of this approach.