RTF - NCSU COE People

advertisement
CSC/ECE 506: Architecture of Parallel Computers
Summer 2003 - Problem Set 1
Problems 1, 4 and 5 will be graded. There are 60 points on these problems. Note: You must do
all the problems, even the non-graded ones. If you do not do some of them, half as many points
as they are worth will be subtracted from your score on the graded problems.
Problem 1. (25 points)
Consider a computer with four processors P1, P2,
P3, and P4 and six memory modules M1, M2 …. M6.
The four processors can be configured as an MIMD
machine or an SIMD machine. The processors
execute six instructions with precedence constraints
illustrated in the diagram at the right. Each time a
processor executes an instruction, it references one
memory module, as given in the table below. It
references this memory module for three consecutive
memory cycles (to fetch the two operands and store
the result). In SIMD mode, instruction i must be
completely finished (by all processors) before
instruction i+1 can begin. In MIMD mode, each
processor can work on its instructions independently of
the other processors, but it must still obey the
precedence constraints for its own instructions. Also,
it cannot execute more than one instruction at a time.
I
I
1
I
2
I
5
I
3
I
4
6
Instructions
Processors
P1
P2
P3
P4
I1
M2
M4
M3
M2
I2
M1
M3
M2
M6
I3
M4
M6
M5
M6
I4
M3
M4
M4
M2
I5
M2
M2
M2
M1
I6
M1
M5
M5
M1
(a) What is the average memory bandwidth (words/cycle) for this program in MIMD mode?
(b) What is the average memory bandwidth (words/cycle) for this program in SIMD mode?
Problem 2. (25 points) Sometimes seemingly trivial problems are not well adapted to an SIMD
processor when they are coded in a straightforward fashion. Consider the array processor
discussed in class.
–1–
(a) Write an assembly-language program for it to find the transpose of an N N matrix. That is,
given an N N matrix A, find B such that B = AT. Comment your program. In addition to the
instructions presented in class, you may also use the following.
RRL
RRR
i, j
i, j
Rotate index register i circularly left by j bits.
Rotate index register i circularly right by j bits.
…
A [0, N –1]
A [1, N –1]
A [1, 0]
…
A [1, N –2]
A [2, N –2]
A [2, N –1]
…
A [2, N –3]
…
A [ N –1, 0]
A [N –1, 1]
A [N –1, 2]
…
A [0, 1]
…
A [0, 0]
…
(b) How could your program be improved if
skewed storage were used for the array A ?
That is, assume that the elements of A
were stored as shown at the right. You
need not actually code the program; just
explain how it would be better than the
program for part (a). Assume that each
ALU has its own private index registers, in
addition to those in the control unit.
Problem 3. (25 points) Consider the simple problem of “triangulating” a polygon. To study the
properties of a polygonal surface that depend on each point on that surface, the surface is divided
into n small triangular regions. Each triangular region is an aggregation of points represented by a
single point in the triangle. The complex equations are solved on each such representative point,
and the resulting values at each point are summed to determine the current state of the surface at
that instant. Assume that it takes 1 unit of time per point to solve the equations and 1 unit of time
per point to calculate the sum.
(a) What would be the total cost of computation in a sequential program for this application?
(b) If this application is decomposed into a 2-phase program, describe each phase. What would
be the cost of computation for the serial phase and the parallel phase, total cost of computation
and the maximum achievable speedup taking into consideration the number of processors,
regardless of the processors?
(c) If the application is instead decomposed into a 3-phase program, what would be the answer to
the above question? Instead of determining the maximum achievable speedup regardless of the
processors, please explain what happens to the speedup as the number of processors increases,
assuming n >> p.
Problem 4. (10 points) Which of our case-study applications (Ocean, Barnes-Hut, Raytrace, and
Data Mining) do you think are amenable to decomposing data rather than computation, and using
an owner-computes rule in parallelization? What do you think the problem(s) would be with using
a strict data-distribution and owner-computes rule in the others?
Problem 5. (25 points) A parallel computation on an n-processor system can be characterized
by a pair P(n), T(n), where P(n) is the total number of instructions executed by all the
processors and T(n) is the elapsed execution time for the entire system (measured in number of
–2–
instructions). (You may assume that all instructions take the same amount of time.) P(n), for
n > 1, may be greater than P(1) because some of the processors have to do extra “redundant”
work to synchronize or avoid excessive memory contention. However, assume that P(n) is never
less than P(1).
In a serial computation, all instructions are performed by a single processor, so P (1) = T (1).
Usually, for n > 1, T(n) < P(n) because the computation will finish faster on a multiprocessor.
Lee (1980) has suggested five performance indices for comparing a parallel computation with a
serial computation.
T(1)
S(n) = T(n)
(The speedup)
T(1)
E(n) = nT(n)
(The efficiency )
P(n)
R(n) = P(1)
(The redundancy)
P(n)
U(n) = n T(n)
(The utilization)
Q (n) =
T 3(1)
n T 2(n) P(n)
(The quality)
Note: T 2(n ) = T (n ).T (n )
(a) Prove that the following relationships hold in all possible comparisons of parallel to serial
computations:
(1) 1 ≤ S (n ) ≤ n
(3) U (n ) = R (n ).E (n )
(2) E (n ) = S (n )
n
(4) Q (n ) = S (n ).E (n )
R (n )
(b) Based on the above definitions and relationships, give physical meanings of these
performance indices.
–3–
Download