CSC/ECE 506: Architecture of Parallel Computers

Problem Set 2
Due Wednesday, February 8, 2012
Problems 2, 3, and 5 will be graded. There are 60 points on these problems. Note: You must do
all the problems, even the non-graded ones. If you do not do some of them, half as many points
as they are worth will be subtracted from your score on graded problems.
Problem 1. (16 points; 4 points each) Assume we have a shared-address-space system built on
top of a message-passing multiprocessor, and whenever a page is referenced that is not in local
memory, it is transferred across the network. Assume:
(1) A page consists of 512 thirty-two-bit words.
(2) The bus is 64 bits wide, and runs at 200 MHz.
(3) Before anything can be transferred across the bus, we must wait 2 cycles to arbitrate for the
bus, plus the memory-access time of 30 ns.
(a) When a nonlocal page is referenced, how long does it take before the first word reaches the
requesting processor?
(b) How much additional time does it take for the page to be transferred?
(c) How long does it take, in total, to transfer the entire page?
(d) What is the effective bandwidth of the transfer in megabytes/sec.?
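For checking the arithmetic in Problem 1, the quantities in parts (a) through (d) can be tabulated with a small C sketch. The struct BusModel and every function name below are hypothetical. The sketch also makes assumptions the problem does not state: arbitration and memory access are paid once per page, the first word arrives one bus cycle after they complete, the remaining transfers stream one per cycle, and 1 MB = 10^6 bytes. A different reading of the problem changes the numbers.

```c
#include <assert.h>

/* Hypothetical parameter bundle for the transfer model in Problem 1. */
typedef struct {
    double bus_mhz;        /* bus clock in MHz */
    int    bus_bits;       /* bus width in bits */
    int    arb_cycles;     /* arbitration delay in bus cycles */
    double mem_access_ns;  /* memory-access time in ns */
    int    page_words;     /* words per page */
    int    word_bits;      /* bits per word */
} BusModel;

/* Length of one bus cycle in ns: 1000 / f, with f in MHz. */
double cycle_ns(const BusModel *m) {
    assert(m->bus_mhz > 0.0);
    return 1000.0 / m->bus_mhz;
}

/* Number of bus transfers needed to move one page (rounded up). */
int page_transfers(const BusModel *m) {
    assert(m->bus_bits > 0);
    long bits = (long)m->page_words * m->word_bits;
    return (int)((bits + m->bus_bits - 1) / m->bus_bits);
}

/* ASSUMPTION: the first word arrives one bus cycle after
   arbitration and the memory access have both completed. */
double first_word_ns(const BusModel *m) {
    return m->arb_cycles * cycle_ns(m) + m->mem_access_ns + cycle_ns(m);
}

/* ASSUMPTION: the remaining transfers follow back to back, one per cycle. */
double remaining_ns(const BusModel *m) {
    return (page_transfers(m) - 1) * cycle_ns(m);
}

/* Total time to move the page under the two assumptions above. */
double total_ns(const BusModel *m) {
    return first_word_ns(m) + remaining_ns(m);
}

/* Effective bandwidth in MB/s, ASSUMING 1 MB = 10^6 bytes.
   bytes per ns equals 10^9 bytes/s, hence the factor of 1000. */
double effective_mb_per_s(const BusModel *m) {
    double bytes = (double)m->page_words * m->word_bits / 8.0;
    return bytes / total_ns(m) * 1000.0;
}
```

To use it, fill a BusModel with the values from (1) through (3), for example `BusModel m = {200.0, 64, 2, 30.0, 512, 32};`, and call the functions on `&m`.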
Problem 2. (15 points) [Solihin 3.3] Code analysis. For the code shown below,
for (i=1; i<=N; i++) {
  for (j=2; j<=N; j++) { // note the index range
    S1: a[i][j] = a[i][j-1] + a[i][j-2];
    S2: a[i+1][j] = a[i][j] * b[i-1][j];
    S3: b[i][j] = a[i][j];
  }
}
(a) Draw the iteration-space traversal graph (ITG).
(b) List all the dependences, and clearly indicate which dependences are loop-independent and which are loop-carried.
(c) Draw the loop-carried dependence graph (LDG).
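To experiment with the loop nest from Problem 2, it may help to wrap it in a small runnable harness. The sketch below is only that: NMAX, Grid, fill_grid, and run_nest are hypothetical names, and the array dimensions are an assumption chosen so that the subscripts a[i+1][j] (at i = N) and b[i-1][j] (at i = 1) stay in bounds. It does not perform any dependence analysis.

```c
#include <assert.h>

#define NMAX 16  /* hypothetical upper bound on N for this harness */

/* (NMAX+2) rows cover indices 0..N+1; (NMAX+1) columns cover 0..N. */
typedef double Grid[NMAX + 2][NMAX + 1];

/* Set every element of g to v. */
void fill_grid(Grid g, double v) {
    for (int i = 0; i < NMAX + 2; i++)
        for (int j = 0; j < NMAX + 1; j++)
            g[i][j] = v;
}

/* The loop nest from Problem 2, executed in its original sequential order. */
void run_nest(int N, Grid a, Grid b) {
    assert(N >= 2 && N <= NMAX);
    for (int i = 1; i <= N; i++) {
        for (int j = 2; j <= N; j++) {
            a[i][j]     = a[i][j - 1] + a[i][j - 2];  /* S1 */
            a[i + 1][j] = a[i][j] * b[i - 1][j];      /* S2 */
            b[i][j]     = a[i][j];                    /* S3 */
        }
    }
}
```

Running the nest on small inputs and comparing against a reordered version is one way to test whether a proposed reordering respects the dependences you listed.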
 
Problem 3. (20 points) A radix-2 FFT over n complex numbers is implemented as a sequence of
log n completely parallel steps, requiring 5n log n floating-point operations while reading and
writing each element of data log n times. If this data is spread over the memories in a NUMA
design using either a cyclic or block distribution, then log(n/p) of the steps will access data in the
local memory, assuming both n and p are powers of two, and in the remaining log p steps half
of the reads and half of the writes will be local. (The choice of layout determines which steps are
local and which are remote, but the ratio stays the same.) Calculate the communication-to-computation
ratio on a distributed-memory design where each processor has a local memory, as
in Figure 7.2. What communication bandwidth would the network need to sustain for the machine
to deliver 250 · p MFLOPS on p processors?
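The bookkeeping in Problem 3 can be sketched in C as a way to sanity-check a hand derivation. All function names are hypothetical. The sketch takes one reading of the problem: in each of the log p non-local steps every element is read once and written once, and half of each is remote, giving one remote access per element per step. The element size in bytes is not given in the problem, so it is a parameter (16 would correspond to double-precision complex, which is an assumption), and 1 MB is taken as 10^6 bytes.

```c
#include <assert.h>

/* Integer log base 2 of a power of two. */
int ilog2(unsigned long x) {
    assert(x > 0 && (x & (x - 1)) == 0);  /* must be a power of two */
    int k = 0;
    while (x > 1) { x >>= 1; k++; }
    return k;
}

/* Number of fully local steps: log(n/p) = log n - log p. */
int local_steps(unsigned long n, unsigned long p) {
    assert(p <= n);
    return ilog2(n) - ilog2(p);
}

/* Remote element accesses per floating-point operation.
   Remote accesses: n elements * log p steps * (0.5 read + 0.5 write).
   Floating-point operations: 5 n log n.
   The factor n cancels, leaving log p / (5 log n). */
double remote_elems_per_flop(unsigned long n, unsigned long p) {
    assert(n > 1 && p <= n);
    return (double)ilog2(p) / (5.0 * ilog2(n));
}

/* Per-processor network bandwidth in MB/s needed to sustain
   mflops_per_proc MFLOPS on each processor.
   ASSUMPTION: bytes_per_elem is supplied by the caller. */
double mb_per_s_per_proc(unsigned long n, unsigned long p,
                         double mflops_per_proc, double bytes_per_elem) {
    return mflops_per_proc * remote_elems_per_flop(n, p) * bytes_per_elem;
}
```

For example, `mb_per_s_per_proc(1024, 4, 250.0, 16.0)` evaluates the model for n = 1024 and p = 4; multiplying by p gives the aggregate figure under the same assumptions.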
Problem 4. (24 points) Consider the simple problem of “triangulating” a polygon. To study the
properties of a polygonal surface that depend on each point on that surface, the surface is divided
into n small triangular regions. Each triangular region is an aggregation of points represented by a
single point in the triangle. The complex equations are solved on each such representative point,
and the resulting values at each point are summed to determine the current state of the surface at
that instant. Assume that it takes 1 unit of time per point to solve the equations and 1 unit of time
per point to calculate the sum.
(a) What would be the total cost of computation in a sequential program for this application?
(b) If this application is decomposed into a 2-phase program, describe each phase. What would
be the cost of computation for the serial phase and the parallel phase, the total cost of computation,
the speedup as a function of the number of processors, and the maximum achievable speedup
regardless of the number of processors?
(c) If the application is instead decomposed into a 3-phase program, what would be the answer
to the above question? Instead of determining the maximum achievable speedup regardless of
the number of processors, please explain what happens to the speedup as the number of
processors increases, assuming n >> p.
(d) Is the “concurrency profile” regular or irregular for the above problem?
(e) Is this a case of “owner computes” and why?
(f) Which of the 4 steps in parallelization is used to ensure consistency and coherence? Is the
partitioning architecture dependent?
(g) For maximum efficiency, how should the n tasks be divided among the p processors, i.e., how
many tasks should each processor be responsible for? Which phase is responsible for this?
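The cost model behind parts (a) through (c) of Problem 4 can be explored numerically with a short C sketch. The function names are hypothetical, and the phase breakdown is one common reading rather than the only possible one: the 2-phase version solves in parallel and then sums serially, while the 3-phase version solves in parallel, forms per-processor partial sums in parallel, and then combines the p partial sums serially. Communication and synchronization costs are ignored.

```c
#include <assert.h>

/* Sequential cost: 1 unit per point to solve plus 1 unit per point to sum. */
double sequential_cost(double n) {
    return 2.0 * n;
}

/* ASSUMED 2-phase decomposition:
   phase 1 (parallel): solve, n/p time
   phase 2 (serial):   sum all n values, n time */
double speedup_two_phase(double n, double p) {
    assert(n > 0.0 && p >= 1.0);
    return sequential_cost(n) / (n / p + n);
}

/* ASSUMED 3-phase decomposition:
   phase 1 (parallel): solve, n/p time
   phase 2 (parallel): per-processor partial sums, n/p time
   phase 3 (serial):   combine the p partial sums, p time */
double speedup_three_phase(double n, double p) {
    assert(n > 0.0 && p >= 1.0);
    return sequential_cost(n) / (2.0 * n / p + p);
}
```

Evaluating both functions for a fixed n while increasing p shows how the serial phase limits each version.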
Problem 5. Ocean parallelization. (25 points) [Solihin 3.10] Given the code for a single sweep
in Ocean (without the diff computation) —
for (i=1; i<=n; i++) {
  for (j=1; j<=n; j++) {
    A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i][j+1]
                     + A[i-1][j] + A[i+1][j]);
  }
}
(a) Rewrite the code so that in the new loop nest, each iteration of one of the loops is a parallel
task. Exploit parallelism that exists on each anti-diagonal. Then, highlight the loop that is parallel
in your new code.
(b) Rewrite the code to exploit parallelism using red-black partitioning. Identify the parallel loops.
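When rewriting the sweep for Problem 5, a sequential reference version is useful for comparison. The sketch below is an assumption-laden harness, not a solution: NMAX, OceanGrid, sweep_reference, and max_abs_diff are hypothetical names, and the grid is assumed to carry one layer of boundary cells on each side so that the subscripts i-1, i+1, j-1, and j+1 stay in bounds. A rewritten sweep that preserves the original dependences should match the reference to within rounding. A red-black sweep changes the update order, so its result will generally differ from the reference after a single sweep.

```c
#include <assert.h>

#define NMAX 8  /* hypothetical upper bound on n for this harness */

/* Interior points are 1..n; rows and columns 0 and n+1 hold boundary values. */
typedef double OceanGrid[NMAX + 2][NMAX + 2];

/* One sweep in the original row-major order, updating A in place. */
void sweep_reference(int n, OceanGrid A) {
    assert(n >= 1 && n <= NMAX);
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            A[i][j] = 0.2 * (A[i][j] + A[i][j - 1] + A[i][j + 1]
                             + A[i - 1][j] + A[i + 1][j]);
        }
    }
}

/* Largest absolute difference between two grids over the interior points. */
double max_abs_diff(int n, OceanGrid X, OceanGrid Y) {
    assert(n >= 1 && n <= NMAX);
    double worst = 0.0;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            double d = X[i][j] - Y[i][j];
            if (d < 0.0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

A typical check copies the initial grid, runs sweep_reference on one copy and the rewritten sweep on the other, and then inspects max_abs_diff.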