
1.
Consider an architecture of n processors partitioned into two disjoint subsets, A
and B, each with n/2 processors. Further, assume that each processor in A is
joined to each processor in B, but that no two processors both in A, or both in B,
are joined. See Figure 5-28 for an example.
a)
Can fundamental operations be executed on this architecture faster than on
the star-shaped architecture described above? For example, devise an
efficient parallel algorithm for computing a semigroup operation
x_1 ⊗ x_2 ⊗ ⋯ ⊗ x_n, where x_i is stored in processor P_i, on this
architecture, and analyze its running time.
Solution: For purposes of algorithms and analysis, this architecture may be
thought of as similar to an EREW PRAM, in the sense that many pairs of
processors can simultaneously communicate Θ(1) data per pair in Θ(1) time.
Thus we should expect that fundamental operations can be executed on this
architecture faster than on the star-shaped architecture. We illustrate this assertion
with a fast algorithm for a semigroup computation.
Without loss of generality, we assume A is the set of processors with odd indices
and B is the set of processors with even indices. For simplicity, we will also
assume that n = 2^k for some positive integer k.

In parallel, each processor P_i initializes r_i ← x_i. This takes Θ(1) time.
For i = 1 to k, do
   In parallel, each processor P_{2j}, j ∈ {1, 2, …, 2^{k−i}}, sends r_{2j} to processor P_{2j−1}.
   In parallel, each processor P_{2j−1}, j ∈ {1, 2, …, 2^{k−i}}, computes r_{2j−1} ← r_{2j−1} ⊗ r_{2j}.
   In parallel, each processor P_{2j−1}, j ∈ {1, 2, …, 2^{k−i}}, sends r_{2j−1} to processor P_j. (When P_{2j−1} and P_j are both in A, this transfer is routed through a processor of B; since the communication diameter of the architecture is 2, it still takes Θ(1) time.)
   In parallel, each processor P_j, j ∈ {1, 2, …, 2^{k−i}}, sets r_j ← r_{2j−1}.

End For {Each pass of the loop takes Θ(1) time, so the totality of the For loop takes Θ(k) = Θ(log n) time.}
If desired, the result of the semigroup computation (the final value of r_1, held in P_1) can
be broadcast to all other processors by a recursive doubling process (mirroring the
recursive halving process used above) that takes Θ(log n) time.
Thus, this algorithm requires Θ(log n) time. Notice that the total work is Θ(n log n),
which is not optimal, since a sequential semigroup computation requires Θ(n) work. By
modifying the algorithm in a fashion similar to that used to obtain an optimal-work
PRAM algorithm for a semigroup computation using a reduced number of processors, we
can obtain an optimal Θ(n)-work algorithm for this architecture, using Θ(n / log n)
processors with Θ(log n) data per processor, in Θ(log n) time. As in the case of the
PRAM, this is done by having each processor perform a semigroup computation on its
data, then applying the algorithm above to the results (one per processor) of the local
semigroup computations.
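The data movement above is easy to check with a small simulation. The following is a minimal Python sketch (the function name semigroup_bipartite is ours, and ordinary addition stands in for the generic semigroup operation ⊗); it serializes the "in parallel" steps but performs exactly the combine-and-compact pattern described above.

```python
# Minimal sketch: simulate the recursive-halving semigroup computation on
# n = 2^k values, with addition standing in for the semigroup operation (x).
# r[i] models register r_i held by processor P_i (1-indexed).

def semigroup_bipartite(x, op=lambda a, b: a + b):
    n = len(x)
    k = n.bit_length() - 1                     # assumes n = 2^k
    r = {i + 1: x[i] for i in range(n)}        # P_i initializes r_i <- x_i
    for step in range(1, k + 1):
        m = 2 ** (k - step)                    # number of active pairs this pass
        for j in range(1, m + 1):              # "in parallel" over j; serialized here
            r[2 * j - 1] = op(r[2 * j - 1], r[2 * j])   # P_{2j-1}: r_{2j-1} <- r_{2j-1} (x) r_{2j}
            r[j] = r[2 * j - 1]                # compact the pair result into P_j
    return r[1]                                # final value of r_1, held in P_1

if __name__ == "__main__":
    data = list(range(1, 17))                  # n = 16 = 2^4
    assert semigroup_bipartite(data) == sum(data)
```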
b)
What is the bisection width of this architecture? What does this imply
about the practicality of this architecture?
Solution: Since each of the n/2 processors in A is connected to each of the n/2
processors in B, any balanced split of the processors cuts Θ(n²) wires. The bisection
width is n²/8, achieved by placing half of A and half of B on each side of the cut, so
that each processor in A cuts the wires joining it to half of the processors in B. This
is too many wires to be practical to build. However, from the theoretical standpoint of
analysis of algorithms, it is highly desirable, as it makes it likely that algorithms
requiring a great deal of data movement can be devised to run quickly.
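For small n, the bisection width of this architecture (the complete bipartite graph K_{n/2, n/2}) can be checked by brute force. The sketch below (function name ours) confirms that the minimum balanced cut has n²/8 wires when n is divisible by 4.

```python
# Sketch: brute-force check of the bisection width of this architecture
# (the complete bipartite graph K_{n/2, n/2}) for small n; it confirms that
# the minimum balanced cut has n^2/8 wires (assumes n divisible by 4).
from itertools import combinations

def bisection_width_complete_bipartite(n):
    A = set(range(n // 2))                        # processors of A
    B = set(range(n // 2, n))                     # processors of B
    best = None
    for side1 in combinations(range(n), n // 2):  # every balanced split of the processors
        s1 = set(side1)
        s2 = set(range(n)) - s1
        # wires exist only between A and B; count those crossing the split
        cut = len(A & s1) * len(B & s2) + len(A & s2) * len(B & s1)
        best = cut if best is None else min(best, cut)
    return best

if __name__ == "__main__":
    for n in (4, 8, 12):
        assert bisection_width_complete_bipartite(n) == n * n // 8
```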
2.
Define an X-tree to be a tree machine in which neighboring nodes on a level are
connected. That is, each interior node has 2 additional links, one to each of its left
and right neighbors. Nodes on the outer edge of the tree (with the exception of
the root) have one additional link, to the neighboring node on their level.
a)
What is the communication diameter of an X-tree? Explain.
Solution: The additional links do not change the Θ(log n) communication
diameter of an “ordinary” tree machine. Roughly, this is because there may be a
linear number of nodes on a level (a tree with n = 2^k − 1 nodes has
2^{k−1} = (n + 1)/2 nodes at its leaf level), so traveling across a level near the
leaves would take Θ(n) steps; hence a shortest path between maximally separated nodes
still runs up and down the tree and uses only O(1) of the extra links of the X-tree.
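If desired, the Θ(log n) diameter claim can be checked numerically. The rough Python sketch below builds an X-tree on n = 2^k − 1 nodes in heap order (our assumed encoding of the structure described above; function names are ours) and computes the diameter by breadth-first search; the diameter grows linearly in k.

```python
# Sketch: numerical check of the X-tree diameter. Nodes are 1 .. 2^k - 1 in heap
# order; tree edges join v to 2v and 2v + 1, and an extra edge joins horizontally
# adjacent nodes on each level. BFS from every node gives the diameter, which
# grows linearly in k, i.e., it is Θ(log n).
from collections import deque

def xtree_adjacency(k):
    n = 2 ** k - 1
    adj = {v: [] for v in range(1, n + 1)}
    def join(a, b):
        adj[a].append(b)
        adj[b].append(a)
    for v in range(1, n + 1):
        if 2 * v <= n:
            join(v, 2 * v)                      # left child
            join(v, 2 * v + 1)                  # right child
        same_level = ((v + 1) & v) != 0         # v + 1 is not a power of two
        if same_level and v + 1 <= n:
            join(v, v + 1)                      # horizontal neighbor on this level
    return n, adj

def diameter(n, adj):
    best = 0
    for s in range(1, n + 1):                   # BFS from every node
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

if __name__ == "__main__":
    for k in range(2, 11):
        n, adj = xtree_adjacency(k)
        print(k, n, diameter(n, adj))
```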
b)
What is the bisection width of an X-tree? Explain.
Solution: An X-tree of n nodes has depth Θ(log n). If, on each level, we
cut the wire between the middle pair of nodes, and also cut the link from the root to
one of its children, we split the X-tree into two halves whose sizes differ by at most
one. Since this cut uses O(1) wires per level, the bisection width is Θ(log n).
c)
Give a lower bound on sorting for the X-tree. Explain.
Solution: Suppose the input to the problem is such that every data item is initially in the
wrong half of the X-tree. Then Θ(n) data items have to cross a bisection of
width Θ(log n). In one time step, only Θ(log n) data values can cross from the
“wrong half” to the “right half” of the X-tree. Therefore, it takes Ω(n / log n)
parallel time steps for all the data to cross into the correct half of the X-tree.
Hence sorting on the X-tree, in the worst case, requires Ω(n / log n) time.
3.
Suppose that you have a CREW PRAM algorithm to solve problem A in Θ(t(n)) time. If
you now consider a solution to this problem on an EREW PRAM, how does the CREW
PRAM algorithm help you in determining a lower bound on the running time to solve this
problem on an EREW PRAM?
Solution: An EREW PRAM may require modification of the algorithm in order to avoid
concurrent reads, which are forbidden to it but permitted to the CREW PRAM. For example, a Θ(1)
time CR operation may have to be replaced on the ER PRAM by a Θ(log n) time
broadcast operation in order to allow the ER machine to simulate a concurrent read.
Generally, then, an EREW PRAM takes at least as much time as a CREW PRAM to
solve a given problem, since any EREW algorithm runs without change on a CREW PRAM.
Thus, Ω(t(n)) time is needed by the EREW PRAM to solve
problem A. Notice also that the simulation gives an upper bound of O(t(n) log n) time.
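For concreteness, the following sketch (function name ours) models the Θ(log n) recursive-doubling broadcast that an EREW PRAM can use in place of a Θ(1) concurrent read: in each round, every processor that already holds the value copies it to one processor that does not, so the number of holders doubles.

```python
# Minimal sketch: replacing a concurrent read with an EREW-style broadcast.
# In each round the number of processors holding the value doubles, so the
# broadcast finishes in ceil(log2(n)) rounds.
import math

def erew_broadcast_rounds(n, value):
    mem = [None] * n
    mem[0] = value                      # P_0 starts with the value all processors need
    rounds = 0
    holders = 1
    while holders < n:
        # processors 0 .. holders-1 each send to one processor that lacks the value;
        # every cell is read and written by exactly one processor (EREW-legal)
        for src in range(min(holders, n - holders)):
            mem[holders + src] = mem[src]
        holders = min(2 * holders, n)
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (2, 8, 1000):
        assert erew_broadcast_rounds(n, 42) == math.ceil(math.log2(n))
```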
4.
Define a linear array of size n with a bus to be a 1-dimensional mesh of size n augmented
with a single global bus. Every processor is connected to the bus, and in each unit of
time, one processor can write to the bus and all processors can read from the bus (i.e., the
bus is a CREW bus).
a)
Give an efficient algorithm to sum n values, initially distributed one per
processor. Discuss the time and cost of your algorithm.
Solution: In order to improve upon the Θ(n) time with which a linear array without a
bus would solve this problem, we cannot just have each processor use the bus to send its
data to one processor responsible for accumulating the total, as then the receiving
processor would be a serial bottleneck. We present a solution that uses segments of
consecutive processors in the linear array.
For simplicity’s sake, we assume n^{1/2} is an integer. The steps of the algorithm are as
follows.
•	In parallel, the k-th segment of processors, indexed (k − 1)n^{1/2} + 1, …, kn^{1/2}, for each
	k ∈ {1, …, n^{1/2}}, uses the linear array algorithm to compute the segment’s partial sum
	in its last processor (the processor indexed kn^{1/2}). Since the segments operate in
	parallel, this takes time proportional to the length of a segment, Θ(n^{1/2}).
•	P_n initializes total to the partial sum it holds, in Θ(1) time.
•	For k = 1 to n^{1/2} − 1, do
		P_{kn^{1/2}} writes its partial sum to the bus.
		P_n reads temp from the bus.
		P_n updates total ← total + temp.
	End For. This loop takes Θ(n^{1/2}) time.
•	P_n writes total to the bus in Θ(1) time.
•	In parallel, every processor reads total from the bus in Θ(1) time.
The algorithm clearly takes Θ(n^{1/2}) time; since it uses n processors, its cost is Θ(n^{3/2}).
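A small sequential simulation of this algorithm (names ours; the time counter models parallel steps under the assumption that segments work concurrently and the bus carries one write per step):

```python
import math

def bus_sum(x):
    """Simulate the two-phase sum on a linear array of n = s*s processors with a bus.

    Returns (total, parallel_time): s steps for the in-segment sums (segments run
    concurrently), s - 1 bus steps to combine partial sums in P_n, and 2 steps to
    broadcast the result.
    """
    n = len(x)
    s = int(math.isqrt(n))
    assert s * s == n, "sketch assumes n^(1/2) is an integer"

    # Phase 1: segment k holds x[(k-1)s : ks]; its partial sum ends up in its
    # last processor. All segments work in parallel: s parallel steps.
    partial = [sum(x[k * s:(k + 1) * s]) for k in range(s)]
    time = s

    # Phase 2: P_n accumulates the other s - 1 partial sums, one bus write per step.
    total = partial[-1]
    for k in range(s - 1):
        total += partial[k]
        time += 1

    # Phase 3: P_n writes total to the bus; all processors read it.
    time += 2
    return total, time

if __name__ == "__main__":
    data = list(range(1, 26))            # n = 25, s = 5
    total, t = bus_sum(data)
    assert total == sum(data)
    print(total, t)                      # time grows like Θ(n^(1/2))
```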
b)
Give an efficient algorithm to compute the parallel prefix of n values, initially
distributed one per processor. Discuss the time and cost of your algorithm.
Solution: Our solution shares the philosophy of the previous algorithm in computing the
parallel prefix values within segments of n^{1/2} consecutive processors, and using the bus
to distribute the last prefix of each segment to all processors in the next segment so that
these processors can compute their global prefix values. The steps of the algorithm are as
follows.
•	In Θ(1) time, each processor determines from its index whether or not it is the last
	processor in its segment, according to whether or not its index is a multiple of n^{1/2}.
•	In parallel, each segment of processors uses the linear array algorithm to compute the
	parallel prefix of its data values. Since the segments have length Θ(n^{1/2}), this takes
	Θ(n^{1/2}) time. Let p_i be the prefix value stored in P_i.
•	For factor = 1 to n^{1/2} − 1, all processors P_i do in parallel
	-	If i = factor · n^{1/2} {i.e., if P_i is the last processor in its segment}, write
		(p_i, factor) to the bus.
	-	Read (x, factor) from the bus.
	-	If factor · n^{1/2} < i ≤ (factor + 1) · n^{1/2} {i.e., if P_i is in the segment following the
		segment whose last processor just wrote to the bus}, then p_i ← x ⊗ p_i.
	End parallel, End For. This step takes Θ(n^{1/2}) time.
Thus, the algorithm takes Θ(n^{1/2}) time; since it uses n processors, its cost is Θ(n^{3/2}).
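As with part (a), a small simulation of the segmented prefix algorithm (names ours; running sums stand in for the generic prefix operation ⊗):

```python
import math

def bus_prefix(x):
    """Simulate segmented parallel prefix (here: running sums) on a linear
    array of n = s*s processors with a CREW bus; returns (prefixes, parallel_time)."""
    n = len(x)
    s = int(math.isqrt(n))
    assert s * s == n, "sketch assumes n^(1/2) is an integer"

    # Phase 1: each segment computes its local prefixes; segments run in parallel,
    # so this costs s parallel steps.
    p = list(x)
    for k in range(s):
        for i in range(k * s + 1, (k + 1) * s):
            p[i] = p[i - 1] + p[i]
    time = s

    # Phase 2: for factor = 1 .. s-1, the last processor of segment `factor`
    # broadcasts its (by now global) prefix; the next segment folds it in.
    for factor in range(1, s):
        bus_value = p[factor * s - 1]        # written by P_{factor * n^(1/2)}
        for i in range(factor * s, (factor + 1) * s):
            p[i] = bus_value + p[i]          # next segment's prefixes become global
        time += 1
    return p, time

if __name__ == "__main__":
    data = list(range(1, 17))                # n = 16, s = 4
    prefixes, t = bus_prefix(data)
    assert prefixes == [sum(data[:i + 1]) for i in range(len(data))]
    print(t)                                 # Θ(n^(1/2)) parallel steps
```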