
1.
Consider an architecture of n processors partitioned into two disjoint subsets, A
and B, each with n/2 processors. Further, assume that each processor in A is
joined to each processor in B, but that no two processors both in A, or both in B,
are joined. See Figure 5-28 for an example.
a)
Can fundamental operations be executed on this architecture faster than on
the star-shaped architecture described above? For example, devise an
efficient parallel algorithm for computing a semigroup operation
x_1 ⊗ x_2 ⊗ ⋯ ⊗ x_n, where x_i is stored in processor P_i, on this
architecture, and analyze its running time.
Solution: For purposes of algorithms and analysis, this architecture may be
thought of as similar to an EREW PRAM, in the sense that many pairs of
processors can simultaneously communicate Θ(1) data per pair in Θ(1) time.
Thus we should expect that fundamental operations can be executed on this
architecture faster than on the star-shaped architecture. We illustrate this assertion
with a fast algorithm for a semigroup computation.
Without loss of generality, we assume A is the set of processors with odd indices
and B is the set of processors with even indices. For simplicity, we will also
assume that n = 2^k for some positive integer k.

In parallel, each processor P_i initializes r_i ← x_i. This takes Θ(1) time.
For i = 1 to k, do
   In parallel, each processor P_{2j}, j ∈ {1, 2, …, 2^{k−i}}, sends r_{2j} to processor P_{2j−1}.
   In parallel, each processor P_{2j−1}, j ∈ {1, 2, …, 2^{k−i}}, computes r_{2j−1} ← r_{2j−1} ⊗ r_{2j}.
   In parallel, each processor P_{2j−1}, j ∈ {1, 2, …, 2^{k−i}}, sends r_{2j−1} to processor P_j. (When P_{2j−1} and P_j are both in A, this transfer is routed through a processor of B; since the communication diameter of the architecture is 2, it still takes Θ(1) time.)
   In parallel, each processor P_j, j ∈ {1, 2, …, 2^{k−i}}, sets r_j ← r_{2j−1}.

End For {Each pass of the loop takes Θ(1) time, so the totality of the For loop takes Θ(k) = Θ(log n) time.}
If desired, the result of the semigroup computation (the final value of r_1, held in P_1) can
be broadcast to all other processors by a recursive doubling process (mirroring the
recursive halving process used above) that takes Θ(log n) time.
Thus, this algorithm requires Θ(log n) time. Notice that the total work is Θ(n log n),
which is not optimal, since a sequential semigroup computation requires Θ(n) work. By
modifying the algorithm in a fashion similar to that used to obtain an optimal-work
PRAM algorithm for a semigroup computation using a reduced number of processors, we
can obtain an optimal Θ(n)-work algorithm for this architecture, using Θ(n / log n)
processors with Θ(log n) data per processor, in Θ(log n) time. As in the case of the
PRAM, this is done by having each processor perform a semigroup computation on its
data, then applying the algorithm above to the results (one per processor) of the local
semigroup computations.
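The data movement above is easy to check with a small simulation. The following is a minimal Python sketch (the function name semigroup_bipartite is ours, and ordinary addition stands in for the generic semigroup operation ⊗); it serializes the "in parallel" steps but performs exactly the combine-and-compact pattern described above.

```python
# Minimal sketch: simulate the recursive-halving semigroup computation on
# n = 2^k values, with addition standing in for the semigroup operation (x).
# r[i] models register r_i held by processor P_i (1-indexed).

def semigroup_bipartite(x, op=lambda a, b: a + b):
    n = len(x)
    k = n.bit_length() - 1                     # assumes n = 2^k
    r = {i + 1: x[i] for i in range(n)}        # P_i initializes r_i <- x_i
    for step in range(1, k + 1):
        m = 2 ** (k - step)                    # number of active pairs this pass
        for j in range(1, m + 1):              # "in parallel" over j; serialized here
            r[2 * j - 1] = op(r[2 * j - 1], r[2 * j])   # P_{2j-1}: r_{2j-1} <- r_{2j-1} (x) r_{2j}
            r[j] = r[2 * j - 1]                # compact the pair result into P_j
    return r[1]                                # final value of r_1, held in P_1

if __name__ == "__main__":
    data = list(range(1, 17))                  # n = 16 = 2^4
    assert semigroup_bipartite(data) == sum(data)
```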
b)
What is the bisection width of this architecture? What does this imply
about the practicality of this architecture?
Solution: Since each of the n/2 processors in A is connected to each of the n/2
processors in B, any balanced split of the processors cuts Θ(n²) wires. The bisection
width is n²/8, achieved by placing half of A and half of B on each side of the cut, so
that each processor in A cuts the wires joining it to half of the processors in B. This
is too many wires to be practical to build. However, from the theoretical standpoint of
analysis of algorithms, it is highly desirable, as it makes it likely that algorithms
requiring a great deal of data movement can be devised to run quickly.
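For small n, the bisection width of this architecture (the complete bipartite graph K_{n/2, n/2}) can be checked by brute force. The sketch below (function name ours) confirms that the minimum balanced cut has n²/8 wires when n is divisible by 4.

```python
# Sketch: brute-force check of the bisection width of this architecture
# (the complete bipartite graph K_{n/2, n/2}) for small n; it confirms that
# the minimum balanced cut has n^2/8 wires (assumes n divisible by 4).
from itertools import combinations

def bisection_width_complete_bipartite(n):
    A = set(range(n // 2))                        # processors of A
    B = set(range(n // 2, n))                     # processors of B
    best = None
    for side1 in combinations(range(n), n // 2):  # every balanced split of the processors
        s1 = set(side1)
        s2 = set(range(n)) - s1
        # wires exist only between A and B; count those crossing the split
        cut = len(A & s1) * len(B & s2) + len(A & s2) * len(B & s1)
        best = cut if best is None else min(best, cut)
    return best

if __name__ == "__main__":
    for n in (4, 8, 12):
        assert bisection_width_complete_bipartite(n) == n * n // 8
```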
2.
Define an X-tree to be a tree machine in which neighboring nodes on a level are
connected. That is, each interior node has 2 additional links, one to each of its left
and right neighbors. Nodes on the outer edge of the tree (with the exception of
the root) have one additional link, to the neighboring node on their level.
a)
What is the communication diameter of an X-tree? Explain.
Solution: The additional links do not change the Θ(log n) communication
diameter of an “ordinary” tree machine. Roughly, this is because there may be a
linear number of nodes on a level (a tree with n = 2^k − 1 nodes has
2^{k−1} = (n + 1)/2 nodes at its leaf level), so traveling across a level near the
leaves would take Θ(n) steps; hence a shortest path between maximally separated nodes
still runs up and down the tree and uses only O(1) of the extra links of the X-tree.
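If desired, the Θ(log n) diameter claim can be checked numerically. The rough Python sketch below builds an X-tree on n = 2^k − 1 nodes in heap order (our assumed encoding of the structure described above; function names are ours) and computes the diameter by breadth-first search; the diameter grows linearly in k.

```python
# Sketch: numerical check of the X-tree diameter. Nodes are 1 .. 2^k - 1 in heap
# order; tree edges join v to 2v and 2v + 1, and an extra edge joins horizontally
# adjacent nodes on each level. BFS from every node gives the diameter, which
# grows linearly in k, i.e., it is Θ(log n).
from collections import deque

def xtree_adjacency(k):
    n = 2 ** k - 1
    adj = {v: [] for v in range(1, n + 1)}
    def join(a, b):
        adj[a].append(b)
        adj[b].append(a)
    for v in range(1, n + 1):
        if 2 * v <= n:
            join(v, 2 * v)                      # left child
            join(v, 2 * v + 1)                  # right child
        same_level = ((v + 1) & v) != 0         # v + 1 is not a power of two
        if same_level and v + 1 <= n:
            join(v, v + 1)                      # horizontal neighbor on this level
    return n, adj

def diameter(n, adj):
    best = 0
    for s in range(1, n + 1):                   # BFS from every node
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        best = max(best, max(dist.values()))
    return best

if __name__ == "__main__":
    for k in range(2, 11):
        n, adj = xtree_adjacency(k)
        print(k, n, diameter(n, adj))
```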
b)
What is the bisection width of an X-tree? Explain.
Solution: An X-tree of n nodes has depth Θ(log n). If, on each level, we
cut the wire between the middle pair of nodes, and also cut the link from the root to
one of its children, we split the X-tree into two halves whose sizes differ by at most
one. Since this cut uses O(1) wires per level, the bisection width is Θ(log n).
c)
Give a lower bound on sorting for the X-tree. Explain.
Solution: Suppose the input to the problem is such that every data item is initially in the
wrong half of the X-tree. Then Θ(n) data items have to cross a bisection of
width Θ(log n). In one time step, only Θ(log n) data values can cross from the
“wrong half” to the “right half” of the X-tree. Therefore, it takes Ω(n / log n)
parallel time steps for all the data to cross into the correct half of the X-tree.
Hence sorting on the X-tree, in the worst case, requires Ω(n / log n) time.
3.
Suppose that you have a CREW PRAM algorithm to solve problem A in Θ(t(n)) time. If
you now consider a solution to this problem on an EREW PRAM, how does the CREW
PRAM algorithm help you in determining a lower bound on the running time to solve this
problem on an EREW PRAM?
Solution: An EREW PRAM may require modification of the algorithm in order to avoid
concurrent reads, which are forbidden to it but permitted to the CREW PRAM. For example, a Θ(1)
time CR operation may have to be replaced on the ER PRAM by a Θ(log n) time
broadcast operation in order to allow the ER machine to simulate a concurrent read.
Generally, then, an EREW PRAM takes at least as much time as a CREW PRAM to
solve a given problem, since any EREW algorithm runs without change on a CREW PRAM.
Thus, Ω(t(n)) time is needed by the EREW PRAM to solve
problem A. Notice also that the simulation gives an upper bound of O(t(n) log n) time.
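For concreteness, the following sketch (function name ours) models the Θ(log n) recursive-doubling broadcast that an EREW PRAM can use in place of a Θ(1) concurrent read: in each round, every processor that already holds the value copies it to one processor that does not, so the number of holders doubles.

```python
# Minimal sketch: replacing a concurrent read with an EREW-style broadcast.
# In each round the number of processors holding the value doubles, so the
# broadcast finishes in ceil(log2(n)) rounds.
import math

def erew_broadcast_rounds(n, value):
    mem = [None] * n
    mem[0] = value                      # P_0 starts with the value all processors need
    rounds = 0
    holders = 1
    while holders < n:
        # processors 0 .. holders-1 each send to one processor that lacks the value;
        # every cell is read and written by exactly one processor (EREW-legal)
        for src in range(min(holders, n - holders)):
            mem[holders + src] = mem[src]
        holders = min(2 * holders, n)
        rounds += 1
    return rounds

if __name__ == "__main__":
    for n in (2, 8, 1000):
        assert erew_broadcast_rounds(n, 42) == math.ceil(math.log2(n))
```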
4.
Define a linear array of size n with a bus to be a 1-dimensional mesh of size n augmented
with a single global bus. Every processor is connected to the bus, and in each unit of
time, one processor can write to the bus and all processors can read from the bus (i.e., the
bus is a CREW bus).
a)
Give an efficient algorithm to sum n values, initially distributed one per
processor. Discuss the time and cost of your algorithm.
Solution: In order to improve upon the Θ(n) time with which a linear array without a
bus would solve this problem, we cannot just have each processor use the bus to send its
data to one processor responsible for accumulating the total, as then the receiving
processor would be a serial bottleneck. We present a solution that uses segments of
consecutive processors in the linear array.
For simplicity’s sake, we assume n^{1/2} is an integer. The steps of the algorithm are as
follows.
•	In parallel, the k-th segment of processors, indexed (k − 1)n^{1/2} + 1, …, kn^{1/2}, for each
	k ∈ {1, …, n^{1/2}}, uses the linear array algorithm to compute the segment’s partial sum
	in its last processor (the processor indexed kn^{1/2}). Since the segments operate in
	parallel, this takes time proportional to the length of a segment, Θ(n^{1/2}).
•	P_n initializes total to the partial sum it holds, in Θ(1) time.
•	For k = 1 to n^{1/2} − 1, do
		P_{kn^{1/2}} writes its partial sum to the bus.
		P_n reads temp from the bus.
		P_n updates total ← total + temp.
	End For. This loop takes Θ(n^{1/2}) time.
•	P_n writes total to the bus in Θ(1) time.
•	In parallel, every processor reads total from the bus in Θ(1) time.
The algorithm clearly takes Θ(n^{1/2}) time; since it uses n processors, its cost is Θ(n^{3/2}).
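A small sequential simulation of this algorithm (names ours; the time counter models parallel steps under the assumption that segments work concurrently and the bus carries one write per step):

```python
import math

def bus_sum(x):
    """Simulate the two-phase sum on a linear array of n = s*s processors with a bus.

    Returns (total, parallel_time): s steps for the in-segment sums (segments run
    concurrently), s - 1 bus steps to combine partial sums in P_n, and 2 steps to
    broadcast the result.
    """
    n = len(x)
    s = int(math.isqrt(n))
    assert s * s == n, "sketch assumes n^(1/2) is an integer"

    # Phase 1: segment k holds x[(k-1)s : ks]; its partial sum ends up in its
    # last processor. All segments work in parallel: s parallel steps.
    partial = [sum(x[k * s:(k + 1) * s]) for k in range(s)]
    time = s

    # Phase 2: P_n accumulates the other s - 1 partial sums, one bus write per step.
    total = partial[-1]
    for k in range(s - 1):
        total += partial[k]
        time += 1

    # Phase 3: P_n writes total to the bus; all processors read it.
    time += 2
    return total, time

if __name__ == "__main__":
    data = list(range(1, 26))            # n = 25, s = 5
    total, t = bus_sum(data)
    assert total == sum(data)
    print(total, t)                      # time grows like Θ(n^(1/2))
```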
b)
Give an efficient algorithm to compute the parallel prefix of n values, initially
distributed one per processor. Discuss the time and cost of your algorithm.
Solution: Our solution shares the philosophy of the previous algorithm in computing the
parallel prefix values within segments of n^{1/2} consecutive processors, and using the bus
to distribute the last prefix of each segment to all processors in the next segment so that
these processors can compute their global prefix values. The steps of the algorithm are as
follows.
•	In Θ(1) time, each processor determines from its index whether or not it is the last
	processor in its segment, according to whether or not its index is a multiple of n^{1/2}.
•	In parallel, each segment of processors uses the linear array algorithm to compute the
	parallel prefix of its data values. Since the segments have length Θ(n^{1/2}), this takes
	Θ(n^{1/2}) time. Let p_i be the prefix value stored in P_i.
•	For factor = 1 to n^{1/2} − 1, all processors P_i do in parallel
	-	If i = factor · n^{1/2} {i.e., if P_i is the last processor in its segment}, write
		(p_i, factor) to the bus.
	-	Read (x, factor) from the bus.
	-	If factor · n^{1/2} < i ≤ (factor + 1) · n^{1/2} {i.e., if P_i is in the segment following the
		segment whose last processor just wrote to the bus}, then p_i ← x ⊗ p_i.
	End parallel, End For. This step takes Θ(n^{1/2}) time.
Thus, the algorithm takes Θ(n^{1/2}) time; since it uses n processors, its cost is Θ(n^{3/2}).
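As with part (a), a small simulation of the segmented prefix algorithm (names ours; running sums stand in for the generic prefix operation ⊗):

```python
import math

def bus_prefix(x):
    """Simulate segmented parallel prefix (here: running sums) on a linear
    array of n = s*s processors with a CREW bus; returns (prefixes, parallel_time)."""
    n = len(x)
    s = int(math.isqrt(n))
    assert s * s == n, "sketch assumes n^(1/2) is an integer"

    # Phase 1: each segment computes its local prefixes; segments run in parallel,
    # so this costs s parallel steps.
    p = list(x)
    for k in range(s):
        for i in range(k * s + 1, (k + 1) * s):
            p[i] = p[i - 1] + p[i]
    time = s

    # Phase 2: for factor = 1 .. s-1, the last processor of segment `factor`
    # broadcasts its (by now global) prefix; the next segment folds it in.
    for factor in range(1, s):
        bus_value = p[factor * s - 1]        # written by P_{factor * n^(1/2)}
        for i in range(factor * s, (factor + 1) * s):
            p[i] = bus_value + p[i]          # next segment's prefixes become global
        time += 1
    return p, time

if __name__ == "__main__":
    data = list(range(1, 17))                # n = 16, s = 4
    prefixes, t = bus_prefix(data)
    assert prefixes == [sum(data[:i + 1]) for i in range(len(data))]
    print(t)                                 # Θ(n^(1/2)) parallel steps
```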