CIS6930 Parallel Computing
Fall 2006
Exam # 2

Name: __________________________________________
UFID: ____________ - ____________
E-mail: _________________________________________

Instructions:
1. Write neatly and legibly.
2. While grading, not only your final answer but also your approach to the problem will be evaluated.
3. You have to attempt all the problems (100 points).
4. Total time for the exam is 120 minutes.
5. When deriving expressions for runtime, you may like to detail all the appropriate steps. Otherwise, no partial credit will be awarded for incorrect expressions.

I have read carefully, and have understood the above instructions. On my honor, I have neither given nor received unauthorized aid on this examination.

Signature: _____________________________________
Date: ____ (MM) / ____ (DD) / ___________ (YYYY)

Question 1 (30 points)

A sample sort is an improved version of bucket sort and can be described as follows: A sample of size s is selected from the n-element sequence, and the range of the buckets is determined by sorting the sample and choosing m-1 elements from the result. These elements (called splitters) divide the sample into m equal-sized buckets. After defining the buckets, the algorithm proceeds in the same way as bucket sort. The performance of sample sort depends on the sample size s and the way it is selected from the n-element sequence.

Consider a splitter selection scheme that guarantees that the number of elements ending up in each bucket is roughly the same for all buckets. Let n be the number of elements to be sorted and m be the number of buckets. The scheme works as follows. It divides the n elements into m blocks of size n/m each, and sorts each block by using quicksort. From each sorted block it chooses m-1 evenly spaced elements. The m(m-1) elements selected from all the blocks represent the sample used to determine the buckets.

1) Derive the parallel formulation of the splitter selection scheme on a p-processor computer.
Assume the number of buckets is selected to be m = p. (15pts)
2) Derive the time complexity of this scheme on a message-passing computer with p processes and O(p) bisection bandwidth. (15pts)

Answer:
1) Initially, each process is assigned a block of n/p elements, which it sorts sequentially. It then chooses p-1 evenly spaced elements from the sorted block. Each process sends its p-1 sample elements to one process, say P0. Process P0 then sequentially sorts the p(p-1) sample elements and selects the p-1 splitters. Finally, process P0 broadcasts the p-1 splitters to all the other processes, which use them to do the bucket sort.
2) The internal sort of n/p elements requires time $\Theta((n/p)\log(n/p))$, and the selection of p-1 sample elements requires time $\Theta(p)$. Sending p-1 elements to process P0 is similar to a gather operation; the time required is $\Theta(p^2)$. The time to internally sort the p(p-1) sample elements at P0 is $\Theta(p^2 \log p)$, and the time to select p-1 splitters is $\Theta(p)$. The parallel run time of the splitter selection scheme is

$T_P = \Theta((n/p)\log(n/p)) + \Theta(p^2 \log p)$

Question 2 (40 points)

1) (10 points) Prove that if a sorting network sorts every sequence of 0's and 1's, then it sorts every arbitrary sequence of values.

2) (15 points) A bitonic sequence is a sequence of elements $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ with the property that there exists an index $i$, $0 \le i \le n-1$, such that $\langle a_0, \ldots, a_i \rangle$ is monotonically increasing and $\langle a_{i+1}, \ldots, a_{n-1} \rangle$ is monotonically decreasing. A bitonic split operation splits a bitonic sequence s of size n into the two bitonic sequences, $s_1$ and $s_2$, defined by the equations below:

$s_1 = \langle \min\{a_0, a_{n/2}\}, \min\{a_1, a_{n/2+1}\}, \ldots, \min\{a_{n/2-1}, a_{n-1}\} \rangle$
$s_2 = \langle \max\{a_0, a_{n/2}\}, \max\{a_1, a_{n/2+1}\}, \ldots, \max\{a_{n/2-1}, a_{n-1}\} \rangle$

Show how you can use the bitonic split to sort the following bitonic sequence [2 5 6 7 8 5 3 1].

3) (15 points) Develop a parallel sorting algorithm that uses a parallel algorithm for sorting bitonic sequences to sort arbitrary (non-bitonic) sequences.
Answer:
1) Proof: Let N denote the sorting network. Suppose $a$ with $a_i \in A$ is an arbitrary sequence which is not sorted by N. This means N(a) = b is unsorted, i.e. there is a position k such that $b_k > b_{k+1}$.

Now define a mapping $f : A \to \{0, 1\}$ as follows. For all $c \in A$ let

$f(c) = \begin{cases} 0 & \text{if } c < b_k \\ 1 & \text{if } c \ge b_k \end{cases}$

Obviously, f is monotonic. Moreover we have $f(b_k) = 1$ and $f(b_{k+1}) = 0$, i.e. f(b) = f(N(a)) is unsorted. Because f is monotonic, it commutes with every comparator: f(min(x, y)) = min(f(x), f(y)) and f(max(x, y)) = max(f(x), f(y)), so f(N(a)) = N(f(a)). This means that N(f(a)) is unsorted or, in other words, that the 0-1-sequence f(a) is not sorted by the comparator network N.

We have shown that, if there is an arbitrary sequence a that is not sorted by N, then there is a 0-1-sequence f(a) that is not sorted by N. Equivalently, if there is no 0-1-sequence that is not sorted by N, then there can be no sequence a whatsoever that is not sorted by N. Equivalently again, if all 0-1-sequences are sorted by N, then all arbitrary sequences are sorted by N.

2) See Figure 9.6 in the textbook for an example. Each step applies the bitonic split to every subsequence produced by the previous step:

Input:  2 5 6 7 8 5 3 1
Step 1: 2 5 3 1 | 8 5 6 7
Step 2: 2 1 | 3 5 | 6 5 | 8 7
Step 3: 1 | 2 | 3 | 5 | 5 | 6 | 7 | 8

3) See Section 9.2.1 in the textbook.

Question 3 (30 points)

Consider the simplified version of the polygon-triangulation problem: Given a simple polygon $\langle v_0, v_1, \ldots, v_{n-1} \rangle$, break the polygon into a set of triangles by connecting nodes of the polygon with chords. A possible triangulation is illustrated in the figure below. The cost of constructing a triangle with nodes $v_i$, $v_j$ and $v_k$ is defined by a function $f(v_i, v_j, v_k)$. For this problem, let the cost be the total length of the edges of the triangle (using Euclidean distance). The optimal polygon-triangulation problem breaks up a polygon into a set of triangles such that the total length of each triangle (the sum of the individual lengths) is minimized. Define C[i, j] as the weight of an optimal triangulation of vertices $\langle v_{i-1}, \ldots, v_j \rangle$.
Here is a recursive equation that can be used to determine C[i, j]:

$C[i, j] = \begin{cases} \min_{i \le k < j} \{ C[i, k] + C[k+1, j] + f(v_{i-1}, v_k, v_j) \} & i < j \\ 0 & i = j \end{cases}$

The objective is to determine C[1, n-1].

1) Derive a parallel formulation for p processing elements on a hypercube. (20pts)
2) Determine its parallel run time. (10pts)

Answer:
1) A parallel formulation can be derived as follows. We use a bottom-up approach for constructing the table C that stores the value of C[i, j]. The algorithm fills the table diagonally. An example is shown in the figure below. Entries in diagonal l correspond to the cost of triangulation of a polygon with l+1 nodes.

Consider the parallel formulation of this algorithm on a hypercube with p (1 <= p <= n) processing elements. If there are n nodes in a diagonal, each processing element stores n/p nodes. Each processing element computes the cost C[i, j] of the entries assigned to it. After computation, an all-to-all broadcast sends the solution costs of the sub-problems for the most recently computed diagonal to all the other processing elements.

2) The time taken for all-to-all broadcast of n/p words is $t_s \log p + t_w n(p-1)/p \approx t_s \log p + t_w n$. The time to compute n/p entries of the table in the l-th diagonal is $l t_c n/p$, where $t_c$ denotes the time to compute the term $C[i, k] + C[k+1, j] + f(v_{i-1}, v_k, v_j)$. The parallel run time is

$T_P = \sum_{l=1}^{n-1} \left( l t_c n/p + t_s \log p + t_w n \right) = \frac{n^2 (n-1)}{2p} t_c + t_s (n-1) \log p + t_w n (n-1)$

Summary of communication times of various operations discussed in the textbook on an interconnection network with cut-through routing.
| Operation | Ring | Mesh | Hypercube |
|---|---|---|---|
| One-to-all broadcast, All-to-one reduction | $(t_s + t_w m) \log p$ | $(t_s + t_w m) \log p$ | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| All-to-all broadcast, All-to-all reduction | $(t_s + t_w m)(p - 1)$ | $2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$ | $t_s \log p + t_w m (p - 1)$ |
| All-reduce | n/a | n/a | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| Scatter, Gather | n/a | n/a | $t_s \log p + t_w m (p - 1)$ |
| All-to-all personalized | $(t_s + t_w m p / 2)(p - 1)$ | $(2 t_s + t_w m p)(\sqrt{p} - 1)$ | $(t_s + t_w m)(p - 1)$ |
| Circular shift | n/a | $(t_s + t_w m)(\sqrt{p} + 1)$ | $t_s + t_w m$ |
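
As a rough numerical check, the hypercube all-to-all broadcast time and the Question 3 run time can be evaluated in a few lines of Python. This is only a sketch: the function names and the sample values of n, p, t_c, t_s and t_w are illustrative assumptions, not part of the exam, and log p is taken as log base 2.

```python
import math


def all_to_all_broadcast_hypercube(t_s, t_w, m, p):
    """Hypercube all-to-all broadcast time: t_s log p + t_w m (p - 1)."""
    return t_s * math.log2(p) + t_w * m * (p - 1)


def one_to_all_broadcast_hypercube(t_s, t_w, m, p):
    """Hypercube one-to-all broadcast time: the smaller of the two schemes."""
    log_p = math.log2(p)
    return min((t_s + t_w * m) * log_p, 2 * (t_s * log_p + t_w * m))


def triangulation_time_summed(n, p, t_c, t_s, t_w):
    """Question 3 run time, summed diagonal by diagonal (l = 1 .. n-1).

    Uses the approximation t_s log p + t_w n for the per-diagonal broadcast.
    """
    log_p = math.log2(p)
    return sum(l * t_c * n / p + t_s * log_p + t_w * n for l in range(1, n))


def triangulation_time_closed(n, p, t_c, t_s, t_w):
    """Closed form: n^2 (n-1) t_c / (2p) + t_s (n-1) log p + t_w n (n-1)."""
    return (n ** 2 * (n - 1) * t_c / (2 * p)
            + t_s * (n - 1) * math.log2(p)
            + t_w * n * (n - 1))


if __name__ == "__main__":
    # Hypothetical parameter values chosen only for illustration.
    n, p, t_c, t_s, t_w = 16, 4, 1.0, 10.0, 2.0
    print(all_to_all_broadcast_hypercube(t_s, t_w, n // p, p))
    print(triangulation_time_summed(n, p, t_c, t_s, t_w))
    print(triangulation_time_closed(n, p, t_c, t_s, t_w))
```

With these sample values the summed and closed-form run times agree, which is a quick way to confirm the algebra in the Question 3 answer. The exact broadcast time for m = n/p words is slightly below the approximation t_s log p + t_w n used in the sum.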