Exam 2 Solutions

CIS6930
Parallel Computing
Fall 2006
Exam # 2
Name: __________________________________________
UFID: ____________ - ____________
E-mail: _________________________________________
Instructions:
1. Write neatly and legibly.
2. While grading, not only your final answer but also your
approach to the problem will be evaluated.
3. You have to attempt all the problems (100 points).
4. Total time for the exam is 120 minutes.
5. When deriving expressions for runtime, show all the
appropriate steps. Otherwise, no partial credit
will be awarded for incorrect expressions.
I have read carefully, and have understood the above
instructions. On my honor, I have neither given nor
received unauthorized aid on this examination.
Signature: _____________________________________
Date: ____ (MM) / ____ (DD) / ___________ (YYYY)
Question 1 (30 points)
A sample sort is an improved version of bucket sort and can be described as follows:
A sample of size s is selected from the n-element sequence, and the range of the
buckets is determined by sorting the sample and choosing m-1 elements from the
result. These elements (called splitters) divide the sample into m equal-sized buckets.
After defining the buckets, the algorithm proceeds in the same way as bucket sort.
The performance of sample sort depends on the sample size s and the way it is
selected from the n-element sequence. Consider a splitter selection scheme that
guarantees that the number of elements ending up in each bucket is roughly the same
for all buckets. Let n be the number of elements to be sorted and m be the number of
buckets. The scheme works as follows. It divides the n elements into m blocks of size
n/m each, and sorts each block by using quicksort. From each sorted block it chooses
m-1 evenly spaced elements. The m(m-1) elements selected from all the blocks
represent the sample used to determine the buckets.
1) Derive the parallel formulation of the splitter selection scheme on a p-processor
computer. Assume the number of buckets is selected to be m = p. (15pts)
2) Derive the time complexity of this scheme on a message-passing computer with p
processes and O(p) bisection bandwidth. (15pts)
Answer:
1) Initially, each process is assigned a block of n/p elements, which it sorts sequentially.
It then chooses p-1 evenly spaced elements from the sorted block. Each process sends its
p-1 sample elements to one process – say P0. Process P0 then sequentially sorts the p(p-1)
sample elements and selects the p-1 splitters. Finally, process P0 broadcasts the p-1
splitters to all the other processes to do the bucket sort.
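The steps above can be sketched sequentially as follows. This is a minimal simulation, not a message-passing program: the function name `select_splitters`, the exact index formula used for "evenly spaced", and the assumption that p divides n are all illustrative choices that the exam does not specify.

```python
def select_splitters(data, p):
    """Simulate the splitter selection scheme with m = p buckets."""
    n = len(data)
    block = n // p  # assumption: p divides n evenly
    samples = []
    for rank in range(p):
        # Each simulated process sorts its own block of n/p elements.
        local = sorted(data[rank * block:(rank + 1) * block])
        # It then picks p-1 evenly spaced elements from the sorted block.
        samples.extend(local[(k * block) // p] for k in range(1, p))
    # The gathering process ("P0") sorts the p(p-1) samples ...
    samples.sort()
    # ... and picks p-1 evenly spaced splitters, which it would then broadcast.
    return [samples[k * (p - 1)] for k in range(1, p)]
```

In a real implementation the loop over `rank` runs concurrently on the p processes, and the `extend` and `return` steps correspond to the gather and broadcast operations.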
2) The internal sort of n/p elements requires time $\Theta((n/p) \log(n/p))$, and the selection of
p-1 sample elements requires time $\Theta(p)$. Sending p-1 elements to process P0 is similar to
a gather operation; the time required is $\Theta(p^2)$. The time to internally sort the p(p-1)
sample elements at P0 is $\Theta(p^2 \log p)$, and the time to select p-1 splitters is $\Theta(p)$. The
parallel run time of the splitter selection scheme is
$T_P = \Theta((n/p) \log(n/p)) + \Theta(p^2 \log p)$.
Question 2 (40 points)
1) (10 points) Prove that if a sorting network sorts every sequence of 0's and 1's, then
it sorts every arbitrary sequence of values.
2) (15 points) A bitonic sequence is a sequence of elements $\langle a_0, a_1, \ldots, a_{n-1} \rangle$ with
the property that there exists an index $i$, $0 \le i \le n-1$, such that $\langle a_0, \ldots, a_i \rangle$ is
monotonically increasing and $\langle a_{i+1}, \ldots, a_{n-1} \rangle$ is monotonically decreasing. A
bitonic split operation splits a bitonic sequence s of size n into the two bitonic
sequences, $s_1$ and $s_2$, defined by the equations below:
$s_1 = \langle \min\{a_0, a_{n/2}\}, \min\{a_1, a_{n/2+1}\}, \ldots, \min\{a_{n/2-1}, a_{n-1}\} \rangle$
$s_2 = \langle \max\{a_0, a_{n/2}\}, \max\{a_1, a_{n/2+1}\}, \ldots, \max\{a_{n/2-1}, a_{n-1}\} \rangle$
Show how you can use the bitonic split to sort the following bitonic sequence [2 5
6 7 8 5 3 1].
3) (15 points) Develop a parallel sorting algorithm that uses a parallel algorithm for
sorting bitonic sequences to sort arbitrary (non-bitonic) sequences.
Answer:
1) Proof: Let N denote the sorting network. Suppose $a$ with $a_i \in A$ is an arbitrary
sequence which is not sorted by N. This means $N(a) = b$ is unsorted, i.e. there is a
position $k$ such that $b_k > b_{k+1}$.
Now define a mapping $f : A \to \{0, 1\}$ as follows. For all $c \in A$ let
$$f(c) = \begin{cases} 0 & \text{if } c < b_k \\ 1 & \text{if } c \ge b_k \end{cases}$$
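Obviously, f is monotonic. Moreover we have:
$f(b_k) = 1$ and $f(b_{k+1}) = 0$
i.e. f(b) = f(N(a)) is unsorted.
Because f is monotonic, it commutes with every comparator: $f(\min(x, y)) = \min(f(x), f(y))$
and $f(\max(x, y)) = \max(f(x), f(y))$. Applying this comparator by comparator gives N(f(a)) = f(N(a)).
This means that N(f(a)) is unsorted or, in other words, that the 0-1-sequence f(a) is not
sorted by the comparator network N.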
We have shown that, if there is an arbitrary sequence a that is not sorted by N, then there
is a 0-1-sequence f(a) that is not sorted by N.
Equivalently, if there is no 0-1-sequence that is not sorted by N, then there can be no
sequence a whatsoever that is not sorted by N.
Equivalently again, if all 0-1-sequences are sorted by N, then all arbitrary sequences are
sorted by N.
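The statement can be checked by brute force on a small network. The sketch below is illustrative only: the comparator list `GOOD` is a well-known 5-comparator network for 4 inputs chosen as an example, `BAD` is a deliberately broken variant, and the helper names are assumptions that do not come from the exam.

```python
from itertools import permutations, product

def apply_network(network, values):
    """Run a comparator network; each (i, j) with i < j puts the min at i, the max at j."""
    out = list(values)
    for i, j in network:
        if out[i] > out[j]:
            out[i], out[j] = out[j], out[i]
    return out

def sorts_all(network, inputs):
    """True if the network sorts every sequence in `inputs`."""
    return all(apply_network(network, x) == sorted(x) for x in inputs)

# Example networks (assumptions for illustration).
GOOD = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]
BAD = GOOD[:-1]  # dropping the final comparator breaks the network

zero_one_inputs = list(product([0, 1], repeat=4))
all_permutations = list(permutations(range(4)))

print(sorts_all(GOOD, zero_one_inputs), sorts_all(GOOD, all_permutations))
print(sorts_all(BAD, zero_one_inputs), sorts_all(BAD, all_permutations))
```

Checking the 16 zero-one inputs gives the same verdict as checking all 24 permutations, which is what the proof predicts.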
2) See Figure 9.6 in the textbook for an example.
Input:  2 5 6 7 8 5 3 1
Step 1: 2 5 3 1 | 8 5 6 7
Step 2: 2 1 | 3 5 | 6 5 | 8 7
Step 3: 1 | 2 | 3 | 5 | 5 | 6 | 7 | 8
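The three steps can be reproduced with a short sketch of the split operation applied recursively. The function names are illustrative, and the code assumes the input is bitonic with a length that is a power of two.

```python
def bitonic_split(s):
    """Split a bitonic sequence of even length into (s1, s2) as defined in the question."""
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2

def bitonic_merge(s):
    """Sort a bitonic sequence by splitting recursively until the pieces have length 1."""
    if len(s) <= 1:
        return list(s)
    s1, s2 = bitonic_split(s)
    # Every element of s1 is <= every element of s2, so the halves can be concatenated.
    return bitonic_merge(s1) + bitonic_merge(s2)

print(bitonic_split([2, 5, 6, 7, 8, 5, 3, 1]))  # ([2, 5, 3, 1], [8, 5, 6, 7])
print(bitonic_merge([2, 5, 6, 7, 8, 5, 3, 1]))  # [1, 2, 3, 5, 5, 6, 7, 8]
```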
3) See Section 9.2.1 in the textbook
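As a rough sequential sketch of the idea (not the textbook's network or its parallel mapping): sort the first half in increasing order and the second half in decreasing order, which yields a bitonic sequence, then merge it with bitonic splits. The function names and the `ascending` flag are assumptions, and the length is assumed to be a power of two.

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence (length a power of two) in the requested direction."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)

def bitonic_sort(s, ascending=True):
    """Sort an arbitrary sequence whose length is a power of two."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    first = bitonic_sort(s[:half], True)     # increasing half
    second = bitonic_sort(s[half:], False)   # decreasing half
    # first + second is bitonic, so one bitonic merge finishes the job.
    return bitonic_merge(first + second, ascending)
```

In a parallel setting the two recursive sorts are independent, and the min/max pairs inside each split can all be evaluated at the same time.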
Question 3 (30 points)
Consider the simplified version of the polygon-triangulation problem:
Given a simple polygon < v0, v1 …, vn-1>, break the polygon into a set of triangles
by connecting nodes of the polygon with chords. A possible triangulation is
illustrated in the figure below. The cost of constructing a triangle with nodes vi, vj
and vk is defined by a function f(vi, vj, vk). For this problem, let the cost be the
total length of the edges of the triangle (using Euclidean distance).
The optimal polygon-triangulation problem breaks up a polygon into a set of triangles
such that the total cost of the triangles (the sum of their individual perimeter lengths) is minimized.
Define C[i, j] as the weight of an optimal triangulation of vertices $\langle v_{i-1}, \ldots, v_j \rangle$. Here is a
recursive equation that can be used to determine C[i, j]:
$$C[i, j] = \begin{cases} \min_{i \le k < j} \{ C[i, k] + C[k+1, j] + f(v_{i-1}, v_k, v_j) \} & \text{if } i < j \\ 0 & \text{if } i = j \end{cases}$$
The objective is to determine C[1, n-1].
1) Derive a parallel formulation for p processing elements on a hypercube. (20pts)
2) Determine its parallel run time. (10pts)
Answer:
1) A parallel formulation can be derived as follows. We use a bottom-up approach for
constructing the table C that stores the value of C[i, j]. The algorithm fills the table
diagonally. An example is shown in the figure below. Entries in diagonal l correspond
to the cost of triangulation of a polygon with l +1 nodes. Consider the parallel
formulation of this algorithm on a hypercube with p (1 <= p <= n) processing
elements. If there are n entries in a diagonal, each processing element stores n/p of them.
Each processing element computes the cost C[i, j] of the entries assigned to it. After
computation, an all-to-all broadcast sends the solution costs of the sub-problems for
the most recently computed diagonal to all the other processing elements.
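A sequential sketch of the diagonal-by-diagonal fill is given below. It is only an illustration: the function name is hypothetical, the polygon is assumed to be convex so that every chord is valid, and the loop variable d is the offset j - i rather than the diagonal number l used in the text. The comment marks where the work would be divided among the p processing elements.

```python
from math import dist

def triangulation_cost(points):
    """Return C[1][n-1] for a polygon whose vertex v_i is points[i]."""
    n = len(points)

    def f(a, b, c):
        # Perimeter of triangle (v_a, v_b, v_c) using Euclidean distance.
        return (dist(points[a], points[b]) + dist(points[b], points[c])
                + dist(points[c], points[a]))

    # C[i][j] is used for 1 <= i <= j <= n-1; row and column 0 are padding.
    C = [[0.0] * n for _ in range(n)]
    for d in range(1, n - 1):  # d = j - i, one diagonal per iteration
        # Entries on one diagonal are independent of each other, so each of the
        # p processing elements could compute its share here, followed by an
        # all-to-all broadcast of the new diagonal.
        for i in range(1, n - d):
            j = i + d
            C[i][j] = min(C[i][k] + C[k + 1][j] + f(i - 1, k, j)
                          for k in range(i, j))
    return C[1][n - 1]
```

For a unit square, either diagonal gives two triangles of perimeter 2 + sqrt(2) each, so the result is 4 + 2 sqrt(2).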
2) The time taken for all-to-all broadcast of n/p words is
$t_s \log p + t_w n(p-1)/p \approx t_s \log p + t_w n$. The time to compute n/p entries of the table
in the l-th diagonal is $l t_c n/p$, where $t_c$ denotes the time to compute the term
$C[i, k] + C[k+1, j] + f(v_{i-1}, v_k, v_j)$. The parallel run time is
$$T_P = \sum_{l=1}^{n-1} \left( l t_c \frac{n}{p} + t_s \log p + t_w n \right) = \frac{n^2 (n-1)}{2p} t_c + t_s (n-1) \log p + t_w n (n-1)$$
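The closed form follows from the identity 1 + 2 + ... + (n-1) = n(n-1)/2. A quick numeric check is sketched below; the function names and the sample parameter values are arbitrary assumptions, and the logarithm is taken base 2.

```python
from math import log2

def tp_sum(n, p, tc, ts, tw):
    """Parallel run time as the sum over diagonals l = 1 .. n-1."""
    return sum(l * tc * n / p + ts * log2(p) + tw * n for l in range(1, n))

def tp_closed(n, p, tc, ts, tw):
    """Closed-form expression for the same sum."""
    return (n * n * (n - 1) * tc / (2 * p)
            + ts * (n - 1) * log2(p)
            + tw * n * (n - 1))

# Hypothetical values, chosen only to compare the two expressions.
print(tp_sum(64, 8, 1.0, 50.0, 2.0), tp_closed(64, 8, 1.0, 50.0, 2.0))
```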
Summary of communication times of various operations discussed in the textbook on an
interconnection network with cut-through routing.

| Operation | Ring | Mesh | Hypercube |
|---|---|---|---|
| One-to-all broadcast, All-to-one reduction | $(t_s + t_w m) \log p$ | $(t_s + t_w m) \log p$ | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| All-to-all broadcast, All-to-all reduction | $(t_s + t_w m)(p - 1)$ | $2 t_s (\sqrt{p} - 1) + t_w m (p - 1)$ | $t_s \log p + t_w m (p - 1)$ |
| All-reduce | n/a | n/a | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| Scatter, Gather | $(t_s + t_w m)(p - 1)$ | n/a | $t_s \log p + t_w m (p - 1)$ |
| All-to-all personalized | $(t_s + t_w m p / 2)(p - 1)$ | $(2 t_s + t_w m p)(\sqrt{p} - 1)$ | $(t_s + t_w m)(p - 1)$ |
| Circular shift | n/a | n/a | $t_s + t_w m$ |