Parallel Sorting
Sathish Vadhiyar
Sorting
 Sorting n keys over p processors
 Sort and move the keys to the appropriate
processor so that every key on processor k
is larger than every key on processor k-1
 The number of keys on any processor should
not be larger than (n/p + thres)
 Communication-intensive due to large
migration of data between processors
Bitonic Sort
 One of the traditional algorithms for
parallel sorting
 Follows a divide-and-conquer algorithm
 Also has nice properties – only a pair of
processors communicate at each stage
 Can be mapped efficiently to hypercube
and mesh networks
Bitonic Sequence
 Bitonic sort works by rearranging a bitonic sequence into a sorted sequence
 Bitonic sequence – a sequence of elements (a0, a1, a2, …, an-1) for which there exists an index i such that a0 <= a1 <= … <= ai and ai >= ai+1 >= … >= an-1
 Or there exists a cyclic shift of indices
satisfying the above
 E.g.: (1,2,4,7,6,0) or (8,9,2,1,0,4)
Using bitonic sequence for sorting
 Let s = (a0, a1, …, an-1) be a bitonic sequence such that a0 <= a1 <= … <= an/2-1 and an/2 >= an/2+1 >= … >= an-1
 Consider
S1 = (min(a0, an/2), min(a1, an/2+1), …, min(an/2-1, an-1)) and
S2 = (max(a0, an/2), max(a1, an/2+1), …, max(an/2-1, an-1))
 Both S1 and S2 are bitonic sequences
 Every element of S1 is smaller than every element of S2
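As an illustration (a minimal sequential sketch, not from the slides; the function name bitonic_split is mine), the split below pairs element i with element i+n/2 and takes element-wise minima and maxima:

```python
# Minimal sketch of a bitonic split (sequential, illustrative only).
def bitonic_split(s):
    """Split a bitonic sequence s (even length n) into the pairwise
    min/max halves S1 and S2 of the pairs (a_i, a_{i+n/2})."""
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2

if __name__ == "__main__":
    s = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]  # bitonic
    s1, s2 = bitonic_split(s)
    print(s1)  # every element here is <= every element of s2
    print(s2)  # both halves are again bitonic
```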
Using bitonic sequence for sorting
 Thus, the initial problem of rearranging a bitonic sequence of size n is reduced to rearranging two smaller bitonic sequences and concatenating the results
 This splitting operation is called a bitonic split
 It is applied recursively until the size is 1, at which point the sequence is sorted; the number of split levels is log n
 This procedure of sorting a bitonic sequence using bitonic splits is called bitonic merge
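A compact sequential sketch of bitonic merge, assuming a power-of-two length; it applies the bitonic split recursively, and the ascending flag (my addition) plays the role of the descending -BM networks that appear later:

```python
# Sequential sketch of bitonic merge: recursively apply bitonic splits
# until subsequences of size 1 remain, then concatenate.
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence s using log(len(s)) levels of bitonic splits."""
    n = len(s)
    if n == 1:
        return s
    half = n // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

if __name__ == "__main__":
    s = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
    print(bitonic_merge(s))  # sorted output, as in the merging network figure
```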
Bitonic Merging Network
[Figure: a 16-input bitonic merging network rearranging the bitonic sequence (3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0) into sorted order (0, 3, 5, 8, 9, 10, 12, 14, 18, 20, 23, 35, 40, 60, 90, 95) through four columns of compare-exchange elements]
Takes a bitonic sequence and outputs it in sorted order; contains log n columns
A bitonic merging network with n inputs is denoted +BM[n]
Sorting unordered n elements
 By repeatedly merging bitonic sequences
of increasing length
[Figure: a bitonic sorting network for 16 elements; +BM[2]/-BM[2] blocks feed +BM[4]/-BM[4] blocks, then +BM[8]/-BM[8], and finally a single +BM[16]]
•An unsorted sequence can be viewed as a concatenation of bitonic sequences of size two
•Each stage merges adjacent bitonic sequences into increasing and decreasing order
•Forming a larger bitonic sequence
Bitonic Sort
 Eventually obtain a bitonic sequence of
size n which can be merged into a sorted
sequence
 Figure 9.8 in your book
 Total number of stages, d(n) = d(n/2) + log n = O(log²n)
 Total time complexity = O(n log²n)
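Putting the pieces together, a sequential sketch of the full sort under the same power-of-two assumption (function names are mine): each half is sorted in opposite directions so that their concatenation is bitonic, and a bitonic merge then finishes the job.

```python
# Sequential sketch of full bitonic sort (power-of-two input length assumed):
# sort the first half ascending and the second half descending so the whole
# sequence becomes bitonic, then apply a bitonic merge.
def bitonic_sort(s, ascending=True):
    n = len(s)
    if n == 1:
        return s
    half = n // 2
    first = bitonic_sort(s[:half], True)     # increasing half
    second = bitonic_sort(s[half:], False)   # decreasing half
    return bitonic_merge(first + second, ascending)

def bitonic_merge(s, ascending=True):
    n = len(s)
    if n == 1:
        return s
    half = n // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if not ascending:
        lo, hi = hi, lo
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)

if __name__ == "__main__":
    data = [20, 95, 90, 60, 40, 35, 23, 18, 0, 3, 5, 8, 9, 10, 12, 14]
    print(bitonic_sort(data))
```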
Parallel Bitonic Sort
Mapping to a Hypercube
 Imagine N processes (one element per
process).
 Each process id can be mapped to the
corresponding node number of the
hypercube.
 Communications between processes for
compare-exchange operations will always be
neighborhood communications
 In the ith step of the final stage, processes
communicate along the (d-(i-1))th dimension
 Figure 9.9 in the book
Parallel Bitonic Sort
Mapping to a Mesh
 Connectivity of a mesh is lower than that
of hypercube
 One mapping is row-major shuffled
mapping
0   1   4   5
2   3   6   7
8   9   12  13
10  11  14  15
 Processes that do frequent compare-exchanges are located close by
Mesh..
 For example, processes that perform
compare-exchange during every stage of
bitonic sort are neighbors
0   1   4   5
2   3   6   7
8   9   12  13
10  11  14  15
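The shuffled row-major labels in the grid above can be reproduced by interleaving the bits of the row and column coordinates; the sketch below is my reconstruction of that mapping, not code from the slides.

```python
# Sketch reconstructing the row-major shuffled mapping on a 4x4 mesh:
# the process id placed at mesh position (row, col) is obtained by
# interleaving the bits of row and col (row bit more significant in each pair).
def shuffled_label(row, col, bits):
    label = 0
    for b in reversed(range(bits)):            # most to least significant bit
        label = (label << 1) | ((row >> b) & 1)
        label = (label << 1) | ((col >> b) & 1)
    return label

if __name__ == "__main__":
    side, bits = 4, 2
    for r in range(side):
        print([shuffled_label(r, c, bits) for c in range(side)])
    # Expected grid (matches the slide):
    # [0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]
```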
Block of Elements per Process
General
[Figure: the same 16-input bitonic merging network, repeated for the block-of-elements-per-process case]
General..
 For a given stage, a process
communicates with only one other
process
 Communications are for only logP steps
 In a given step i, the communicating partner is determined by the ith bit of the process rank
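A tiny sketch of that partner computation (my illustration, assuming the rank numbering above): flipping bit i of the rank gives the process to compare-split with in the step that works on bit i.

```python
# Sketch (my notation): in the step that works on bit `bit`, process `rank`
# does a compare-split with the process whose rank differs only in that bit.
def partner(rank, bit):
    return rank ^ (1 << bit)

if __name__ == "__main__":
    # Partners of process 5 among 8 processes for the steps of the final
    # (largest) merge stage, working from the most significant bit down:
    print([partner(5, b) for b in reversed(range(3))])   # [1, 7, 4]
```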
Drawbacks
 Bitonic sort moves data between pairs of
processes
 Moves data O(logP) times
 Bottleneck for large P
 Sample Sort
Sample Sort
 A sample of data of size s is collected
from each processor; then samples are
combined on a single processor
 The processor produces p-1 splitters
from the sp-sized sample; broadcasts
the splitters to others
 Using the splitters, processors send
each key to the correct final destination
Parallel Sorting by Regular Sampling (PSRS)
1. Each processor sorts its local data
2. Each processor selects a sample vector of size p-1; the kth element is the local element at index (n/p)·(k+1)/p
3. The samples are sent to and merge-sorted on processor 0
4. Processor 0 defines a vector of p-1 splitters, starting from the (p/2)th sample; i.e., the kth splitter is the sample at index p(k+1/2); it broadcasts the splitters to the other processors
PSRS
5. Each processor sends local data to
correct destination processors based on
splitters; all-to-all exchange
6. Each processor merges the data chunk it
receives
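A sequential sketch of steps 1-4, simulating the p processors with plain Python lists; the 0-based index formulas are my reading of the slide's (n/p)(k+1)/p and p(k+1/2), and all local blocks are assumed to have equal size n/p.

```python
# Sketch of PSRS steps 1-4 on simulated "processors" (plain Python lists).
def psrs_splitters(blocks):
    p = len(blocks)
    n_over_p = len(blocks[0])          # equal block sizes assumed
    # Step 1: each processor sorts its local data.
    local = [sorted(b) for b in blocks]
    # Step 2: each processor picks p-1 regularly spaced samples:
    # the kth sample is the local element at index (n/p)*(k+1)/p.
    samples = []
    for b in local:
        samples.extend(b[(n_over_p * (k + 1)) // p] for k in range(p - 1))
    # Step 3: the p(p-1) samples are gathered and sorted on one processor.
    samples.sort()
    # Step 4: choose p-1 splitters, the kth at sample index p*k + p/2.
    splitters = [samples[p * k + p // 2] for k in range(p - 1)]
    return local, splitters

if __name__ == "__main__":
    import random
    random.seed(0)
    p, n_over_p = 4, 16
    blocks = [[random.randrange(1000) for _ in range(n_over_p)] for _ in range(p)]
    local, splitters = psrs_splitters(blocks)
    print(splitters)   # p-1 = 3 splitters, broadcast to all processors
```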
Step 5
 Each processor finds where each of the
p-1 pivots divides its list, using a binary
search
 i.e., finds the index of the largest local element that is not larger than the jth pivot
 At this point, each processor has p
sorted sublists with the property that
each element in sublist i is greater than
each element in sublist i-1 in any
processor
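A sketch of the binary searches in step 5, using Python's bisect to locate each splitter in a sorted local block (data values are hypothetical):

```python
import bisect

# Sketch of step 5: a processor uses binary search on its sorted local block
# to find where each of the p-1 splitters falls, yielding p sublists; sublist
# i is destined for processor i.
def partition_by_splitters(sorted_block, splitters):
    cuts = [0] + [bisect.bisect_right(sorted_block, s) for s in splitters] \
               + [len(sorted_block)]
    return [sorted_block[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]

if __name__ == "__main__":
    block = [2, 4, 7, 9, 15, 21, 30, 42]
    print(partition_by_splitters(block, [8, 20, 35]))
    # [[2, 4, 7], [9, 15], [21, 30], [42]]
```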
Step 6
 Each processor i performs a p-way
merge-sort to merge the ith sublists of
p processors
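A sketch of step 6, using a heap-based p-way merge over hypothetical received chunks:

```python
import heapq

# Sketch of step 6: processor i receives one sorted sublist from every
# processor and merges them with a p-way merge.
received = [[3, 9, 27], [1, 8], [5, 6, 30], [2, 4]]   # hypothetical chunks
merged = list(heapq.merge(*received))
print(merged)   # [1, 2, 3, 4, 5, 6, 8, 9, 27, 30]
```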
Example
[Figure: worked PSRS example, not reproduced here]
Analysis
 The first phase of local sorting takes O((n/p) log(n/p))
 2nd phase:
 Sorting p(p-1) elements on processor 0 – O(p² log p²)
 Each processor performs p-1 binary searches over n/p elements – O(p log(n/p))
 3rd phase: Each processor merges (p-1) sublists
 Size of data merged by any processor is no more than 2n/p (proof)
 Complexity of this merge sort: 2(n/p) log p
 Summing up: O((n/p) log n)
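Restating the sum of the phases (my back-of-the-envelope consolidation, assuming n is large relative to p, roughly n >= p³, so the splitter-selection term does not dominate):

```latex
T_{\text{PSRS}}(n,p)
  = \underbrace{O\!\left(\tfrac{n}{p}\log\tfrac{n}{p}\right)}_{\text{local sorts}}
  + \underbrace{O\!\left(p^{2}\log p\right) + O\!\left(p\log\tfrac{n}{p}\right)}_{\text{splitters, binary searches}}
  + \underbrace{O\!\left(\tfrac{n}{p}\log p\right)}_{\text{final merges}}
  = O\!\left(\tfrac{n}{p}\log n\right)
```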
Analysis
 1st phase – no communication
 2nd phase – p(p-1) data collected; p-1
data broadcast
 3rd phase: Each processor sends (p-1)
sublists to other p-1 processors;
processors work on the sublists
independently
Analysis
Not scalable for a large number of processors
Merging of p(p-1) elements is done on one processor; 16384 processors require 16 GB of memory
Sorting by Random Sampling
 An interesting alternative; random
sample is flexible in size and collected
randomly from each processor’s local
data
 Advantage
 A random sample can be retrieved before local sorting, allowing overlap between sorting and splitter calculation
Sources/References
 On the versatility of parallel sorting by regular sampling. Li et al. Parallel Computing, 1993.
 Parallel sorting by regular sampling. Shi and Schaeffer. JPDC, 1992.
 Highly scalable parallel sorting. Solomonik and Kale. IPDPS, 2010.
 END
Bitonic Sort - Compare-splits
 When dealing with a block of elements per process,
instead of compare-exchange, use compare-split
 i.e., each process sorts its local elements; then each process in a pair sends all its elements to the receiving process
 Both processes do the rearrangement with all the
elements
 The process then sends only the necessary
elements in the rearranged order to the other
process
 Reduces data communication latencies
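A sketch of one compare-split, simulating the two processes' equal-sized sorted blocks with lists; after merging, one side keeps the smaller half and the other the larger half. (The slide's refinement of sending back only the necessary elements is a communication optimization on top of this.)

```python
import heapq

# Sketch of a compare-split between two processes, each holding a sorted
# block of n/p elements (simulated here with plain lists).
def compare_split(block_low, block_high):
    """block_low keeps the smaller half, block_high the larger half."""
    merged = list(heapq.merge(block_low, block_high))   # both inputs sorted
    k = len(block_low)
    return merged[:k], merged[k:]

if __name__ == "__main__":
    a = [1, 5, 9, 13]    # sorted block on one process
    b = [2, 3, 10, 20]   # sorted block on its partner
    low, high = compare_split(a, b)
    print(low)    # [1, 2, 3, 5]
    print(high)   # [9, 10, 13, 20]
```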
Block of Elements and Compare-Splits
 Think of blocks as elements
 Problem of sorting p blocks is identical to
performing bitonic sort on the p blocks
using compare-split operations
 log²p steps
 At the end, all n elements are sorted since
compare-splits preserve the initial order in
each block
 n/p elements assigned to each process are
sorted initially using a fast sequential
algorithm
Block of Elements per Process
Hypercube and Mesh
 Similar to one element per process case,
but now we have p blocks of size n/p, and
compare exchanges are replaced by
compare-splits
 Each compare-split takes O(n/p)
computation and O(n/p) communication time
 For a hypercube, the complexity is:
 O((n/p) log(n/p)) for sorting
 O((n/p) log²p) for computation
 O((n/p) log²p) for communication
Histogram Sort
 Another splitter-based method
 Histogram sort also determines a set of p-1 splitters
 It achieves this task by taking an
iterative approach rather than one big
sample
 A processor broadcasts k (> p-1) initial
splitter guesses called a probe
 The initial guesses are spaced evenly
over data range
Histogram Sort
Steps
1. Each processor sorts local data
2. Creates a histogram based on local data
and splitter guesses
3. Reduction sums up histograms
4. A processor analyzes which splitter
guesses were satisfactory (in terms of
load)
5. If any splitters are unsatisfactory, the processor broadcasts a new probe and the algorithm returns to step 2; else proceed to the next steps
Histogram Sort
Steps
6. Each processor sends local data to
appropriate processors – all-to-all
exchange
7. Each processor merges the data chunk it
receives
Merits:
 Only moves the actual data once
 Deals with uneven distributions
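A sketch of the local histogram in step 2, shown here in cumulative form (counts of local keys at or below each guess, which carries the same information as per-interval counts); summing these vectors across processors with a reduction yields the global histogram. Data and probe values are hypothetical.

```python
import bisect

# Sketch of histogram-sort step 2: given sorted local data and a probe of
# splitter guesses, count how many local keys fall at or below each guess.
def local_histogram(sorted_local, probe):
    return [bisect.bisect_right(sorted_local, g) for g in probe]

if __name__ == "__main__":
    local = [1, 4, 4, 7, 12, 18, 25, 31]
    probe = [5, 15, 26]                     # hypothetical splitter guesses
    print(local_histogram(local, probe))    # [3, 5, 7] keys <= each guess
```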
Probe Determination
 Should be efficient – done on one
processor
 The processor keeps track of bounds for
all splitters
 Ideal location of a splitter i is (i+1)n/p
 When a histogram arrives, the splitter
guesses are scanned
Probe Determination
 A splitter can either
 Be a success – its location is within some threshold of the
ideal location
 Or not – update the desired splitter to narrow the range
for the next guess
 The size of a generated probe depends on how many splitters are yet to be resolved
 Any interval containing s unachieved splitters is subdivided with s·k/u guesses, where u is the total number of unachieved splitters and k is the number of newly generated splitter guesses
Merging and all-to-all overlap
 For merging p arrays at the end
 Iterate through all arrays simultaneously
 Merge using a binary tree
 In the first case, we need all the arrays
to have arrived
 In the second case, we can start as soon
as two arrays arrive
 Hence this merging can be overlapped
with all-to-all
Radix Sort
 During every step, the algorithm puts
every key in a bucket corresponding to
the value of some subset of the key’s
bits
 A k-bit radix sort looks at k bits every
iteration
 Easy to parallelize – assign some subset
of buckets to each processor
 Load balance – assign a variable number of buckets to each processor
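A sequential sketch of a k-bit radix sort for non-negative integer keys; in the parallel version, the per-bucket counts of each pass are what every processor reports for the histogram reduction described next.

```python
# Sketch of a k-bit radix sort: each pass buckets keys by the k bits starting
# at bit position `shift`, then reads the buckets back in order (stable).
def radix_pass(keys, shift, k):
    buckets = [[] for _ in range(1 << k)]
    for key in keys:
        buckets[(key >> shift) & ((1 << k) - 1)].append(key)
    return [key for b in buckets for key in b]

def radix_sort(keys, k=4, key_bits=16):
    for shift in range(0, key_bits, k):
        keys = radix_pass(keys, shift, k)
    return keys

if __name__ == "__main__":
    print(radix_sort([833, 12, 7, 900, 411, 12, 65535, 0]))
    # [0, 7, 12, 12, 411, 833, 900, 65535]
```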
Radix Sort – Load Balancing
 Each processor counts how many of its
keys will go to each bucket
 Sum up these histograms with reductions
 Once a processor receives this combined
histogram, it can adaptively assign
buckets
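One way to picture the adaptive assignment (a greedy sketch of my own, not taken from the slides): given the reduced global histogram of bucket sizes, hand consecutive buckets to processors so that each receives roughly total/p keys.

```python
# Sketch of adaptive bucket-to-processor assignment from the summed
# (global) histogram of bucket sizes; histogram values are hypothetical.
def assign_buckets(global_hist, p):
    total = sum(global_hist)
    target = total / p
    owner, assignment, acc = 0, [], 0
    for count in global_hist:
        assignment.append(owner)          # bucket goes to current owner
        acc += count
        if acc >= target * (owner + 1) and owner < p - 1:
            owner += 1                    # move on once the owner has its share
    return assignment                     # assignment[b] = owning processor

if __name__ == "__main__":
    hist = [10, 12, 8, 10, 9, 11, 10, 10]
    print(assign_buckets(hist, 4))        # [0, 0, 1, 1, 2, 2, 3, 3]
```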
Radix Sort - Analysis
 Requires multiple iterations of costly all-to-all
 Cache efficiency is low – any given key
can move to any bucket irrespective of
the destination of the previously indexed
key
 Affects communication as well