Parallel Sorting
Sathish Vadhiyar

Sorting
• Sorting n keys over p processors
• Sort and move the keys to the appropriate processors so that every key on processor k is larger than every key on processor k-1
• The number of keys on any processor should not be larger than n/p + thres
• Communication-intensive due to the large migration of data between processors

Bitonic Sort
• One of the traditional algorithms for parallel sorting
• Follows a divide-and-conquer approach
• Has a nice property: only a pair of processors communicates at each stage
• Can be mapped efficiently to hypercube and mesh networks

Bitonic Sequence
• Bitonic sort rearranges a bitonic sequence into a sorted sequence
• Bitonic sequence: a sequence of elements (a0, a1, a2, ..., an-1) such that a0 <= a1 <= ... <= ai and ai >= ai+1 >= ... >= an-1 for some i, or there exists a cyclic shift of the indices satisfying this property
• E.g.: (1,2,4,7,6,0) or (8,9,2,1,0,4)

Using a bitonic sequence for sorting
• Let s = (a0, a1, ..., an-1) be a bitonic sequence such that a0 <= a1 <= ... <= an/2-1 and an/2 >= an/2+1 >= ... >= an-1
• Consider s1 = (min(a0, an/2), min(a1, an/2+1), ..., min(an/2-1, an-1)) and s2 = (max(a0, an/2), max(a1, an/2+1), ..., max(an/2-1, an-1))
• Both s1 and s2 are bitonic sequences
• Every element of s1 is smaller than every element of s2

Using a bitonic sequence for sorting
• Thus the initial problem of rearranging a bitonic sequence of size n is reduced to rearranging two smaller bitonic sequences and concatenating the results
• This splitting operation is called a bitonic split
• The split is applied recursively until the size is 1, at which point the sequence is sorted; the number of split levels is log n
• This procedure of sorting a bitonic sequence using bitonic splits is called a bitonic merge

Bitonic Merging Network
[Figure: a bitonic merging network +BM[16]; log n = 4 columns of compare-exchange elements transform the bitonic input (3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0) into the sorted output (0, 3, 5, 8, 9, 10, 12, 14, 18, 20, 23, 35, 40, 60, 90, 95).]
• Takes a bitonic sequence and outputs it in sorted order; contains log n columns
• A bitonic merging network with n inputs that produces increasing order is denoted +BM[n]; one that produces decreasing order is denoted -BM[n]

Sorting n unordered elements
• By repeatedly merging bitonic sequences of increasing length
[Figure: a bitonic sorting network for 16 inputs; a column of alternating +BM[2] and -BM[2] blocks feeds +BM[4] and -BM[4] blocks, then +BM[8] and -BM[8] blocks, and finally a single +BM[16] block.]
• An unsorted sequence can be viewed as a concatenation of bitonic sequences of size two
• Each stage merges adjacent bitonic sequences into increasing and decreasing order, forming a larger bitonic sequence

Bitonic Sort
• Eventually we obtain a bitonic sequence of size n, which can be merged into a sorted sequence
• Figure 9.8 in your book
• Total number of stages: d(n) = d(n/2) + log n = O(log^2 n)
• Total time complexity = O(n log^2 n)

Parallel Bitonic Sort – Mapping to a Hypercube
• Imagine n processes, one element per process
• Each process id can be mapped to the corresponding node number of the hypercube
• Communications between processes for compare-exchange operations are then always neighborhood communications
• In the ith step of the final stage, processes communicate along the (d-(i-1))th dimension
• Figure 9.9 in the book

Parallel Bitonic Sort – Mapping to a Mesh
• The connectivity of a mesh is lower than that of a hypercube
• One mapping is the row-major shuffled mapping:
   0  1  4  5
   2  3  6  7
   8  9 12 13
  10 11 14 15
• Processes that do frequent compare-exchanges are located close by
• For example, processes that perform a compare-exchange during every stage of bitonic sort are neighbors in this mapping
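To make the bitonic split and bitonic merge recursion concrete before moving on to blocks of elements, here is a minimal sequential Python sketch (one element per network wire, no process mapping); the function names bitonic_merge and bitonic_sort are illustrative. bitonic_merge implements the recursive bitonic splits of a +BM[n] network, and bitonic_sort builds bitonic sequences of increasing length exactly as in the +BM[2]/-BM[2] ... +BM[16] stages above.

```python
def bitonic_merge(a, ascending=True):
    """Sort a bitonic sequence using recursive bitonic splits
    (len(a) must be a power of two)."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    # Bitonic split: compare-exchange element i with element i + n/2.
    for i in range(half):
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]
    # Both halves are now bitonic, and every element of the first half is
    # <= (or >=, for a descending merge) every element of the second half.
    return (bitonic_merge(a[:half], ascending) +
            bitonic_merge(a[half:], ascending))

def bitonic_sort(a, ascending=True):
    """Sort an arbitrary sequence by building bitonic sequences of
    increasing length and merging them."""
    n = len(a)
    if n == 1:
        return a
    half = n // 2
    # Sort the two halves in opposite directions to form a bitonic sequence,
    first = bitonic_sort(a[:half], True)
    second = bitonic_sort(a[half:], False)
    # then merge the bitonic sequence in the requested direction.
    return bitonic_merge(first + second, ascending)

if __name__ == "__main__":
    bitonic_input = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
    print(bitonic_merge(bitonic_input))   # the merging-network example above
    print(bitonic_sort([8, 9, 2, 1, 0, 4, 7, 6]))
```

In the one-element-per-process version, each compare-exchange inside the split loop becomes a message exchange between two processes whose ids differ in exactly one bit, which is why the hypercube mapping yields neighbor-only communication.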
Block of Elements per Process
[Figure: the same bitonic merging network as above, now with a block of elements at each input instead of a single element.]

General
• For a given stage, a process communicates with only one other process
• Communications occur for only log p steps
• In a given step i, the communicating partner is determined by the ith bit of the process id

Drawbacks
• Bitonic sort moves data between pairs of processes
• The data is moved O(log p) times
• This becomes a bottleneck for large p

Sample Sort
• A sample of data of size s is collected from each processor; the samples are then combined on a single processor
• That processor produces p-1 splitters from the sp-sized sample and broadcasts the splitters to the others
• Using the splitters, processors send each key to its correct final destination

Parallel Sorting by Regular Sampling (PSRS)
1. Each processor sorts its local data
2. Each processor selects a sample vector of size p-1; the kth sample element is the local element at index (k+1)·n/p^2
3. The samples are sent to and merge-sorted on processor 0
4. Processor 0 defines a vector of p-1 splitters by taking every pth element of the sorted sample starting from the (p/2)th, i.e., the kth splitter is the sample element at position p(k + 1/2); it broadcasts the splitters to the other processors
5. Each processor sends its local data to the correct destination processors based on the splitters (an all-to-all exchange)
6. Each processor merges the data chunks it receives
(A sequential sketch of these steps appears at the end of this part.)

Step 5
• Each processor finds where each of the p-1 pivots divides its sorted local list, using a binary search, i.e., it finds the index of the largest element that is not larger than the jth pivot
• At this point, each processor has p sorted sublists with the property that each element in sublist i is greater than each element in sublist i-1 on any processor

Step 6
• Each processor i performs a p-way merge of the ith sublists of the p processors

Analysis – computation
• The first phase of local sorting takes O((n/p) log(n/p))
• 2nd phase: sorting the p(p-1) sample elements on processor 0 takes O(p^2 log p^2); each processor also performs p-1 binary searches over n/p elements, O(p log(n/p))
• 3rd phase: each processor merges the sublists it receives; the size of the data merged by any processor is no more than 2n/p (proof in Shi and Schaeffer); the complexity of this merge is 2(n/p) log p
• Summing up: O((n/p) log n)

Analysis – communication
• 1st phase: no communication
• 2nd phase: p(p-1) sample elements are gathered; p-1 splitters are broadcast
• 3rd phase: each processor sends p-1 sublists to the other p-1 processors; the processors then work on their sublists independently

Analysis – scalability
• Not scalable to a large number of processors
• The merging of the p(p-1) sample elements is done on one processor; with 16384 processors this requires 16 GB of memory

Sorting by Random Sampling
• An interesting alternative; a random sample is flexible in size and is collected randomly from each processor's local data
• Advantage: a random sample can be taken before local sorting, allowing overlap between sorting and splitter calculation

Sources/References
• On the Versatility of Parallel Sorting by Regular Sampling. Li et al. Parallel Computing, 1993.
• Parallel Sorting by Regular Sampling. Shi and Schaeffer. JPDC, 1992.
• Highly Scalable Parallel Sorting. Solomonik and Kale. IPDPS 2010.
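The PSRS steps above can be prototyped sequentially before worrying about communication. The sketch below is only an illustration: the name psrs and the p simulated "processors" (plain Python lists) are assumptions of this sketch, and the gather, broadcast, and all-to-all exchanges of the parallel algorithm are replaced by list operations. It follows steps 1-6 with the sample positions (k+1)·n/p^2 and splitter positions p(k+1/2) stated above.

```python
import bisect
import heapq
import random

def psrs(data, p):
    """Sequential simulation of Parallel Sorting by Regular Sampling (PSRS)
    with p simulated processors."""
    n = len(data)
    # Step 1: each processor sorts its local block of ~n/p keys.
    blocks = [sorted(data[i * n // p:(i + 1) * n // p]) for i in range(p)]
    # Step 2: each processor takes a regular sample of p-1 elements;
    # the kth sample is the local element at index (k+1)*n/p^2.
    samples = []
    for blk in blocks:
        m = len(blk)
        samples.extend(blk[(k + 1) * m // p] for k in range(p - 1))
    # Step 3: the samples are gathered and sorted on "processor 0".
    samples.sort()
    # Step 4: pick p-1 splitters at positions p(k + 1/2) of the sorted
    # sample and "broadcast" them.
    splitters = [samples[p * k + p // 2] for k in range(p - 1)]
    # Step 5: each processor splits its sorted block at the splitters via
    # binary search and "sends" sublist i to processor i (the all-to-all).
    recv = [[] for _ in range(p)]
    for blk in blocks:
        cuts = [0] + [bisect.bisect_right(blk, s) for s in splitters] + [len(blk)]
        for i in range(p):
            recv[i].append(blk[cuts[i]:cuts[i + 1]])
    # Step 6: each processor p-way merges the sublists it received.
    return [list(heapq.merge(*chunks)) for chunks in recv]

if __name__ == "__main__":
    keys = random.sample(range(10**6), 10**4)
    parts = psrs(keys, p=8)
    assert [x for part in parts for x in part] == sorted(keys)
    print([len(part) for part in parts])   # per-processor partition sizes
```

Since partition i holds exactly the keys that fall between splitters i-1 and i on every processor, concatenating the returned partitions gives the globally sorted sequence, which is the invariant described in Step 5.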
Bitonic Sort – Compare-Splits
• When dealing with a block of elements per process, the compare-exchange is replaced by a compare-split
• i.e., each process first sorts its local elements; each process in a pair then sends all its elements to the other process
• Both processes rearrange (merge) the combined elements, and each keeps only the necessary elements of the rearranged order
• Reduces data communication latencies

Block of Elements and Compare-Splits
• Think of the blocks as elements
• The problem of sorting the p blocks is then identical to performing bitonic sort on the p blocks using compare-split operations, which takes O(log^2 p) steps
• At the end, all n elements are sorted, since compare-splits preserve the sorted order within each block
• The n/p elements assigned to each process are sorted initially using a fast sequential algorithm

Block of Elements per Process – Hypercube and Mesh
• Similar to the one-element-per-process case, but now we have p blocks of size n/p, and compare-exchanges are replaced by compare-splits
• Each compare-split takes O(n/p) computation time and O(n/p) communication time
• For the hypercube, the complexity is O((n/p) log(n/p)) for the initial local sort, O((n/p) log^2 p) for the compare-split computation, and O((n/p) log^2 p) for communication

Histogram Sort
• Another splitter-based method
• Histogram sort also determines a set of p-1 splitters
• It achieves this task with an iterative approach rather than one big sample
• A processor broadcasts k (> p-1) initial splitter guesses, called a probe
• The initial guesses are spaced evenly over the data range

Histogram Sort Steps
1. Each processor sorts its local data
2. Each processor creates a histogram based on its local data and the splitter guesses
3. A reduction sums up the histograms
4. A processor analyzes which splitter guesses were satisfactory (in terms of load)
5. If any splitters are unsatisfactory, the processor broadcasts a new probe and the algorithm returns to step 2; otherwise it proceeds to the next steps
6. Each processor sends its local data to the appropriate processors (an all-to-all exchange)
7. Each processor merges the data chunks it receives
(A sketch of one probe-evaluation round, steps 2-4, follows this list.)
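To illustrate the probe evaluation in steps 2-4, here is a small sequential Python sketch. It is only a sketch under simplifying assumptions: the function name evaluate_probe, the single threshold parameter, and the nearest-guess success test are illustrative and not the exact refinement scheme of Solomonik and Kale; in the parallel algorithm the summed histogram would come from a reduction over per-processor histograms.

```python
import bisect
import random

def evaluate_probe(sorted_blocks, probe, n, p, threshold):
    """One probe-evaluation round of histogram sort (steps 2-4)."""
    k = len(probe)
    # Steps 2-3: each "processor" histograms its sorted block against the
    # probe; summing the local counts plays the role of the reduction.
    # counts[j] ends up as the global number of keys <= probe[j].
    counts = [0] * k
    for blk in sorted_blocks:
        for j, guess in enumerate(probe):
            counts[j] += bisect.bisect_right(blk, guess)
    # Step 4: splitter i is satisfied if some guess lands within 'threshold'
    # keys of its ideal location (i+1)*n/p; otherwise it stays unresolved
    # and would be targeted by the next, refined probe (step 5).
    achieved, unresolved = {}, []
    for i in range(p - 1):
        ideal = (i + 1) * n // p
        j = min(range(k), key=lambda g: abs(counts[g] - ideal))
        if abs(counts[j] - ideal) <= threshold:
            achieved[i] = probe[j]
        else:
            unresolved.append(i)
    return achieved, unresolved

if __name__ == "__main__":
    data = [random.randrange(10**6) for _ in range(4000)]
    p = 4
    blocks = [sorted(data[r::p]) for r in range(p)]      # sorted local data
    probe = list(range(0, 10**6, 10**6 // 8))            # 8 evenly spaced guesses
    print(evaluate_probe(blocks, probe, len(data), p, threshold=50))
```

An unresolved splitter would then have its bounds narrowed and new guesses generated inside that interval for the next probe, as described under Probe Determination below.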
Histogram Sort – Merits
• Only moves the actual data once
• Deals with uneven distributions

Probe Determination
• Should be efficient, since it is done on one processor
• The processor keeps track of bounds for all splitters
• The ideal location of splitter i is (i+1)n/p
• When a histogram arrives, the splitter guesses are scanned

Probe Determination
• A splitter guess is either a success, i.e., its location is within some threshold of the ideal location, or not, in which case the bounds for the desired splitter are updated to narrow the range for the next guess
• The size of a newly generated probe depends on how many splitters are yet to be resolved: any interval containing s unachieved splitters is subdivided with s·k/u guesses, where u is the total number of unachieved splitters and k is the number of newly generated guesses

Merging and All-to-All Overlap
• For merging the p arrays at the end, either iterate through all the arrays simultaneously or merge them using a binary tree
• In the first case, all the arrays need to have arrived
• In the second case, merging can start as soon as two arrays have arrived
• Hence the merging can be overlapped with the all-to-all exchange

Radix Sort
• In every step, the algorithm puts each key in a bucket corresponding to the value of some subset of the key's bits
• A k-bit radix sort looks at k bits in every iteration
• Easy to parallelize: assign some subset of the buckets to each processor
• Load balance: assign a variable number of buckets to each processor

Radix Sort – Load Balancing
• Each processor counts how many of its keys will go to each bucket
• These histograms are summed up with reductions
• Once a processor receives the combined histogram, it can adaptively assign buckets (see the sketch below)

Radix Sort – Analysis
• Requires multiple iterations of a costly all-to-all
• Cache efficiency is low: any given key can move to any bucket irrespective of the destination of the previously indexed key
• This affects communication as well
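As a concrete illustration of the histogram-based load balancing above, here is a sequential Python sketch of one k-bit pass. The name radix_pass, the greedy contiguous-bucket assignment, and the parameter names are illustrative assumptions; in the parallel algorithm the combined histogram would be produced by a reduction over per-processor counts before the all-to-all.

```python
import random

def radix_pass(keys, shift, k_bits, n_procs):
    """One k-bit radix pass: histogram the keys on the current digit and
    adaptively assign contiguous bucket ranges to processors so that each
    receives roughly the same number of keys (sequential sketch)."""
    n_buckets = 1 << k_bits
    mask = n_buckets - 1
    # Count keys per bucket; in the parallel version these are the local
    # histograms, summed by a reduction.
    hist = [0] * n_buckets
    for key in keys:
        hist[(key >> shift) & mask] += 1
    # Adaptive (load-balanced) assignment: walk the buckets, moving to the
    # next processor once its share of ~len(keys)/n_procs keys is reached.
    owner = [0] * n_buckets
    acc, proc, target = 0, 0, len(keys) / n_procs
    for b in range(n_buckets):
        owner[b] = proc
        acc += hist[b]
        if acc >= (proc + 1) * target and proc < n_procs - 1:
            proc += 1
    # The all-to-all exchange: each key goes to the processor that owns its
    # bucket; each processor would then bucket its received keys locally
    # before the next pass on the following k bits.
    parts = [[] for _ in range(n_procs)]
    for key in keys:
        parts[owner[(key >> shift) & mask]].append(key)
    return parts

if __name__ == "__main__":
    keys = [random.randrange(1 << 20) for _ in range(10_000)]
    parts = radix_pass(keys, shift=0, k_bits=8, n_procs=4)
    print([len(part) for part in parts])   # roughly n/p keys per processor
```

Assigning contiguous runs of buckets keeps the keys globally ordered by the current digit across processors while letting the number of buckets per processor vary with the observed distribution, which is the adaptive assignment described in the load-balancing slide.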