Problem Solving Strategies
• Partitioning
  – Divide the problem into disjoint parts
  – Compute each part separately
• Divide and Conquer
  – Divide phase: Recursively create sub-problems of the same type
  – Base case reached: Execute an algorithm
  – Conquer phase: Merge the results as the recursion unwinds
  – Traditional example: Merge Sort
• Where is the work?
  – Partitioning: Creating disjoint parts of the problem
  – Divide and Conquer: Merging the separate results
  – Traditional example of partitioning: Quick Sort

Parallel Sorting Considerations
• Distributed memory
  – Precision differences between the machines of a distributed system can cause unpredictable results
  – Traditional algorithms can require excessive communication
  – Modified algorithms minimize communication requirements
  – Typically, data is scattered to the P processors
• Shared memory
  – Critical sections and mutual exclusion locks can inhibit performance
  – Modified algorithms eliminate the need for locks
  – Each processor can sort N/P data points, or the processors can work in parallel in a more fine-grained manner (no need for processor communication)

Two Related Sorts

Bubble Sort

    #include <string.h>

    /* Sort an array of N strings by swapping pointers. */
    void bubble(char *x[], int N)
    {   int sorted = 0, i, size = N - 1;
        char *temp;
        while (!sorted)
        {   sorted = 1;
            for (i = 0; i < size; i++)
            {   if (strcmp(x[i], x[i+1]) > 0)
                {   temp = x[i]; x[i] = x[i+1]; x[i+1] = temp;
                    sorted = 0;
                }
            }
            size--;    /* the largest remaining string is now in place */
        }
    }

Odd-Even Sort

    /* Stop only after an even phase and an odd phase both make no swap. */
    void oddEven(char *x[], int N)
    {   int phase, sorted = 0, i, size = N - 1;
        char *temp;
        while (!sorted)
        {   sorted = 1;
            for (phase = 0; phase <= 1; phase++)   /* 0 = even pairs, 1 = odd pairs */
            {   for (i = phase; i < size; i += 2)
                {   if (strcmp(x[i], x[i+1]) > 0)
                    {   temp = x[i]; x[i] = x[i+1]; x[i+1] = temp;
                        sorted = 0;
                    }
                }
            }
        }
    }

Sequential version: Odd-Even has no advantage over Bubble Sort
Parallel version: Processors can work independently without data conflicts

Bubble, Odd Even Example
[Figure: a Bubble pass beside an Odd Even pass]
Bubble: Smaller values move left one spot per pass. The largest value moves immediately to the end. The loop size can shrink by one each pass.
Odd Even: Large values move only one position per pass.
The loop size cannot shrink. However, all interchanges within a pass can occur in parallel.

One Parallel Iteration

Distributed Memory
• Odd Processors:
    sendRecv(pr data, pr-1 data); mergeHigh(pr data, pr-1 data)
    if (r <= P-2) { sendRecv(pr data, pr+1 data); mergeLow(pr data, pr+1 data) }
• Even Processors:
    sendRecv(pr data, pr+1 data); mergeLow(pr data, pr+1 data)
    if (r >= 1) { sendRecv(pr data, pr-1 data); mergeHigh(pr data, pr-1 data) }

Shared Memory
• Odd Processors:
    mergeLow(pr data, pr-1 data); Barrier
    if (r <= P-2) mergeHigh(pr data, pr+1 data)
    Barrier
• Even Processors:
    mergeHigh(pr data, pr+1 data); Barrier
    if (r >= 1) mergeLow(pr data, pr-1 data)
    Barrier

Notation: r = processor rank, P = number of processors, pr data = the block of data belonging to processor r
Note: P/2 iterations are necessary to complete the sort

A Distributed Memory Implementation
• Scatter the data among the available processors
• Locally sort N/P items on each processor
• Even Passes
  – Even processors, p < P-1, exchange data with processor p+1
  – Processors p and p+1 perform a partial merge where p extracts the lower half and p+1 extracts the upper half
• Odd Passes
  – Even processors, p >= 2, exchange data with processor p-1
  – Processors p and p-1 perform a partial merge where p extracts the upper half and p-1 extracts the lower half
• Exchanging Data: MPI_Sendrecv

Partial Merge – Lower keys
Store the lower n keys from arrays a and b into array c

    /* a and b each hold n sorted strings; c receives pointers to the n lowest. */
    void mergeLow(char *a[], char *b[], char *c[], int n)
    {   int countA = 0, countB = 0, countC = 0;
        while (countC < n)
        {   if (strcmp(a[countA], b[countB]) <= 0)
                c[countC++] = a[countA++];
            else
                c[countC++] = b[countB++];
        }
    }

To merge upper keys:
  1. Initialize the counts to n-1
  2. Decrement the counts instead of incrementing them
  3.
     Change the test countC < n to countC >= 0
  4. Reverse the comparison so that the larger key is copied each time

Bitonic Sequence
  [3,5,8,9,10,12,14,20] [95,90,60,40,35,23,18,0]
  [10,12,14,20] [95,90,60,40,35,23,18,0] [3,5,8,9]
Increasing and then decreasing, where the end can wrap around

Unsorted: 10,20,5,9,3,8,12,14,90,0,60,40,23,35,95,18
Step 1:   10,20  9,5  3,8  14,12  0,90  60,40  23,35  95,18
Step 2:   [9,5][10,20]  [14,12][3,8]  [0,40][60,90]  [95,35][23,18]
          5,9  10,20  14,12  8,3  0,40  60,90  95,35  23,18
Step 3:   [5,9,8,3][14,12,10,20]  [95,40,60,90][0,35,23,18]
          [5,3][8,9]  [10,12][14,20]  [95,90][60,40]  [23,35][0,18]
          3,5  8,9  10,12  14,20  95,90  60,40  35,23  18,0
Step 4:   [3,5,8,9,10,12,14,0]  [95,90,60,40,35,23,18,20]
          [3,5,8,0][10,12,14,9]  [35,23,18,20][95,90,60,40]
          [3,0][8,5]  [10,9][14,12]  [18,20][35,23]  [60,40][95,90]
Sorted:   0,3,5,8,9,10,12,14,18,20,23,35,40,60,90,95

Bitonic Sort

Bitonic Sorting Functions

    enum { DOWN = 0, UP = 1 };
    void compareExchange(int i, int j, int dir);   /* defined elsewhere */
    void bitonicMerge(int lo, int n, int dir);

    void bitonicSort(int lo, int n, int dir)
    {
       if (n>1)
       {
          int m=n/2;
          bitonicSort(lo, m, UP);        /* first half ascending */
          bitonicSort(lo+m, m, DOWN);    /* second half descending */
          bitonicMerge(lo, n, dir);      /* merge the bitonic sequence */
       }
    }

    void bitonicMerge(int lo, int n, int dir)
    {
       if (n>1)
       {
          int m=n/2;
          for (int i=lo; i<lo+m; i++)
             compareExchange(i, i+m, dir);
          bitonicMerge(lo, m, dir);
          bitonicMerge(lo+m, m, dir);
       }
    }

Notes:
  1. dir = 0 for DOWN, and 1 for UP
  2. compareExchange moves the
     a. low value left if dir = UP
     b.
        high value left if dir = DOWN

Bitonic Sort Partners/Direction
Each entry is partner/direction: L = keep the low values, H = keep the high values.

                         Algorithm Steps
    level    1  |   2     2  |   3     3     3  |   4     4     4     4
    j        0  |   0     1  |   0     1     2  |   0     1     2     3
    ---------------------------------------------------------------------
    rank  0  1/L |  2/L   1/L |  4/L   2/L   1/L |  8/L   4/L   2/L   1/L
    rank  1  0/H |  3/L   0/H |  5/L   3/L   0/H |  9/L   5/L   3/L   0/H
    rank  2  3/H |  0/H   3/L |  6/L   0/H   3/L | 10/L   6/L   0/H   3/L
    rank  3  2/L |  1/H   2/H |  7/L   1/H   2/H | 11/L   7/L   1/H   2/H
    rank  4  5/L |  6/H   5/H |  0/H   6/L   5/L | 12/L   0/H   6/L   5/L
    rank  5  4/H |  7/H   4/L |  1/H   7/L   4/H | 13/L   1/H   7/L   4/H
    rank  6  7/H |  4/L   7/H |  2/H   4/H   7/L | 14/L   2/H   4/H   7/L
    rank  7  6/L |  5/L   6/L |  3/H   5/H   6/H | 15/L   3/H   5/H   6/H
    rank  8  9/L | 10/L   9/L | 12/H  10/H   9/H |  0/H  12/L  10/L   9/L
    rank  9  8/H | 11/L   8/H | 13/H  11/H   8/L |  1/H  13/L  11/L   8/H
    rank 10 11/H |  8/H  11/L | 14/H   8/L  11/H |  2/H  14/L   8/H  11/L
    rank 11 10/L |  9/H  10/H | 15/H   9/L  10/L |  3/H  15/L   9/H  10/H
    rank 12 13/L | 14/H  13/H |  8/L  14/H  13/H |  4/H   8/H  14/L  13/L
    rank 13 12/H | 15/H  12/L |  9/L  15/H  12/L |  5/H   9/H  15/L  12/H
    rank 14 15/H | 12/L  15/H | 10/L  12/L  15/H |  6/H  10/H  12/H  15/L
    rank 15 14/L | 13/L  14/L | 11/L  13/L  14/L |  7/H  11/H  13/H  14/H

    partner = rank ^ (1<<(level-j-1));
    direction = ((rank<partner) == ((rank & (1<<level)) == 0))

Java Partner/Direction Code

    public class BitonicPartners
    {   public static void main(String[] args)
        {   int nproc = 16, partner;
            // floor(log2(nproc)) in integer arithmetic, which avoids floating-point rounding
            int levels = 31 - Integer.numberOfLeadingZeros(nproc);
            for (int rank = 0; rank < nproc; rank++)
            {   System.out.printf("rank = %2d partners = ", rank);
                for (int level = 1; level <= levels; level++)
                {   for (int j = 0; j < level; j++)
                    {   partner = rank ^ (1 << (level-j-1));
                        String dir = ((rank < partner) == ((rank & (1 << level)) == 0)) ? "L" : "H";
                        System.out.printf("%3d/%s", partner, dir);
                    }
                    if (level < levels) System.out.print(", ");
                }
                System.out.println();
            }
        }
    }

Parallel Bitonic Pseudo code

    IF master processor
        Create or retrieve data to sort
        Scatter it among all processors (including the master)
    ELSE
        Receive portion to sort
    Sort local
        data using an algorithm of preference
    FOR (level = 1; level <= lg(P); level++)
        FOR (j = 0; j < level; j++)
            partner = rank ^ (1<<(level-j-1));
            Exchange data with partner
            IF ((rank<partner) == ((rank & (1<<level)) == 0))
                extract low values from local and received data (mergeLow)
            ELSE
                extract high values from local and received data (mergeHigh)
    Gather sorted data at the master

Bucket Sort Partitioning
• Algorithm:
  – Assign a range of values to each processor
  – Each processor sorts the values assigned
  – The resulting values are forwarded to the master
• Steps
  1. Scatter N/P numbers to each processor
  2. Each processor
     a. Creates smaller buckets of numbers designated for each processor
     b. Sends the designated buckets to the various processors and receives the buckets it expects to receive
     c. Sorts its section
     d. Sends its data back to the processor with rank 0

Bucket Sort Partitioning
[Figure: unsorted numbers are distributed into buckets (sequential) or to processors P1, P2, P3, ..., Pp (parallel), then collected in sorted order]

Sequential Bucket Sort
  1. Drop sections of data to sort into buckets
  2. Sort each bucket
  3. Copy sorted bucket data back into the primary array
  4. Complexity: O(b * (n/b) lg(n/b)) = O(n lg(n/b))

Parallel Bucket Sort
Notes:
  a. Bucket Sort works well for uniformly distributed data
  b. Recursively finding medians from a data sample (Sample Sort) attempts to equalize bucket sizes

Rank (Enumeration) Sort
  1. For each number src[i], count the numbers that are smaller, plus the duplicates with a smaller index
  2. The count is the final array position for src[i]

        for (i=0; i<N; i++)
        {   count = 0;
            for (j=0; j<N; j++)
                if (src[i] > src[j] || (src[i] == src[j] && j < i)) count++;
            dest[count] = src[i];
        }

  3. Shared Memory parallel implementation
     a. Assign groups of numbers to each processor
     b. Find positions of N/P numbers in parallel

Counting Sort
Works on primitive fixed-point types: int, char, long, etc.
  1. Master scatters the data among the processors
  2. In parallel, each processor counts the occurrences of each value among its N/P data points
  3.
     Processors perform a collective sum operation
  4. Processors perform an all-to-all collective prefix sum operation
  5. In parallel, each processor stores its N/P data items appropriately in the output array
  6. Sorted data is gathered at the master processor
Note: This logic can be repeated to implement a radix sort

Merge Sort
[Figure: merge tree. Leaves P0 to P7 merge pairwise into P0, P2, P4, P6, then into P0 and P4, then into P0]
• Scatter N/P items to each processor
• Sort Phase: Each processor sorts its data with a method of choice
• Merge Phase: Data is routed and a merge is performed at each level

    for (gap=1; gap<P; gap*=2)
    {   if ((p/gap)%2 != 0) { Send data to p-gap; break; }
        else
        {   Receive data from p+gap
            Merge with local data
        }
    }

Quick Sort
• Slave computers
  – Perform the quick sort algorithm
    • Base case: if data length < threshold, send to master (rank = 0)
    • Recursive step: quick sort partitions the data
  – Request work from the master processor
    • If there is none, terminate
    • Otherwise receive data, sort it, and send it back to the master
• Master computer
  – Scatter N/P items to each processor
  – On a work request: send data to the slave, or a termination message
  – On receiving sorted data: place the data correctly in the final data list
  – When all data is sorted: save the data and terminate

Note: Distributed work pools require load balancing. Processors maintain local work pools. When the local load queue falls below a threshold, processors request work from their neighbors.
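
Example: simulating the Merge Sort merge phase
The gap loop on the Merge Sort slide can be traced on a single machine. The sketch below is only an illustration under stated assumptions: it uses int keys instead of strings, replaces each send/receive with a merge inside one shared array, and assumes every processor holds exactly n sorted keys. The names merge and merge_phase are hypothetical and do not come from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* Merge sorted arrays a (na keys) and b (nb keys) into out. */
static void merge(const int *a, int na, const int *b, int nb, int *out)
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

/* Sequential simulation of the merge phase (assumption: no real message
 * passing). data holds P sorted blocks of n keys, block p belonging to
 * rank p. At each level, every rank p that is a multiple of 2*gap plays
 * the receiver and merges in the data held by rank p+gap, which is what
 * "Receive data from p+gap; Merge with local data" does on the slide.
 * On return the whole array is sorted, as rank 0 would hold it. */
void merge_phase(int *data, int P, int n)
{
    int *tmp = malloc((size_t)P * (size_t)n * sizeof *tmp);
    if (tmp == NULL) return;
    for (int gap = 1; gap < P; gap *= 2) {
        for (int p = 0; p + gap < P; p += 2 * gap) {
            int end = (p + 2 * gap < P) ? p + 2 * gap : P;
            int lo = p * n, mid = (p + gap) * n, hi = end * n;
            merge(data + lo, mid - lo, data + mid, hi - mid, tmp);
            memcpy(data + lo, tmp, (size_t)(hi - lo) * sizeof *tmp);
        }
    }
    free(tmp);
}
```

For example, with P = 4 blocks of n = 2 keys, {3,9, 1,7, 2,8, 0,5}, calling merge_phase(data, 4, 2) leaves data as 0,1,2,3,5,7,8,9. The number of levels is lg(P), matching the height of the merge tree.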