Load Balancing of Irregular Parallel Divide-and-Conquer Algorithms in Group-SPMD Programming Environments
Mattias Eriksson, Christoph Kessler, Mikhail Chalabine
Linköping University, Sweden
Originally presented at the PASA workshop, 16 March 2006

Outline
1 Introduction: parallel quicksort; imbalance in workload
2 Load balancing: strategies for load balancing
3 Optimal local load balancing strategy: when to serialise/repivot; dynamic programming method

Parallel quicksort
A processor group selects a shared pivot π and partitions S into L = {x ∈ S | x ≤ π} and R = {x ∈ S | x ≥ π}.
The processor group then splits into two subgroups: the first sorts L, the second sorts R.

Imbalanced workload in parallel quicksort
Because subproblem sizes are data-dependent, a good assignment of processors to subproblems is not always possible.
[Figure: 1000 keys are partitioned into subproblems of 10 keys (assigned to p0) and 990 keys (assigned to p1, p2).]
Imbalance may accumulate in the parallel phase, and computing resources are wasted.
Examples of DC algorithms with data-dependent (irregular) subproblem sizes: quicksort and quickhull.
Examples of DC algorithms with fixed subproblem sizes: mergesort, FFT, and Strassen matrix multiplication.

The manager strategy
One of the processors is dedicated to coordinating the others.
+ A processor with a high workload can get help from an idle processor.
- The manager does no sorting at all.

The task queue strategy
After partitioning in the one-processor phase, one of the subtasks is put into a shared task queue. When a processor has no more work to do, it fetches a new task from the queue.
[Figure: S is partitioned into L and R; one of the subproblems is pushed to the task queue.]
+ The load imbalance will be very small.
- Managing the task queue causes overhead.

The repivoting strategy
If the processor groups would become imbalanced, a new pivot is selected.
[Figure: the 10/990 split of 1000 keys is rejected; all of p0, p1, p2 repartition the 1000 keys with a new pivot.]
+ Bad load balance between processor groups is avoided.
- When a new pivot is selected, the partitioning has to be redone.
- Does not work where the pivot is uniquely determined (e.g., quickhull).
- Exactly how do we define bad load balance?

The serialisation strategy
If the processor groups would become imbalanced, the whole processor group first sorts the first subproblem and then the other subproblem.
[Figure: 1000 keys split into 10 and 990 keys; p0, p1, p2 together sort the 10-key subproblem, then the 990-key subproblem.]
+ No load imbalance at all when serialising the execution in the parallel phase.
- The processors execute more work in the parallel phase.
- Exactly how do we define bad load balance?

Optimal local strategy: example
[Plot: expected cost versus the split point n1 for P = 2 and N = 1000, with one curve each for splitting, serialising, and repivoting.]
The optimal strategy is sometimes repivoting/serialisation and sometimes group splitting. We want to find the threshold values!
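To make the decision point concrete, the following is a minimal self-contained C sketch of the quicksort skeleton with one strategy choice per recursion step. It is not the Fork or MPI implementation from the talk: the processor group is modelled only by its size p, both branches run sequentially in one process, and should_serialise() is a hypothetical placeholder (a version calibrated with the threshold table is sketched later).

    #include <stdio.h>

    /* Placeholder decision: serialise when the smaller subproblem is tiny.
       The real test uses thresholds computed by dynamic programming;
       the 10% cutoff here is an assumption of this sketch. */
    static int should_serialise(int p, int n, int n1) {
        int small = (n1 < n - n1) ? n1 : n - n1;
        return p > 1 && small * 10 < n;
    }

    /* Lomuto partition around a[n-1]; returns the pivot's final index. */
    static int partition(int *a, int n) {
        int pivot = a[n - 1], i = 0, t;
        for (int j = 0; j < n - 1; j++)
            if (a[j] <= pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        t = a[i]; a[i] = a[n - 1]; a[n - 1] = t;
        return i;
    }

    /* Group-SPMD quicksort skeleton; p models the processor group size. */
    static void group_quicksort(int *a, int n, int p) {
        if (n <= 1) return;
        int k  = partition(a, n);
        int n1 = k, n2 = n - k - 1;        /* sizes of L and R */
        if (p == 1) {                      /* sequential phase */
            group_quicksort(a, n1, 1);
            group_quicksort(a + k + 1, n2, 1);
        } else if (should_serialise(p, n, n1)) {
            /* serialisation: the whole group sorts L, then R */
            group_quicksort(a, n1, p);
            group_quicksort(a + k + 1, n2, p);
        } else {
            /* group splitting: subgroup sizes proportional to subproblem
               sizes; in real SPMD code both calls run concurrently */
            int p1 = (int)((long)p * n1 / n);
            if (p1 < 1) p1 = 1;
            if (p1 > p - 1) p1 = p - 1;
            group_quicksort(a, n1, p1);
            group_quicksort(a + k + 1, n2, p - p1);
        }
    }

    int main(void) {
        int a[] = { 5, 3, 8, 1, 9, 2, 7, 4, 6, 0 };
        group_quicksort(a, 10, 4);
        for (int i = 0; i < 10; i++) printf("%d ", a[i]);
        printf("\n");
        return 0;
    }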
Optimal local strategy: numerical approach
Question: when is it worthwhile to serialise/repivot? We find out numerically with dynamic programming.
Dynamic programming is a method for solving problems of a recursive nature: we calculate E[T_p(n)] for increasingly large p and n, using previously calculated values, and we also record which of group splitting, repivoting, and serialising is best in each situation.
Example: 2 processors sort 1000 elements; n1 is the size of the first subproblem.
[Plot: expected cost versus n1 for p = 2 and n = 1000.]

Having a look at the results
[Diagram: the dynamic programming table over (p, n); each entry is computed from previously computed entries with smaller p and n.]

Results of dynamic programming
We found from dynamic programming that:
- On average, serialisation is always better than repivoting.
- Serialisation is better than group splitting when one of the subproblems is very small compared to the other.
[Plot: cost versus n1 for P = 7 and N = 1000000, with curves for splitting, sequentialising, and repivoting. n1 = n/2 is not the optimal split point!]

MPI thresholds
Serialise when n1/n is below the table entry for (p, n):

p \ n    10^5    10^6    10^7    10^8    10^9
 2       .156    .328    .382    .401    .407
 3       0       .216    .265    .281    .284
 4       0       .136    .191    .206    .210
 5       0       .076    .143    .159    .163
 6       0       .022    .109    .126    .131
 7       0       0       .083    .102    .107
 8       0       0       .062    .083    .088
 9       0       0       .042    .067    .073
10       0       0       .026    .054    .061
11       0       0       .011    .043    .051
12       0       0       0       .033    .042
13       0       0       0       .025    .034
14       0       0       0       .015    .027
15       0       0       0       .008    .021
16       0       0       0       0       .016

Example: consider the case p = 3 and n = 10^7. If n1/n < .265, serialisation is better than group splitting.
We implement a hybrid local heuristic, parameterized by this table (or a corresponding table for Fork); see the sketch below.
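As a sketch of how such a heuristic could consult the table at run time, replacing the placeholder test in the earlier skeleton: the threshold values are those from the table above, while the lookup scheme (clamping p and n to the table's range and rounding n down to the nearest power of ten) and the function name are assumptions of this sketch, not the authors' code.

    #include <stdio.h>
    #include <math.h>

    /* Serialisation thresholds from the dynamic programming results above.
       Rows: p = 2..16.  Columns: n = 1e5, 1e6, 1e7, 1e8, 1e9.
       Serialise when the smaller subproblem's share n1/n is below the entry. */
    static const double mpi_threshold[15][5] = {
        { .156, .328, .382, .401, .407 },  /* p = 2  */
        { 0,    .216, .265, .281, .284 },  /* p = 3  */
        { 0,    .136, .191, .206, .210 },  /* p = 4  */
        { 0,    .076, .143, .159, .163 },  /* p = 5  */
        { 0,    .022, .109, .126, .131 },  /* p = 6  */
        { 0,    0,    .083, .102, .107 },  /* p = 7  */
        { 0,    0,    .062, .083, .088 },  /* p = 8  */
        { 0,    0,    .042, .067, .073 },  /* p = 9  */
        { 0,    0,    .026, .054, .061 },  /* p = 10 */
        { 0,    0,    .011, .043, .051 },  /* p = 11 */
        { 0,    0,    0,    .033, .042 },  /* p = 12 */
        { 0,    0,    0,    .025, .034 },  /* p = 13 */
        { 0,    0,    0,    .015, .027 },  /* p = 14 */
        { 0,    0,    0,    .008, .021 },  /* p = 15 */
        { 0,    0,    0,    0,    .016 },  /* p = 16 */
    };

    /* Hypothetical lookup: clamp p to [2,16] and n to [1e5,1e9],
       then pick the column for the nearest power of ten below n. */
    int should_serialise(int p, long n, long n1) {
        if (p < 2) return 0;
        if (p > 16) p = 16;
        int col = (int)floor(log10((double)n)) - 5;
        if (col < 0) col = 0;
        if (col > 4) col = 4;
        long small = (n1 < n - n1) ? n1 : n - n1;
        return (double)small / (double)n < mpi_threshold[p - 2][col];
    }

    int main(void) {
        /* The slide's example: p = 3, n = 1e7; serialise iff n1/n < .265 */
        printf("%d\n", should_serialise(3, 10000000L, 2000000L));  /* 1 */
        printf("%d\n", should_serialise(3, 10000000L, 4000000L));  /* 0 */
        return 0;
    }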
Execution times in Fork
Speedups of quicksort running on the SB-PRAM simulator; in this example 40000 integers are sorted.
[Plot: absolute speedup versus number of processors (up to 32) for: with task queue, with repivoting, with serialisation, no load balancing.]
Here the dynamic load balancing strategy with a task queue performs better than our strategy with serialisation: on the SB-PRAM, task queue operations are nonblocking and the overhead is low.

Execution times in MPI
Speedups of quicksort on a cluster with Intel Xeon processors; in this example 100 million integers are sorted.
[Plot: absolute speedup versus number of processors (up to 32) for: serialisation, manager, no load balancing.]
Here our local load balancing strategy with serialisation outperforms the dynamic manager solution: the manager dedicates one processor to coordination and adds communication overhead.

Execution times in MPI with a better pivot
Investing more time early in better pivoting can reduce the need for load balancing.
Example: speedups of quicksort in MPI; 100 million integers are sorted, and the pivot is selected as the average of 7 elements.
[Plot: absolute speedup versus number of processors (up to 32) for: serialisation, manager, no load balancing.]

Conclusions
We have proposed and studied new strategies for local load balancing in SPMD: repivoting and serialisation.
We developed an optimal hybrid local load balancing method, calibrated with threshold tables derived from offline dynamic programming.
Our hybrid local strategy outperforms the global dynamic load balancing (manager solution) on the MPI cluster.
Global dynamic load balancing (task queue solution) outperforms our local hybrid strategy on the SB-PRAM, where synchronization is very cheap.