Lecture #3: A Primer on How to Design Parallel Algorithms
Shantanu Dutt, University of Illinois at Chicago

Parallel Algorithm Design—A Process/Data View
• Break up the computation in either a data-parallel or a functional-parallel way, depending on the problem at hand.
• Data-parallel breakup is easier to generalize than functional breakup, and more structured. It comes in two flavors:
  - Input-data parallel (e.g., the max and general associative computations, finite-element computations such as the heat-equation problem), or
  - Output-data parallel (e.g., matrix multiplication).
• Each data partition defines a computational partition, or process, that works on that partition.
• 1st attempt: break up somehow, then determine how to communicate to solve the problem.
• 2nd attempt: break up in a way that the required communication complexity (topology-independent) is minimized.
• 3rd attempt: break up so that the communication complexity on a particular topology is minimized.
• I/P data, linear or unstructured (e.g., associative functions): easy to partition into P pieces of size N/P each.
  [Figure: linear input data1, data2, ..., dataN divided into P equal partitions]
• I/P data, dimensional or structured (e.g., a finite-element grid): can be partitioned in the same dimensionality or in fewer dimensions. The choice is based on the resulting communication pattern and the target topology.
  [Figure: finite-element grid with inter-process communication shown for (a) a 2D (full-dimensional) partition and (b) a 1D (reduced-dimensional) partition]

An example parallel algorithm for a finite element computation
• Simple, structured and sparse communication is needed.
• Example: the heat equation. The initial temperature is zero on the boundaries and high in the middle; the boundary temperature is held at zero. The calculation of an element depends on its neighbor elements.
  [Figure: problem grid]

Serial code:

  repeat
    do y = 2, N-1
      do x = 2, M-1
        u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y)]
                          + cy*[u1(x,y+1) + u1(x,y-1)]   /* cx, cy are const. */
      enddo
    enddo
    u1 = u2
  until u1 ~ u2

Parallel pseudocode (the master can be one of the workers):

  find out if I am MASTER or WORKER
  if I am MASTER
    initialize array
    send each WORKER starting info and subarray
    do until all WORKERS converge
      gather from all WORKERS convergence data
      broadcast to all WORKERS convergence signal
    end do
    receive results from each WORKER
  else if I am WORKER
    receive from MASTER starting info and subarray
    do until solution converged {
      update time
      send (non-blocking?) neighbors my border info
      receive (non-blocking?) neighbors border info
      update interior of my portion of solution array (see the serial code)
      wait for non-blocking commun. (if any) to complete
      update border of my portion of solution array
      determine if my solution has converged
      if so { send MASTER convergence signal
              recv. from MASTER convergence signal }
    } end do
    send MASTER results
  endif

  Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur
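To make the worker pseudocode concrete, here is a minimal sketch of the WORKER loop in C with MPI. It assumes a 1D row-block partition of an M-column grid (each worker holds `rows` interior rows plus one ghost row above and below) and uses MPI_Allreduce in place of the MASTER's gather/broadcast convergence exchange; all function and variable names are illustrative, not from the lecture.

  /* Sketch of the worker loop from the pseudocode above (C + MPI).
     Assumes rows >= 2; edge workers get MPI_PROC_NULL neighbors, so their
     exchanges are no-ops and the ghost row keeps the fixed boundary values. */
  #include <mpi.h>
  #include <math.h>

  #define IDX(x, y, M) ((y) * (M) + (x))   /* row-major indexing helper */

  /* One Jacobi sweep over rows [y0, y1), per the serial-code update;
     returns the max change, used for the convergence test. */
  static double sweep(const double *u1, double *u2, int M, int y0, int y1,
                      double cx, double cy) {
      double diff = 0.0;
      for (int y = y0; y < y1; y++)
          for (int x = 1; x < M - 1; x++) {
              u2[IDX(x, y, M)] = u1[IDX(x, y, M)]
                  + cx * (u1[IDX(x + 1, y, M)] + u1[IDX(x - 1, y, M)])
                  + cy * (u1[IDX(x, y + 1, M)] + u1[IDX(x, y - 1, M)]);
              diff = fmax(diff, fabs(u2[IDX(x, y, M)] - u1[IDX(x, y, M)]));
          }
      return diff;
  }

  void worker_loop(double *u1, double *u2, int M, int rows,
                   double cx, double cy, double tol, MPI_Comm comm) {
      int rank, P;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &P);
      int up   = (rank > 0)     ? rank - 1 : MPI_PROC_NULL;
      int down = (rank < P - 1) ? rank + 1 : MPI_PROC_NULL;
      int done = 0;

      while (!done) {
          MPI_Request req[4];
          /* Non-blocking border exchange: "send/receive (non-blocking?)
             neighbors border info". Ghost rows are y = 0 and y = rows+1. */
          MPI_Irecv(&u1[IDX(0, 0, M)],        M, MPI_DOUBLE, up,   0, comm, &req[0]);
          MPI_Irecv(&u1[IDX(0, rows + 1, M)], M, MPI_DOUBLE, down, 1, comm, &req[1]);
          MPI_Isend(&u1[IDX(0, 1, M)],        M, MPI_DOUBLE, up,   1, comm, &req[2]);
          MPI_Isend(&u1[IDX(0, rows, M)],     M, MPI_DOUBLE, down, 0, comm, &req[3]);

          /* Update the interior rows while the borders are in flight. */
          double diff = sweep(u1, u2, M, 2, rows, cx, cy);

          MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

          /* Ghost rows are now valid: update the two border rows. */
          diff = fmax(diff, sweep(u1, u2, M, 1, 2, cx, cy));
          diff = fmax(diff, sweep(u1, u2, M, rows, rows + 1, cx, cy));

          /* Global convergence test; stands in for the MASTER's
             gather of convergence data and broadcast of the signal. */
          double gdiff;
          MPI_Allreduce(&diff, &gdiff, 1, MPI_DOUBLE, MPI_MAX, comm);
          done = (gdiff < tol);

          double *tmp = u1; u1 = u2; u2 = tmp;   /* u1 = u2 for next iter */
      }
  }

The non-blocking sends/receives let the interior update overlap the border exchange, which is exactly why the pseudocode splits the sweep into "update interior", "wait", "update border".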
Parallel Algorithm Design—A Process/Data View: Dimensional Partitioning
  [Figure: finite-element grid with inter-process communication shown for (a) a 2D (full-dimensional) partition and (b) a 1D (reduced-dimensional) partition]
• A full-dimensional partition has the smallest circumference-to-area ratio (circumference = amount of data that must be communicated to neighboring processes; area = total data elements per process = N/P), and thus the least topology-independent communication cost.
• A 2D partition (Fig. a) of a 2D grid of n×n points, where n = sqrt(N), gives [n/sqrt(P)] × [n/sqrt(P)] partitions. Thus the total data that needs to be communicated (per iteration of a finite-element computation) ranges from 2*[n/sqrt(P)] (corner partitions) to 4*[n/sqrt(P)] (interior partitions).
• A 1D partition (Fig. b) of a 2D grid of n×n points gives n × (n/P) partitions, requiring n to 2n data to be communicated, which is about sqrt(P)/2 times more data than for a 2D partition.
• However, once the topology of the target parallel system is considered, a 1D partition is suitable for both a linear and a 2D topology, while a 2D partition is suitable for a 2D topology but expensive for a linear topology in terms of average message hops (and thus also in terms of contention/conflicts).

An example parallel algorithm for a finite element computation: Analysis
• Let L be the maximum distance of a grid point from the heat source. The number of iterations for convergence will be some function f(L) of L.
• Assume a linear partition of the input data and near-neighbor communication only (possible on most topologies for a linear data partitioning).
• Per-iteration computation time per process = Θ(N/P). Communication time per iteration per process = Θ(sqrt(N)).
• All computations in an iteration occur in parallel (there is no data dependency among the computations in each process within the same iteration). Communication also occurs completely in parallel in an iteration, and near-neighbor communication means no contention. So the parallel time per iteration is Θ(N/P + sqrt(N)) [sometimes also written as Θ(max(N/P, sqrt(N)))].
• Number of parallel iterations? Fill in:
  Parallel time Tp(P) = ?
  Sequential time Tseq = Tp(1) = ?
  Speedup Sp(P) = ?
  Efficiency Ep(P) = ?

Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
  [Figure: (a) 2D (full-dimensional) partition with processes 1..8 mapped to the data in row-major order; (b) 1D (reduced-dimensional) partition; (c) communication patterns when mapping a 2D data partition (orange, with 4-hop messages) and a 1D data partition (blue) to a linear processor array]
• Assumptions: N total grid points in a sqrt(N) × sqrt(N) arrangement.
• Communication time = ts + k*th + m*tw, where ts is the startup time at the source and target processors, k is the number of hops, th the decision time per switch, m the message size in "flits" (the number of bits that can be transmitted in parallel), and tw the propagation delay per flit. Since ts does not affect congestion, for congestion/delay purposes we approximate the communication time here by k*th + m*tw.
• Analysis, 1D data partition on a linear topology: all communication is near-neighbor, hence parallel within an iteration and without contention. Parallel communication time per iteration of a finite-element computation = th + sqrt(N)*tw ~ Θ(sqrt(N)) for large N.
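As a rough way to experiment with this cost model, the small C program below evaluates k*th + m*tw for both partitionings mapped to a linear array, generalized from the slide's P = 8 case to a square sqrt(P) × sqrt(P) process grid; the parameter values and the generalization itself are illustrative assumptions, not numbers from the lecture.

  /* Calculator for the slide's cost model (ts dropped, as in the text):
     time = k*th + m*tw for a k-hop message of m flits. */
  #include <stdio.h>
  #include <math.h>

  static double msg_time(double k, double m, double th, double tw) {
      return k * th + m * tw;   /* per-hop switch delay + per-flit transfer */
  }

  int main(void) {
      double N = 1 << 20;           /* total grid points, sqrt(N) x sqrt(N) */
      double P = 16;                /* number of processes (assumed square) */
      double th = 50.0, tw = 1.0;   /* assumed per-hop and per-flit costs   */
      double n = sqrt(N);

      /* 1D partition on a linear array: one near-neighbor (1-hop)
         exchange of a full border of n points per side. */
      double t1d = msg_time(1, n, th, tw);

      /* 2D partition on a linear array: a near-neighbor border of
         n/sqrt(P) points, plus sqrt(P)-hop second-dimension messages
         that contend on intersecting routes and are serialized about
         sqrt(P) times (the 4*(4*th + ...) term on the slide for P = 8). */
      double border = n / sqrt(P);
      double t2d = msg_time(1, border, th, tw)
                 + sqrt(P) * msg_time(sqrt(P), border, th, tw);

      printf("1D partition, linear array: %.0f\n", t1d);
      printf("2D partition, linear array: %.0f\n", t2d);
      return 0;
  }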
• Analysis, 2D data partition on a linear topology: the near-neighbor part of the communication is parallel and without contention, so its parallel communication time per iteration of the FE computation = th + (sqrt(N)/2)*tw = Θ(sqrt(N)/2) for large N. Second-dimension communication, however, travels a distance of 4 hops (generally about sqrt(P) hops): the message from 4 → 8 takes time 4*th + (sqrt(N)/4)*tw ~ Θ(sqrt(N)/4) for large N. Assuming cut-through/wormhole routing, in which an entire path is first reserved and the data then passes through multiple interconnects as if over a single interconnect, the messages 3 → 7, 2 → 6 and 1 → 5 are blocked until the 4 → 8 communication completes. After it finishes, say, the 3 → 7 communication occurs while the others remain blocked. Communication is thus sequentialized for 4-hop messages along intersecting routes: communication along dimension 2 of the data is not fully parallel (the messages are started in parallel but get sequentialized by contention) and takes a total parallel time of 4*(4*th + (sqrt(N)/4)*tw) ~ Θ(sqrt(N)) for large N. This is the same order as for a 1D partition, but actually more time in the th terms (16*th vs. th). The total parallel communication time is Θ(1.5*sqrt(N)), worse than for 1D partitioning. What would this be for a general P?

Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor
  [Figure: (a) 2D (full-dimensional) partition; (b) 1D (reduced-dimensional) partition; (c) communication patterns when mapping a 2D data partition (orange), a 1D data partition (blue), and the corresponding processes to a 2D array/mesh. The latter is equivalent to embedding a linear array onto a 2D array so as to minimize average dilation, i.e., the factor by which the length of a near-neighbor communication/link increases in the host/target topology.]
• Analysis, 1D data partition on a 2D topology: all communication is near-neighbor, hence parallel within an iteration and without contention. Parallel communication time per iteration of a finite-element computation = th + sqrt(N)*tw ~ Θ(sqrt(N)) for large N.
• Analysis, 2D data partition on a 2D topology: all communication is near-neighbor and hence without contention; however, each processor has two communication steps that are partly sequentialized.
  - Worst case: parallel communication time per iteration of a FE computation = th + (sqrt(N)/2)*tw + th + (sqrt(N)/4)*tw ~ Θ(0.75*sqrt(N)) for large N.
  - Best (close to average) case: the bulk of both communications, the data transfer, takes place in parallel and only the startup time is sequentialized. So the parallel communication time per iteration = max(th + (sqrt(N)/2)*tw, th + (sqrt(N)/4)*tw) ~ Θ(0.5*sqrt(N)) for large N. For general P?

Parallel Algorithm Design—A Process/Data View: O/P Data Partitioning
• Output data partitioning, e.g., matrix multiplication A × B = C: each process computes a block of C (one or more Cij's).
• Issues of input data distribution among the processes:
  - If, to compute element Cij, row Ai and column Bj are already available at the process, then no communication is needed to move input data to the right processes.
  - If each process computing a block of C instead holds similar blocks of the A and B data, then data must be moved around the processes so that the one computing Cij has Ai and Bj available. This requires partial all-to-all broadcasts; see below.
• Each process broadcasts/multicasts the block of A data it holds to the processes in its "row", and similarly its block of B data to the processes in its "column".
• After this stage, every process computing an element Cij has row Ai and column Bj.
• The time complexity of this phase depends on the interconnection topology of the multiprocessor.
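A sketch of that broadcast stage in C with MPI follows, assuming P is a perfect square, processes form a q × q grid (q = sqrt(P)), and each process starts with one contiguous b × b block of A and of B. MPI_Allgather over row and column communicators implements "every process broadcasts its block to its row/column" in one call; the function name and data layout are illustrative, not from the lecture.

  #include <mpi.h>
  #include <math.h>

  void gather_row_and_col(const double *Ablk, const double *Bblk, int b,
                          double *Arow,  /* out: q blocks = A's block-row */
                          double *Bcol,  /* out: q blocks = B's block-col */
                          MPI_Comm comm) {
      int rank, P;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &P);
      int q = (int)(sqrt((double)P) + 0.5);
      int row = rank / q, col = rank % q;  /* position in the process grid */

      /* Split the processes into per-row and per-column communicators. */
      MPI_Comm rowc, colc;
      MPI_Comm_split(comm, row, col, &rowc);
      MPI_Comm_split(comm, col, row, &colc);

      /* All q "broadcasts" within a row (resp. column) at once: after
         these calls this process holds row band A_i and column band B_j. */
      MPI_Allgather(Ablk, b * b, MPI_DOUBLE, Arow, b * b, MPI_DOUBLE, rowc);
      MPI_Allgather(Bblk, b * b, MPI_DOUBLE, Bcol, b * b, MPI_DOUBLE, colc);

      /* The C block now needs no further communication:
         C_ij = sum over k of A_ik * B_kj, computed locally. */
      MPI_Comm_free(&rowc);
      MPI_Comm_free(&colc);
  }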
Parallel Algorithm Design—Strategy Summary
• Data partitioning: defines the processes; based partly on the amount and pattern of communication needed and on the target topology.
• Determine the communication pattern required.
• Granularity of breakup: N/P for input-data-parallel or M/P for output-data-parallel breakups can be decided either from a given P (straightforward) or from the overheads (with P itself to be determined), so as to optimize the parallel time Tp, the efficiency Ep, or the work efficiency WEp (see below).
• Map processes to processors in the target topology so as to:
  - reduce average message hops (try to maximize near-neighbor communication);
  - maximize parallelization of message passing across processes;
  - minimize conflicts among message paths (conflict depends partly on the type of routing/"flow control", store-and-forward or wormhole/cut-through; we assumed the latter in the previous analysis).
• Overall goals:
  - minimize the parallel time Tp (i.e., maximize the speedup Sp);
  - maximize the efficiency Ep = Sp/P;
  - maximize the work efficiency WEp = Sp/WSp, where WSp is the total work scale-up (correlated with total power consumed) of the P processors relative to the total sequential work Ws in the computation: WSp = Wp/Ws, with Wp the total work done across the P processors (= total time summed across the P procs).
• More on parallelization metrics later.

MIMD/SPMD Coding Style

  Loop
    Do some computation; send data and/or receive data
    Do some more computation; send data and/or receive data
    etc.
  End Loop
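As an illustration of this style, here is a minimal self-contained C + MPI skeleton that alternates computation and communication inside a loop; the printed "phases" and the ring exchange are placeholders for whatever work and data movement an application actually needs.

  #include <mpi.h>
  #include <stdio.h>

  /* Placeholder compute phase; a real application does its own work here. */
  static void compute(int rank, const char *phase) {
      printf("rank %d: %s\n", rank, phase);
  }

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, P;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* same program, own data */
      MPI_Comm_size(MPI_COMM_WORLD, &P);
      int right = (rank + 1) % P, left = (rank + P - 1) % P;
      int token = rank, in = 0;

      for (int iter = 0; iter < 3; iter++) {  /* "Loop ... End Loop" */
          compute(rank, "some computation");
          /* send data and/or receive data: a ring exchange to the right */
          MPI_Sendrecv(&token, 1, MPI_INT, right, 0, &in, 1, MPI_INT,
                       left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          token = in;   /* pass the received value along next time */
          compute(rank, "some more computation");
          /* and back to the left */
          MPI_Sendrecv(&token, 1, MPI_INT, left, 1, &in, 1, MPI_INT,
                       right, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          token = in;
      }
      MPI_Finalize();
      return 0;
  }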