Lecture 2: Parallel Reduction Algorithms & Their Analysis on Different Interconnection Topologies
Shantanu Dutt, Univ. of Illinois at Chicago

An example of an SPMD message-passing parallel program
SPMD message-passing parallel program (contd.)
[Slides: code listing for an example SPMD message-passing parallel program (a max computation); the listing includes a partner computation of the form "node xor D".]

Reduction Computations & Their Parallelization
The prior max computation is a reduction computation, defined as x = f(D), where D is a data set (e.g., a vector) and x is a scalar quantity. In the max computation, f is the max function and D is the set/vector of numbers whose max is to be computed.
Reduction computations that are associative [defined as f(a,b,c) = f(f(a,b), c) = f(a, f(b,c))] can be easily parallelized using a recursive reduction approach (the final value f(D) needs to be at some processor at the end of the parallel computation):
- The data set D is evenly distributed among P processors: each processor Pi has a disjoint subset Di of |D|/P data elements.
- Each processor Pi performs the computation f(Di) on its local data set Di.
- Due to associativity, the final result = f(f(D1), f(D2), ..., f(DP)). This is achieved via communication between the processors and f computations over successively larger unions of the Di's. A natural communication pattern is that of a binary tree, but a more general pattern is described below.
- Each processor then engages in log P rounds of message passing. In the k'th round, Pi communicates with a unique partner processor Pj = partner(Pi, k), to which it sends, or from which it receives, the current partial f result (depending, say, on whether its id is greater than or less than Pj's, respectively).
- If Pi receives a partial result b from Pj in the k'th round, it computes a = f(a, b), where a is its current result, and participates in the (k+1)'th round of communication. If Pi has instead sent its result to Pj, it does not participate in any further rounds of communication and computation; it is done with its task.
- At the end of the log P rounds of communication, the processor with the least id (= 0) will hold f(D).

Reduction Computations & Their Parallelization (contd.)
Assuming (Pi, Pj), where Pj = partner(Pi, k), is a unique send-recv pair in round k (the pairs may or may not be disjoint from each other), the number of processors holding the required partial results halves after each round; hence the term recursive halving for such parallel computations.
In general, there are other variations of parallel reduction computations (generally dictated by the interconnection topology) in which the number of processors reduces by some factor other than 2 in each round of communication. The general term for such parallel computations is recursive reduction. A topology-independent recursive-halving communication pattern is shown below.
Note also that as the number of processors involved halves, the number of initial data sets that each "active" partial result represents/covers doubles (a recursive doubling of coverage of data sets by each active partial result). The total number of messages sent is P-1 (why?).
Basic metrics of performance we will use initially: parallel time, speedup (seq. time / parallel time), efficiency (speedup / P, where P = # of procs.), and total number or size of messages.
[Figure: topology-independent recursive-halving communication pattern among processors 0-7: four messages in time step 1, two in time step 2, and one in time step 3, for P-1 = 7 messages in all.]

Reduction Computations & Their Parallelization (contd.)
A variation of a reduction computation is one in which each processor needs to have the f(D) value.
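Before turning to that variation, note that the recursive-halving scheme above maps directly onto an SPMD message-passing program. The following is only a minimal sketch, assuming an MPI implementation, a power-of-2 processor count, and f = max; the variable names, tags, and the toy local value are illustrative and not taken from the original slides.

/* A minimal sketch of the recursive-halving reduction described above,
   assuming MPI, P = a power of 2, and f = max.  In round k, processor i
   pairs with partner(i, k) = i XOR 2^k; the higher-ranked processor sends
   its partial result and drops out, the lower-ranked one receives,
   combines, and goes on to round k+1.  After log P rounds, rank 0 holds
   f(D).                                                                 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Step 1: each processor reduces its own block D_i locally.
       Here a toy local value stands in for f(D_i).                      */
    double local = (double)rank;

    /* Step 2: log P rounds of message passing (recursive halving).      */
    int active = 1;
    for (int k = 0; active && (1 << k) < P; k++) {
        int partner = rank ^ (1 << k);
        if (rank > partner) {
            /* Higher id: send current partial result and retire.        */
            MPI_Send(&local, 1, MPI_DOUBLE, partner, k, MPI_COMM_WORLD);
            active = 0;
        } else {
            /* Lower id: receive b and compute a = f(a, b).               */
            double b;
            MPI_Recv(&b, 1, MPI_DOUBLE, partner, k, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (b > local) local = b;
        }
    }

    if (rank == 0)
        printf("f(D) = max = %g\n", local);
    MPI_Finalize();
    return 0;
}

On a hypercube, i XOR 2^k is a direct (1-hop) neighbor of i in round k, so each of the log P messages above is a single-hop, conflict-free message; this is exactly the Theta(log P) communication cost analyzed for the hypercube below.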
In this case, instead of a send from Pj to its partner Pi in the k'th round (assuming id(Pi) < id(Pj)), Pi and Pj exchange their respective partial results a and b, each computes f(a,b), and each engages in the (k+1)'th round of communication with a different partner. Thus there is no recursive halving of the number of processors involved in successive rounds (the number of participating processors always remains P). However, the number of initial data sets that each active partial result covers recursively doubles in each round, just as in the recursive-halving computation. The "exchange" communication pattern for a reduction computation is shown below (the total number of messages sent is P log P); a code sketch of this variant appears at the end of this section.
[Figure: exchange communication pattern among processors 0-7; every processor both sends and receives in each of the three time steps.]

Analysis of Parallel Reduction on Different Topologies
Recursive-halving-based reduction on a hypercube:
- Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
- Communication time = Theta(log P): there are log P message-passing rounds, in each round all messages are sent in parallel, each message is a 1-hop message, and there is no conflict among messages.
- Computation time during the communication rounds = Theta(log P) [1 reduction operation in each receiving processor in each round]. Note that computation and communication are sequentialized (i.e., there is no overlap between them), so the total time is the sum of the two.
- The exchange communication on a hypercube has the same computation and communication times.
- Speedup S(P) = Seq._time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta(2 log P)] ~ Theta(P) if N >> P.
[Figure: (a) hypercubes of dimensions 0 to 4; (b) message pattern for a reduction computation using recursive halving on a 3-dimensional hypercube, with edges labeled by time step; processor 000 will hold the final result; (c) message pattern for a reduction computation using exchange communication; all processors will hold the final result.]

Analysis of Parallel Reduction on Different Topologies (contd.)
Recursive reduction on a direct tree:
- Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
- Communication time = Theta(log(P/2)): there are log((P+1)/2) message-passing rounds, in each round all messages are sent in parallel, each message is a 1-hop message, and there is no conflict among messages.
- Computation time during the communication rounds = Theta(2 log(P/2)) [2 reduction operations in the "parent" processor in each round] = Theta(2 log P). Again, computation and communication are not overlapped.
- Speedup S(P) = Seq_time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta(3 log P)] ~ Theta(P) if N >> P.
[Figure: recursive reduction in (a) a direct tree network and (b) an indirect tree network; edges are labeled with the round number and, for the indirect tree, the number of hops per message (round 1: 2 hops, round 2: 4 hops).]

Analysis of Parallel Reduction on Different Topologies (contd.)
Recursive reduction on an indirect tree:
- Initial computation time = Theta(N/P); N = # of data items, P = # of processors.
- Communication time = 2 + 4 + 6 + ... + 2(log P) = 2 * (log P)((log P) + 1)/2 = (log P)((log P) + 1) = Theta((log P)^2): there are log P message-passing rounds, in round k all messages are sent in parallel, each message is a (2k)-hop message, and there is no conflict among messages.
- Computation time during the communication rounds = Theta(log P) [1 reduction operation in each receiving processor in each round].
- Speedup S(P) = Seq_time / Parallel_time(P) = Theta(N) / [Theta(N/P) + Theta((log P)^2) + Theta(log P)] ~ Theta(P) if N >> P.
(See the same direct/indirect tree figure above.)
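The exchange variant described earlier, in which every processor ends up holding f(D), maps onto the same SPMD message-passing style. The following is again only a minimal sketch, assuming MPI, a power-of-2 processor count, and f = max; names and the toy local value are illustrative, not from the original slides.

/* A minimal sketch of the exchange variant: in round k every processor i
   swaps its partial result with partner(i, k) = i XOR 2^k and both sides
   compute f(a, b), so all P processors stay active and every one of them
   ends up holding f(D); total messages = P log P.                        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    double local = (double)rank;               /* stand-in for f(D_i)      */

    for (int k = 0; (1 << k) < P; k++) {       /* log P rounds, nobody retires */
        int partner = rank ^ (1 << k);
        double b;
        MPI_Sendrecv(&local, 1, MPI_DOUBLE, partner, k,
                     &b,     1, MPI_DOUBLE, partner, k,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (b > local) local = b;              /* both sides compute f(a, b) */
    }

    printf("rank %d holds f(D) = max = %g\n", rank, local);
    MPI_Finalize();
    return 0;
}

On a hypercube the partner i XOR 2^k is again a 1-hop neighbor in every round, which is why the exchange pattern has the same computation and communication times as recursive halving, as noted in the hypercube analysis above.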