Lecture 11: Parallel Processing of Irregular Computations & Load Balancing
Shantanu Dutt
ECE Dept., UIC

Discrete Event Simulation: Basics with VHDL Descriptions as an Example

VHDL Dataflow Description of a Circuit:

    library IEEE;
    use IEEE.STD_LOGIC_1164.all;

    entity ckt1 is
      port(s1, s2: in bit; Z: out bit);
    end entity ckt1;

    architecture data_flow of ckt1 is
      signal sbar1, sbar2, x, y: bit;
    begin
      sbar1 <= not s1 after 2 ns;
      sbar2 <= not s2 after 2 ns;
      x <= s1 and sbar2 after 4 ns;
      y <= s2 and sbar1 after 4 ns;
      Z <= x or y after 4 ns;
    end architecture data_flow;

Discrete Event Simulation: Basics

Parallel DES for Logic Simulation

Correctness Issues in Parallel DES
• What happens if inter-processor msgs. are received out of simulation-time order? In other words, if a msg. w/ simulation time ti is received before a msg. w/ simulation time tj, where ti > tj, then what happens? The sim. time ti and tj msgs. could be coming from the same processor or from different processors.
• If a proc. "blindly" processes all msgs. as they arrive, this can lead to incorrect simulation. E.g., the sim. time tj msg. can cause an output that affects the input to the process for the sim. time ti msg. in the above example. So if the earlier-arriving sim. time ti msg. is processed before the later-arriving sim. time tj msg., the output of the former simulation will likely be incorrect.

Correctness Issues in Parallel DES: Solutions
• For each msg. sent from processor Pk targeting a (simulation) process Qr (which is, say, in processor Pq), Pk records the sim. time tq of the latest such msg. When sending the next msg. targeting Qr, Pk also includes this prev. sim. time along w/ the current one, tj.
• So the msg. data looks like Mj = (input value, tj [curr. sim. time], tq [prev. sim. time]).
• The next msg. is Mi = (input value, ti, tj).
• Receiving proc. Pq also records the sim.
time tq of the last msg. received for each input of Qr. If a new msg. meant for that input of Qr carries a prev. sim. time equal to the one Pq has recorded, then that msg. is correct in terms of timing order. Otherwise, Pq stores the msg. and waits for the earlier msg. of correct timing order that it has not yet recvd.
• So if, for the input, say, A, of Qr, the recorded time of the prev. simulation is tq, and msg. Mi = (value, ti, tj) is recvd., it will not be processed. Only after msg. Mj = (value, tj, tq) is recvd. and processed will msg. Mi be processed (since the latest recorded sim. time for i/p A of Qr is then tj).
• With regard to msgs. from multiple processors, Pq will not perform any simulation until it has recvd. timing-correct msgs. (e.g., Mj above) from all procs. that are supposed to send it msgs. This underscores the importance of null msgs., w/o which simulation will not proceed further in this approach.

Some examples of applications requiring DES

Search Techniques

[Figure: example graph on nodes A-G and its DFS tree w/ visit order. DFS follows the black arcs and Soln_DFS the black+red arcs. Soln found: (A,B,E,C,F), which meets some criterion.]

dfs(v) /* for basic graph visit or for soln finding when nodes are partial or full solns */
    v.mark = 1;
    for each (v,u) in E
        if (u.mark != 1) then dfs(u)
    end for;

Algorithm Depth_First_Search_Soln
    for each v in V
        v.mark = 0;
    if G has partial soln nodes then
        for each v in V
            if v.mark = 0 then dfs(v);
        end for;
    else soln_dfs(root); /* root is a particular node in V from where we can start the solution search */

[Figure: BFS visit order on the same graph.]

soln_dfs(v) /* used when nodes are basic elts of the problem and not partial soln nodes, and a soln.
is a path */
    v.mark = 1;
    if path to v is a soln then return(1);
    for each (v,u) in E
        if (u.mark != 1) then
            soln_found = soln_dfs(u);
            if (soln_found = 1) then return(soln_found);
    end for;
    v.mark = 0; /* can visit v again to form another soln on a different path */
    return(0)

Search Techniques: Exhaustive DFS

optimal_soln_dfs(v) /* used when nodes are basic elts of the problem and not partial soln nodes, and a soln. is a path */
begin
    v.mark = 1;
    if path to v is a soln then begin
        if cost < best_cost then begin
            best_soln = soln; best_cost = cost;
        endif
        v.mark = 0; return;
    endif
    for each (v,u) in E
        if (u.mark != 1) then
            cost = cost + edge_cost(v,u); /* global var. */
            optimal_soln_dfs(u);
            cost = cost - edge_cost(v,u); /* restore the global cost when backtracking */
    end for;
    v.mark = 0; /* can visit v again to form another soln on a different path */
end

Algorithm Depth_First_Search_Opt_Soln
    for each v in V
        v.mark = 0;
    best_cost = infinity; cost = 0;
    optimal_soln_dfs(root);

[Figure: the example graph on nodes A-G. DFS follows the black arcs, Soln_DFS the black+red arcs, and Optimal_Soln_DFS the black+red+green arcs. Soln found: (A,B,E,C,F); best soln. so far: (A,C,E,D,F,G).]

Y = partial soln. = a path from root to the current "node" (a basic elt. of the problem, e.g., a city in TSP, a vertex in V0 or V1 in min-cut partitioning). We go from each such "node" u to a next one that is "reachable" from u in the problem "graph" (which is part of what you have to formulate).

[Figure: search tree from root through node u, w/ node costs 10, 12, 15, 19, 16, 18, 18, 17 and labels (1), (2), (3).]

Expand_&_est_cost(Y)
begin
    children = nullset;
    for each basic elt x of problem "reachable" from Y do begin
        if x not in Y and Y U {x} is feasible then begin
            child = Y U {x};
            path_cost(child) = path_cost(Y) + cost(Y, x); /* cost(Y, x) is cost of reaching x from Y */
            est(child) = lower bound cost of best soln reachable from child;
            cost(child) = path_cost(child) + est(child);
            children = children U {child};
        endif
    endfor
    return(children);
end /* Expand_&_est_cost(Y) */

Best-First Search

BeFS(root)
begin
    open = {root} /* open is list of gen.
but not expanded nodes, i.e., partial solns */
    best_soln_cost = infinity;
    while open != nullset do begin
        curr = first(open); remove curr from open;
        if curr is a soln then return(curr) /* curr is an optimal soln */
        else children = Expand_&_est_cost(curr);
            /* generate all children of curr & estimate their costs; cost(u) should be a lower bound of cost of the best soln reachable from u */
        for each child in children do begin
            if child is a soln then
                delete all nodes w in open s.t. cost(w) >= cost(child);
            endif
            store child in open in increasing order of cost;
        endfor
    endwhile
end /* BeFS */

[Figure: search tree for Y = partial soln., from root through node u, w/ node costs 10, 12, 15, 19, 16, 18, 18, 17 and labels (1), (2), (3).]

Proof of optimality when cost is a LB
• The current set of nodes in "open" represents a complete front of generated nodes, i.e., all the ungenerated nodes in the search space are descendants of nodes in "open".
• If the first node curr in "open" is a soln, then cost(curr) <= cost(w) for each w in "open".
• The cost of any solution node in the search space that is not in "open" and not yet generated is >= the cost of its ancestor in "open" (cost is a lower bound) and thus >= cost(curr). Thus curr is the optimal (min-cost) soln.

Search techs for a TSP example

[Figure: TSP graph on cities A-F w/ edge costs, and the DFS search tree rooted at A; solution nodes have costs 27, 31 and 33.]

Exhaustive search using DFS (w/ backtrack) for finding an optimal solution

Search techs for a TSP example (contd)

[Figure: the same TSP graph and the BeFS search tree rooted at A, w/ nodes labeled path cost + estimate (5+15, 8+16, 21+6, 11+14, 22+9).]

Path cost for (A,E,F) = 8. MST for node (A,E,F) = MST({F,A,B,C,D}); cost = 16.

• Lower-bound cost estimate: MST({unvisited cities} U {current city} U {start city})
• It is a LB because the set of structures it minimizes over (spanning trees) is a superset of the set of reqd soln structures (the paths that complete the cycle)
• min(metric M's values in set S) <= min(M's values in subset S')
• Does a similar relation hold for max?
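The lower-bound estimate in the bullets above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-city distance table is hypothetical (it is not the lecture's TSP graph), the names `mst_cost` and `tsp_lower_bound` are introduced here, and Prim's algorithm is just one convenient way to get the MST cost.

```python
import heapq

def mst_cost(nodes, dist):
    """Total weight of a minimum spanning tree over `nodes` (Prim's algorithm)."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0
    in_tree = {nodes[0]}
    # Candidate edges leaving the tree, ordered by weight.
    heap = [(dist[nodes[0]][v], v) for v in nodes[1:]]
    heapq.heapify(heap)
    total = 0
    while len(in_tree) < len(nodes):
        w, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was already attached by a cheaper edge
        in_tree.add(v)
        total += w
        for u in nodes:
            if u not in in_tree:
                heapq.heappush(heap, (dist[v][u], u))
    return total

def tsp_lower_bound(path, cities, dist):
    """path_cost(path) + MST({unvisited cities} U {current city} U {start city})."""
    path_cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
    remaining = (set(cities) - set(path)) | {path[-1], path[0]}
    return path_cost + mst_cost(remaining, dist)

# Hypothetical symmetric distances (an assumption, not the lecture's graph).
edges = {("A", "B"): 1, ("A", "C"): 4, ("A", "D"): 3,
         ("B", "C"): 2, ("B", "D"): 5, ("C", "D"): 1}
cities = ["A", "B", "C", "D"]
dist = {c: {} for c in cities}
for (a, b), w in edges.items():
    dist[a][b] = w
    dist[b][a] = w

print(tsp_lower_bound(["A"], cities, dist))            # 4
print(tsp_lower_bound(["A", "B", "C"], cities, dist))  # 7
```

In this toy instance the best tour is A-B-C-D-A with cost 7, so the estimate at the root (4) is a strict lower bound and the estimate at (A,B,C) is already tight.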
[Figure (contd): BeFS search tree for the TSP example, w/ nodes labeled path cost + estimate (23+8, 14+9), nodes marked X, and solution nodes of cost 27 and 20.]

BeFS for finding an optimal TSP solution

[Figure: Venn diagram w/ S' inside S.] S = set of all spanning trees in a graph G; S' = set of all Hamiltonian paths (paths that visit each node exactly once) in G.

BeFS for 0/1 ILP Solution
• X = {x1, …, xm} are 0/1 vars
• Choose vars xi = 0/1 as next nodes in some order (random or heuristic based)

    root (no vars exp.)
      x2 = 1: solve LP w/ x2=1; cost = cost(LP) = C2
        x4 = 1: solve LP w/ x2=1, x4=1; cost = cost(LP) = C4
          x5 = 0: solve LP w/ x2=1, x4=1, x5=0; cost = cost(LP) = C5
          x5 = 1: solve LP w/ x2=1, x4=1, x5=1; cost = cost(LP) = C6
        x4 = 0: solve LP w/ x2=1, x4=0; cost = cost(LP) = C3
      x2 = 0: solve LP w/ x2=0; cost = cost(LP) = C1

Cost relations: C5 < C3 < C1 < C6; C2 < C1; C4 < C3
• optimal soln (stop when a generated child is a soln. node w/ cost at most (1+alpha)*cost(best(open)); alpha is a given sub-opt. fraction)

• For Sp(P) > 1 (w/ P > 1), we need
      n*texp/((n/P)*(texp + (P-1)*tacc)) > 1
      texp > texp/P + ((P-1)/P)*tacc
      texp*(P-1)/P > ((P-1)/P)*tacc
      texp > tacc
  Also, Sp(P) = P*texp/(texp + (P-1)*tacc) stays below texp/tacc no matter how large P is.
• For constant efficiency, this is even worse:
      E(P) = Sp(P)/P = T(1)/(Tp(P)*P)
           = n*texp/(P*(n/P)*(texp + (P-1)*tacc))
           = n*texp/(n*texp + n*(P-1)*tacc) = const. C <= 1
      1 + (P-1)*tacc/texp = 1/C
      (P-1)*tacc/texp = 1/C - 1
  Differentiating the left side wrt P to minimize the expr. (max. C) gives tacc/texp = 0, which cannot occur for any P. The expr. does not depend on n and only grows w/ P, so C cannot be held constant as P increases.

• Nodes w/ cost >= the current best global soln. so far are discarded. Note that this can sometimes lead to idling, and at other times non-essential work can be done before such deletion of nodes takes place. Both are overheads of parallel B&B.
• A local best soln. @ head of local open is the global opt. if all other processors have terminated by then (their termn. msg.
may be in transit in some cases).

Load Balancing

[Figure legend: load info exchange (LIE); load/work transfer.]

• Generic Load Balance protocol
  ‒ Periodic LIEs between subsets of processors (generally, neighbors or small extended neighborhoods, e.g., distance k apart for small k)
  ‒ Followed by work transfers as indicated by the LIE and the work transfer policy
• Issues to be determined in a LB technique (generally application and parallel system dependent):
  ‒ Frequency of LIE
  ‒ Definition of load
  ‒ Load difference threshold or, in general, some relative load condition criteria to trigger work transfer
  ‒ Donor or receiver initiated load/work transfer?
  ‒ How much and which work to transfer?

: minimizes non-essential work but significantly increases idling due to large taccess/texp

without a numerical load computation based on rank (a la the AC method)

Quality Equalizing (QE) Load Balancing Techniques
• Various techniques developed by my former Ph.D. student Prof. Nihar Mahapatra (MSU) and myself over a few years. The refs are:
• N.R. Mahapatra and S. Dutt, ``An efficient delay-optimal distributed termination detection algorithm'', Jour. Parallel and Distr. Computing, Oct. 2007, pp. 1047-1066.
• N.R. Mahapatra and S. Dutt, ``Adaptive Quality Equalizing: High-Performance Load Balancing for Parallel Branch-and-Bound Across Applications and Computing Systems'', Proc. Joint IEEE Parallel Processing Symposium/Symp. on Parallel and Distr. Processing, April 1998.
• N.R. Mahapatra and S. Dutt, ``Random Seeking: A General, Efficient, and Informed Randomized Scheme for Dynamic Load Balancing'', Proc. Tenth IEEE Parallel Processing Symposium, April 1996, pp. 881-885.
• N.R. Mahapatra and S. Dutt, ``New anticipatory load balancing strategies for scalable parallel best-first search'', American Mathematical Society's DIMACS Series on Discrete Mathematics and Theoretical Computer Science, Vol. 22, 1995, pp. 197-232.
• S. Dutt and N.R.
Mahapatra, ``Scalable load-balancing strategies for parallel A* algorithms'', Special Issue on Scalability of Parallel Algorithms and Architectures, Journal of Parallel and Distr. Computing, Vol. 22, No. 3, Sept. 1994, pp. 488-505.
• S. Dutt and N.R. Mahapatra, ``Parallel A* algorithms and their performance on hypercube multiprocessors'', Proc. Seventh IEEE Parallel Processing Symposium, 1993, pp. 797-803.

• The donor processor grants very few nodes to the acceptor (e.g., alternating 2-3 nodes starting from the local rank-2 node)
• For high-latency low-bw platforms like NOWs (n/w of workstations and Beowulf clusters like Argo):
  ‒ set s higher (should be inversely proportional to bw; otherwise n/w saturation can occur)
  ‒ decrease frequency of load info exchange (LIE)

(alternating rank nodes in merged open list) for s > 1. Will see worst-case analysis later in this regard.

[E = T1/(P*Tp(P)) = W(N)/Wp(P) = W(N)/(W(N) + Wo(N, P)); N is the problem size]

Scalability Analysis
• Derivation of QE's isoefficiency upper-bound of Θ(PDd)

Path of processors Pi,1, Pi,2, Pi,3, ..., Pi,D-1, Pi,D:
  ‒ Pi,2: best node rank wrt Pi,1's = Θ(sd)
  ‒ Pi,3: best node rank wrt Pi,2's = Θ(sd) and wrt Pi,1's = Θ(2sd)
  ‒ Pi,D: best node rank wrt Pi,D-1's = Θ(sd) and wrt Pi,1's = Θ((D-1)sd)
Worst-case assumption (for worst-case rank difference): each proc. is worst in its neighborhood, and its neighbor on this path is best in that neighbor's own neighborhood.

[Figure: node cost vs. processor along the path. The proc. w/ the best node holds the opt. soln. (at opt. cost) in the worst case for isoeff.; the proc. w/ the worst node is at a Θ((D-1)sd) rank gap, w/ only 1 or a few best procs. holding essential nodes.]
• Taking into account that there are d other such paths of "neighbors" of the 1st path, the rank difference among d such paths of length about D is also Θ(Dsd) (the Θ(sd) rank gap between neighboring processors on a path encompasses the rank difference w/ the other (d-1) neighbors, one each in the (d-1) "neighboring" paths of length about D) = Θ(Dd) (s = const).
• After Θ(Dsd) = Θ(Dd) iterations, the proc. w/ the best node produces the optimal solution. In this time, Θ((Dd)^2/2) units of nonessential (NE) work get done in a group of d neighborhood paths of length about D. This happens across Θ(P/(dD)) such path groups, so total NE work across P procs. = Θ((P/(Dd))*(Dd)^2) = Θ(PDd) NE work or idling.

[Figure: node cost vs. processor. The proc. w/ the opt. node is at opt. cost; the proc. w/ the worst node is at a Θ((D-1)sd) rank gap, w/ only 1 or a few best procs. holding essential nodes.]

Θ((3tc + 3ts/2)/texp) to be precise

Θ((2tc + ts)/texp) to be precise (texp is a constant wrt arch.)

Rationale: More global load balancing (smaller global rank difference betw. best and worst qualitatively loaded processors) w/o high commun. overheads

(costk(1) > costi(3)
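The non-essential-work count in the scalability derivation above can be written out step by step. This is only a sketch: Θ denotes asymptotic order, reading the (Dd)^2/2 term as an arithmetic sum over m iterations is an assumption about the intended argument, and the symbols m, W_NE^group and W_NE^total are introduced here for illustration.

```latex
\begin{align*}
m &= \Theta(Dsd) = \Theta(Dd) \quad (s = \text{const}),\\
W_{NE}^{\text{group}} &= \sum_{k=1}^{m} k = \frac{m(m+1)}{2}
  = \Theta\!\left(\frac{(Dd)^2}{2}\right) = \Theta\!\left((Dd)^2\right),\\
W_{NE}^{\text{total}} &= \Theta\!\left(\frac{P}{Dd}\right)\cdot\Theta\!\left((Dd)^2\right)
  = \Theta(PDd).
\end{align*}
```

The factor 1/2 disappears inside Θ, which is why the per-group count of (Dd)^2/2 and the total of PDd are consistent.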