Lecture 11: Parallel Processing of Irregular Computations & Load Balancing
Shantanu Dutt
ECE Dept.
UIC
Discrete Event Simulation: Basics with VHDL Descriptions as an Example
VHDL Dataflow Description of a Circuit:
library IEEE;
use IEEE.STD_LOGIC_1164.all;

entity ckt1 is
  port (s1, s2 : in bit; Z : out bit);
end entity ckt1;

-- Z = s1 xor s2, built from two inverters, two AND gates and one OR gate
architecture data_flow of ckt1 is
  signal sbar1, sbar2, x, y : bit;
begin
  sbar1 <= not s1 after 2 ns;
  sbar2 <= not s2 after 2 ns;
  x <= s1 and sbar2 after 4 ns;
  y <= s2 and sbar1 after 4 ns;
  Z <= x or y after 4 ns;
end architecture data_flow;
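To make the event-driven view of this circuit concrete, here is a minimal Python sketch of a sequential discrete event simulator for ckt1. It is an illustration under stated assumptions, not the simulator a VHDL tool uses: the names `GATES` and `simulate` are made up for this example, the circuit is assumed to start in the settled state for its initial inputs, and every delay is treated as a pure (transport-style) delay, so VHDL's default inertial pulse rejection is not modeled.

```python
import heapq
from itertools import count

# Each gate of ckt1: output -> (function, input signal names, delay in ns).
# Listed in topological order so the initial settle below needs one pass.
GATES = {
    "sbar1": (lambda s1: 1 - s1, ("s1",), 2),
    "sbar2": (lambda s2: 1 - s2, ("s2",), 2),
    "x": (lambda s1, sbar2: s1 & sbar2, ("s1", "sbar2"), 4),
    "y": (lambda s2, sbar1: s2 & sbar1, ("s2", "sbar1"), 4),
    "Z": (lambda x, y: x | y, ("x", "y"), 4),
}


def simulate(stimuli, s1=0, s2=0):
    """stimuli: list of (time_ns, input_name, value). Returns the value changes."""
    values = {"s1": s1, "s2": s2}
    for out, (fn, ins, _) in GATES.items():  # assumed settled start state
        values[out] = fn(*(values[i] for i in ins))
    seq = count()  # tie-breaker: same-time events are handled in scheduling order
    queue = [(t, next(seq), sig, val) for t, sig, val in stimuli]
    heapq.heapify(queue)
    trace = []
    while queue:
        t, _, sig, val = heapq.heappop(queue)  # earliest pending event
        if values[sig] == val:
            continue  # no change on the signal, so nothing to propagate
        values[sig] = val
        trace.append((t, sig, val))
        for out, (fn, ins, delay) in GATES.items():
            if sig in ins:  # re-evaluate every gate driven by this signal
                new_val = fn(*(values[i] for i in ins))
                heapq.heappush(queue, (t + delay, next(seq), out, new_val))
    return trace


print(simulate([(10, "s1", 1)]))
# [(10, 's1', 1), (12, 'sbar1', 0), (14, 'x', 1), (18, 'Z', 1)]
```

With s1 rising at 10 ns, Z rises at 18 ns: one AND delay (4 ns) plus one OR delay (4 ns). The event on y scheduled for 16 ns carries no value change and is dropped.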
Discrete Event Simulation—Basics
Parallel DES for Logic Simulation
Correctness Issues in Parallel DES
• What happens if inter-processor messages are received out of simulation-time order, either from the same processor or from different processors? In other words, what happens if a msg. w/ simulation time ti is received before a msg. w/ simulation time tj, where ti > tj?
• If a proc. “blindly” processes all msgs. as they come, this can lead to incorrect simulation. E.g., the sim. time tj msg. can cause an output that affects the input to the process for the sim. time ti msg. So if the earlier-arriving sim. time ti msg. is processed before the later-arriving sim. time tj msg., the simulation output of the former will likely be incorrect.
Correctness Issues in Parallel DES: Solutions
• For each msg. sent from processor Pk targeting a (simulation) process Qr (which is, say, in processor Pq), Pk records the sim. time tq of the latest such msg. When sending the next msg. targeting Qr, Pk includes this prev. sim. time along w/ the current one, tj.
• So the msg. data looks like Mj = (input value, tj [curr. sim. time], tq [prev. sim. time]).
• The next msg. is Mi = (input value, ti, tj).
• Receiving proc. Pq also records the sim. time tq of the last msg. received for each input of Qr. If a new msg. meant for that input of Qr carries a prev. sim. time equal to the recorded one, then that msg. is in correct timing order. Otherwise, Pq stores the msg. and waits for the earlier msg. that it has not yet recvd.
• So if, for an input, say A, of Qr, the recorded time of the prev. msg. is tq and msg. Mi = (value, ti, tj) is recvd., it will not be processed yet. Only after msg. Mj = (value, tj, tq) is recvd. and processed will Mi be processed (since the latest recorded sim. time for i/p A of Qr is then tj).
• With regard to msgs. from multiple processors, Pq will not perform any simulation until it has recvd. timing-correct msgs. (e.g., Mj above) from all procs. that are supposed to send it msgs. This underscores the importance of null msgs., w/o which simulation cannot proceed in this approach.
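The receiver-side check described above can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the class name `InputChannel` and its fields are hypothetical, it covers a single input of Qr fed by one sender, and it assumes sim. times on a channel are strictly increasing. Waiting on several senders and null msgs. are not modeled.

```python
class InputChannel:
    """Receiver-side state for one input (e.g., A) of a simulation process Qr."""

    def __init__(self, start_time=0):
        self.last_time = start_time  # sim. time of the last msg. processed (tq)
        self.pending = {}            # prev. sim. time -> (value, curr. sim. time)

    def receive(self, value, curr_time, prev_time):
        """Store the msg.; return the msgs. that can now be processed, in order."""
        self.pending[prev_time] = (value, curr_time)
        ready = []
        # A held msg. is timing-correct once its prev. time equals the
        # recorded time; processing it may in turn release the next one.
        while self.last_time in self.pending:
            value, curr_time = self.pending.pop(self.last_time)
            ready.append((value, curr_time))
            self.last_time = curr_time
        return ready


chan = InputChannel(start_time=5)   # recorded time tq = 5
print(chan.receive(1, 20, 10))      # Mi = (1, ti=20, tj=10) arrives first: []
print(chan.receive(0, 10, 5))       # Mj = (0, tj=10, tq=5): [(0, 10), (1, 20)]
```

Mi is held until Mj arrives, after which both are released in simulation-time order.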
Some examples of applications requiring DES
Search Techniques
[Figure: example search graph with nodes A-G, annotated with visit order 1-10. DFS follows the black arcs; Soln_DFS follows the black+red arcs. Soln found (A,B,E,C,F) that meets some criterion.]
dfs(v) /* for basic graph visit or for soln finding when nodes are partial or full solns */
  v.mark = 1;
  for each (v,u) in E
    if (u.mark != 1) then
      dfs(u)
  end for;

Algorithm Depth_First_Search_Soln
  for each v in V
    v.mark = 0;
  if G has partial soln nodes then
    for each v in V
      if v.mark = 0 then dfs(v);
    end for;
  else soln_dfs(root); /* root is a particular node in V from where we can start the solution search */
[Figure: BFS visit order on the same graph.]
soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial soln nodes, and a soln. is a path */
  v.mark = 1;
  if path to v is a soln then
    return(1);
  for each (v,u) in E
    if (u.mark != 1) then
      soln_found = soln_dfs(u)
      if (soln_found = 1) then
        return(soln_found)
  end for;
  v.mark = 0; /* can visit v again to form another soln on a different path */
  return(0)
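A runnable Python sketch of soln_dfs is given below. The graph and the two solution criteria are hypothetical and chosen only for illustration (the edges of the slide's graph are not recoverable here). The sketch returns the path itself rather than a 0/1 flag, and the set `on_path` plays the role of v.mark.

```python
def soln_dfs(graph, v, is_solution, path=None, on_path=None):
    """Return the first path from v that satisfies is_solution, else None."""
    path = [] if path is None else path
    on_path = set() if on_path is None else on_path
    path.append(v)
    on_path.add(v)                      # v.mark = 1
    if is_solution(path):
        return list(path)
    for u in graph.get(v, ()):
        if u not in on_path:            # u.mark != 1
            found = soln_dfs(graph, u, is_solution, path, on_path)
            if found is not None:
                return found
    path.pop()
    on_path.remove(v)                   # v.mark = 0: v may appear on another path
    return None


# Hypothetical directed graph (adjacency lists).
graph = {"A": ["B", "C"], "B": ["E"], "C": ["F"], "E": ["C"], "F": []}

print(soln_dfs(graph, "A", lambda p: p[-1] == "F" and len(p) == 5))
# ['A', 'B', 'E', 'C', 'F']
print(soln_dfs(graph, "A", lambda p: p[-1] == "F" and len(p) == 3))
# ['A', 'C', 'F']
```

The second call only succeeds because C is unmarked on backtracking: C is first reached via A, B, E, rejected there, and then revisited directly from A.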
Search Techniques—Exhaustive DFS
[Figure: the same graph, showing DFS (black arcs), Soln_DFS (black+red arcs) and Optimal_Soln_DFS (black+red+green arcs). Soln found (A,B,E,C,F); best soln. so far (A,C,E,D,F,G).]
optimal_soln_dfs(v)
/* used when nodes are basic elts of the problem and not partial soln nodes, and a soln. is a path */
begin
  v.mark = 1;
  if path to v is a soln then begin
    if cost < best_cost then begin
      best_soln = soln; best_cost = cost;
    endif
    v.mark = 0; return;
  endif
  for each (v,u) in E
    if (u.mark != 1) then begin
      cost = cost + edge_cost(v,u); /* global var. */
      optimal_soln_dfs(u)
      cost = cost - edge_cost(v,u); /* undo on backtrack, since cost is global */
    endif
  end for;
  v.mark = 0; /* can visit v again to form another soln on a different path */
end

Algorithm Depth_First_Search_Opt_Soln
  for each v in V
    v.mark = 0;
  best_cost = infinity; cost = 0;
  optimal_soln_dfs(root);
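The exhaustive search can be sketched in Python as below. This is an illustration under assumptions: the weighted graph and the goal test are hypothetical, and the running path cost is passed as an argument instead of being kept in a global variable, so it is restored automatically when the recursion backtracks.

```python
import math


def optimal_path_dfs(graph, root, is_solution):
    """Exhaustive DFS with backtracking; returns (best_cost, best_path)."""
    best = {"cost": math.inf, "path": None}
    path, on_path = [root], {root}

    def visit(v, cost):
        if is_solution(path):
            if cost < best["cost"]:          # keep the cheapest solution seen
                best["cost"], best["path"] = cost, list(path)
            return
        for u, w in graph.get(v, {}).items():
            if u not in on_path:             # u.mark != 1
                path.append(u)
                on_path.add(u)
                visit(u, cost + w)
                path.pop()
                on_path.remove(u)            # unmark: u may lie on another path

    visit(root, 0)
    return best["cost"], best["path"]


# Hypothetical weighted graph: graph[v][u] = edge_cost(v, u).
graph = {"A": {"B": 1, "C": 5}, "B": {"C": 1, "D": 7}, "C": {"D": 1}, "D": {}}
print(optimal_path_dfs(graph, "A", lambda p: p[-1] == "D"))
# (3, ['A', 'B', 'C', 'D'])
```

All three A-to-D paths are enumerated (costs 3, 8 and 6) before the minimum is reported, which is what makes the search exhaustive.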
Y = partial soln. = a path from root to the current “node” (a basic elt. of the problem, e.g., a city in TSP, a vertex in V0 or V1 in min-cut partitioning). We go from each such “node” u to the next one that is “reachable” from u in the problem “graph” (which is part of what you have to formulate).
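[Figure: search tree from root through node u; nodes are labeled with their costs (10, 12, 15, 19, 16, 18, 18, 17), and (1), (2), (3) mark successive steps of the search.]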
Expand_&_est_cost(Y)
begin
  children = nullset;
  for each basic elt x of problem “reachable” from Y do begin
    if x not in Y and Y U {x} is feasible then begin
      child = Y U {x};
      path_cost(child) = path_cost(Y) + cost(Y, x) /* cost(Y, x) is cost of reaching x from Y */
      est(child) = lower bound on the cost of completing child to the best soln reachable from it;
      cost(child) = path_cost(child) + est(child);
      children = children U {child};
    endif
  endfor
  return(children);
end /* Expand_&_est_cost(Y) */
Best-First Search
BeFS (root)
begin
  open = {root} /* open is list of gen. but not expanded nodes (partial solns) */
  best_soln_cost = infinity;
  while open != nullset do begin
    curr = first(open); remove curr from open;
    if curr is a soln then return(curr) /* curr is an optimal soln */
    else children = Expand_&_est_cost(curr);
    /* generate all children of curr & estimate their costs; cost(u) should be a lower bound of cost of the best soln reachable from u */
    for each child in children do begin
      if child is a soln then
        delete all nodes w in open s.t. cost(w) >= cost(child);
      endif
      store child in open in increasing order of cost;
    endfor
  endwhile
end /* BeFS */
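A compact Python sketch of BeFS follows, using a binary heap for open. It is illustrative only: the example problem (cheapest A-to-D path in a hypothetical graph), the helper names, and the use of the plain path cost as the lower bound (i.e., est = 0, valid for non-negative edge costs) are all assumptions. Instead of physically deleting nodes from open when a solution child is generated, the sketch keeps the cost of the best solution generated so far (`incumbent`) and refuses to store children that cannot beat it; stale entries already in open are never expanded usefully.

```python
import heapq
import math
from itertools import count


def best_first_search(root, expand, is_solution, cost):
    """cost(node) must be a lower bound on the best solution reachable from node."""
    tie = count()                        # tie-breaker so nodes are never compared
    open_list = [(cost(root), next(tie), root)]
    incumbent = math.inf                 # cost of best solution generated so far
    while open_list:
        c, _, curr = heapq.heappop(open_list)   # first(open): min-cost node
        if is_solution(curr):
            return c, curr               # optimal, since cost is a lower bound
        for child in expand(curr):
            cc = cost(child)
            if cc >= incumbent:
                continue                 # cannot beat a known solution: prune
            if is_solution(child):
                incumbent = cc
            heapq.heappush(open_list, (cc, next(tie), child))
    return None


# Hypothetical problem: nodes are partial paths (tuples) in a weighted graph.
GRAPH = {"A": {"B": 1, "C": 5}, "B": {"C": 1, "D": 7}, "C": {"D": 1}, "D": {}}


def expand(path):
    return [path + (u,) for u in GRAPH[path[-1]] if u not in path]


def path_cost(path):
    return sum(GRAPH[a][b] for a, b in zip(path, path[1:]))


print(best_first_search(("A",), expand, lambda p: p[-1] == "D", path_cost))
# (3, ('A', 'B', 'C', 'D'))
```

Unlike the exhaustive DFS, the partial path (A, C) with cost 5 is never expanded, because a solution of cost 3 reaches the head of open first.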
Best-First Search
Proof of optimality when cost is a LB
• The current set of nodes in “open” represents a complete front of generated nodes, i.e., the rest of the un-generated nodes in the search space are descendants of nodes in “open”.
• If the first node curr in “open” is a soln, then cost(curr) <= cost(w) for each w in “open”.
• The cost of any solution node in the search space that is not in “open” and not yet generated is >= the cost of its ancestor in “open” (since cost is a LB) and thus >= cost(curr). Thus curr is the optimal (min-cost) soln.
Search techs for a TSP example
[Figure: TSP graph on cities A-F with edge costs, and the DFS search tree rooted at A. The solution nodes shown have costs 27, 31 and 33.]
Exhaustive search using DFS (w/ backtrack) for finding an optimal solution
Search techs for a TSP example (contd)
[Figure: BeFS search tree rooted at A for the same TSP graph. Each node is labeled path cost + lower-bound estimate (5+15, 8+16, 21+6, 11+14, 22+9, 23+8, 14+9), some nodes are marked X, and the solution nodes shown have costs 27 and 20.]
BeFS for finding an optimal TSP solution
Path cost for (A,E,F) = 8
MST for node (A,E,F) = MST{F,A,B,C,D}; cost = 16
• Lower-bound cost estimate: MST({unvisited cities} U {current city} U {start city})
• This is a LB because the set of LB structures (spanning trees) is a superset of the set of reqd soln structures (the Hamiltonian paths that complete the cycle)
• min(metric M’s values in set S) <= min(M’s values in subset S’)
• Similarly for max??
[Figure: set S of all spanning trees in a graph G, containing the subset S’ of all Hamiltonian paths (paths that visit each node exactly once) in G.]
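The MST-based estimate can be sketched in Python as below. The 4-city distance table is hypothetical (the edge costs of the slide's graph are not recoverable here), and `mst_cost` and `tsp_lower_bound` are illustrative names. The MST is computed with a simple Prim-style loop, which is adequate for small city sets.

```python
def mst_cost(nodes, dist):
    """Cost of a minimum spanning tree over nodes (Prim's algorithm)."""
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0
    in_tree = {nodes[0]}
    total = 0
    while len(in_tree) < len(nodes):
        # cheapest edge leaving the current tree
        w, v = min((dist[a][b], b)
                   for a in in_tree for b in nodes if b not in in_tree)
        in_tree.add(v)
        total += w
    return total


def tsp_lower_bound(path, cities, dist):
    """path cost + MST({unvisited cities} U {current city} U {start city})."""
    path_cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
    remaining = {c for c in cities if c not in path} | {path[-1], path[0]}
    return path_cost + mst_cost(remaining, dist)


# Hypothetical symmetric distances; the optimal tour A-B-C-D-A costs 7.
cities = ["A", "B", "C", "D"]
edges = {("A", "B"): 1, ("A", "C"): 4, ("A", "D"): 3,
         ("B", "C"): 2, ("B", "D"): 5, ("C", "D"): 1}
dist = {c: {} for c in cities}
for (a, b), w in edges.items():
    dist[a][b] = dist[b][a] = w

print(tsp_lower_bound(("A",), cities, dist))           # 4
print(tsp_lower_bound(("A", "B"), cities, dist))       # 5
print(tsp_lower_bound(("A", "B", "C"), cities, dist))  # 7
```

Each estimate is at most 7, the cost of the optimal tour, and the bound tightens as the partial path grows along that tour.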
BeFS for 0/1 ILP Solution
• X = {x1, …, xm} are 0/1 vars
• Choose vars xi = 0/1 as next nodes in some order (random or heuristic based)
Cost relations: C5 < C3 < C1 < C6; C2 < C1; C4 < C3
Search tree:
root (no vars exp.)
  x2=1: Solve LP w/ x2=1; Cost=cost(LP)=C2
    x4=1: Solve LP w/ x2=1, x4=1; Cost=cost(LP)=C4
      x5=0: Solve LP w/ x2=1, x4=1, x5=0; Cost=cost(LP)=C5 (optimal soln)
      x5=1: Solve LP w/ x2=1, x4=1, x5=1; Cost=cost(LP)=C6
    x4=0: Solve LP w/ x2=1, x4=0; Cost=cost(LP)=C3
  x2=0: Solve LP w/ x2=0; Cost=cost(LP)=C1
(Stop when the child generated is a soln. node whose cost is at most (1+alpha)*cost(best(open)), where alpha is the given sub-opt. fraction.)
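• For Sp(P) > 1, we need $\frac{n\,t_{exp}}{(n/P)\,(t_{exp} + (P-1)\,t_{acc})} > 1$
  $\Rightarrow P\,t_{exp} > t_{exp} + (P-1)\,t_{acc} \Rightarrow (P-1)\,t_{exp} > (P-1)\,t_{acc} \Rightarrow t_{exp} > t_{acc}$ (for P > 1).
  Moreover, $S_p(P) = \frac{P\,t_{exp}}{t_{exp} + (P-1)\,t_{acc}} \to t_{exp}/t_{acc}$ as P grows, so when $t_{exp} > t_{acc}$ the speedup stays below $t_{exp}/t_{acc}$ for every P.
• For constant efficiency, this is even worse:
  $E(P) = S_p(P)/P = \frac{T(1)}{P\,T_p(P)} = \frac{n\,t_{exp}}{P\,(n/P)\,(t_{exp} + (P-1)\,t_{acc})} = \frac{n\,t_{exp}}{n\,t_{exp} + n\,(P-1)\,t_{acc}} = C \le 1$
  $\Rightarrow 1 + (P-1)\,t_{acc}/t_{exp} = 1/C \Rightarrow (P-1)\,t_{acc}/t_{exp} = 1/C - 1$
  The left side does not depend on n, and its derivative wrt P is $t_{acc}/t_{exp}$, which is never 0. So E(P) strictly decreases as P grows, and no increase in problem size can hold the efficiency constant.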
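The speedup model used above, Sp(P) = n*texp/((n/P)*(texp+(P-1)*tacc)), is easy to tabulate. The short Python sketch below evaluates it and the corresponding efficiency for sample values; the function names and the numbers texp = 10, tacc = 1 are illustrative assumptions, not measurements.

```python
def speedup(P, t_exp, t_acc):
    """Sp(P) = n*t_exp / ((n/P) * (t_exp + (P-1)*t_acc)); n cancels out."""
    return P * t_exp / (t_exp + (P - 1) * t_acc)


def efficiency(P, t_exp, t_acc):
    """E(P) = Sp(P) / P."""
    return speedup(P, t_exp, t_acc) / P


# Illustrative values only: expansion time 10 units, access time 1 unit.
for P in (1, 2, 4, 8, 16, 64):
    print(P, round(speedup(P, 10.0, 1.0), 2), round(efficiency(P, 10.0, 1.0), 2))
```

Because n cancels, neither quantity can be improved by growing the problem size under this model; only the ratio of texp to tacc matters.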
• Nodes w/ cost >= the cost of the best global soln. found so far are discarded. Note that this can sometimes lead to idling, and at other times non-essential work can be done before such deletion of nodes takes place. Both are overheads of parallel B&B.
• A local best soln. @ the head of the local open list is the global opt. if all other processors have terminated by then (their termn. msgs. may be in transit in some cases).
Load Balancing
Legend:
Load info exchange LIE
Load/work transfer
• Generic Load Balance protocol
‒ Periodic LIEs between subsets of processors (generally,
neighbors or small extended neighborhoods, e.g., distance k
apart for small k)
‒ Followed by work transfers as indicated by the LIE and
work transfer policy
• Issues to be determined in a LB technique
(generally application and parallel system
dependent):
‒ Frequency of LIE
‒ Definition of load
‒ Load difference threshold or in general some relative load
condition criteria to trigger work transfer
‒ Donor or receiver initiated load/work transfer?
‒ How much and which work to transfer?
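One round of the generic protocol can be sketched in Python as below. Every policy choice in it is an assumption made for illustration: load is defined as the number of open nodes, transfers are donor initiated, the trigger is a fixed load-difference threshold, and a donor with k neighbors sends each qualifying neighbor 1/(k+1) of the load difference. The function name `balance_round` is hypothetical. Note that this is a quantity-based scheme, not the quality-equalizing one.

```python
def balance_round(loads, neighbors, threshold):
    """One LIE followed by donor-initiated work transfers.

    loads: list with the number of open nodes per processor.
    neighbors: dict mapping a processor index to its neighbor indices.
    """
    snapshot = list(loads)       # LIE: neighbors exchange their current loads
    new_loads = list(loads)
    for donor, nbrs in neighbors.items():
        for receiver in nbrs:
            diff = snapshot[donor] - snapshot[receiver]
            if diff > threshold:                     # transfer criterion
                amount = diff // (len(nbrs) + 1)     # how much work to move
                new_loads[donor] -= amount
                new_loads[receiver] += amount
    return new_loads


# 4 processors in a ring; all the work starts on processor 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
loads = balance_round([12, 0, 0, 0], ring, threshold=2)
print(loads)                                   # [4, 4, 0, 4]
print(balance_round(loads, ring, threshold=2))  # [4, 3, 2, 3]
```

Work reaches processor 2 only in the second round, since LIEs are restricted to immediate neighbors. This is the kind of delay that the LIE frequency and neighborhood size trade off against communication cost.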
: minimizes non-essential work but
significantly increases idling due to large taccess/texp
without a numerical load computation
based on rank (a la the AC method)
Quality Equalizing (QE) Load
Balancing Techniques
• Various techniques developed by my former Ph.D. student Prof. Nihar Mahapatra (MSU) and myself over a few years. The refs are:
• N.R. Mahapatra and S. Dutt, ``An efficient delay-optimal distributed termination detection algorithm'', Jour. Parallel and Distr. Computing, Oct. 2007, pp. 1047-1066.
• N.R. Mahapatra and S. Dutt, ``Adaptive Quality Equalizing: High-Performance Load Balancing for Parallel Branch-and-Bound Across Applications and Computing Systems'', Proc. Joint IEEE Parallel Processing Symposium/Symp. on Parallel and Distr. Processing, April 1998.
• N.R. Mahapatra and S. Dutt, ``Random Seeking: A General, Efficient, and Informed Randomized Scheme for Dynamic Load Balancing'', Proc. Tenth IEEE Parallel Processing Symposium, April 1996, pp. 881-885.
• N.R. Mahapatra and S. Dutt, ``New anticipatory load balancing strategies for scalable parallel best-first search'', American Mathematical Society's DIMACS Series on Discrete Mathematics and Theoretical Computer Science, Vol. 22, 1995, pp. 197-232.
• S. Dutt and N.R. Mahapatra, ``Scalable load-balancing strategies for parallel A* algorithms'', Special Issue on Scalability of Parallel Algorithms and Architectures, Journal of Parallel and Distr. Computing, Vol. 22, No. 3, Sept. 1994, pp. 488-505.
• S. Dutt and N.R. Mahapatra, ``Parallel A* algorithms and their performance on hypercube multiprocessors'', Proc. Seventh IEEE Parallel Processing Symposium, 1993, pp. 797-803.
• The donor processor grants very few nodes to
acceptor (e.g., alternating 2-3 nodes starting
from local rank 2 node)
• For high-latency low-bw platforms like NOWs (n/w of workstations and Beowulf clusters like Argo):
– set s higher (should be inversely proportional to bw, otherwise n/w saturation can occur)
– decrease frequency of load info exchange (LIE)
(alternating rank nodes in merged open list) for s > 1
. Will see worst-case analysis later in this regard.
[$E = \frac{T_1}{P\,T_p(P)} = \frac{W(N)}{W_p(P)} = \frac{W(N)}{W(N) + W_o(N,P)}$, where N is the problem size]
Scalability Analysis
• Derivation of QE’s isoefficiency upper-bound of Θ(PDd)
[Figure: a path of processors Pi,1, Pi,2, Pi,3, ..., Pi,D-1, Pi,D running from the proc. w/ the worst node to the proc. w/ the opt. node (opt. cost). Labels along the path: best node rank wrt Pi,1’s = Θ(sd); best node rank wrt Pi,2’s = Θ(sd) and wrt Pi,1’s = Θ(2sd); ...; best node rank wrt Pi,D-1’s = Θ(sd) and wrt Pi,1’s = Θ((D-1)sd).]
Worst-case assumption (for worst-case rank difference): each proc. is the worst in its neighborhood, and its neighbor on this path is the best in its neighborhood.
Proc. w/ best node => opt. soln. in the worst case for isoeff.: a Θ((D-1)sd) rank gap between the proc. w/ the worst node and the proc. w/ the opt. node, w/ only 1 or a few best procs. w/ essential nodes.
• Taking into account that there are d other such paths of “neighbors” of the 1st path, the rank difference among d such paths of length about D is also Θ(Dsd) (the Θ(sd) rank gap between neighboring processors on a path encompasses the rank difference w/ the other (d-1) neighbors, one each in the “neighboring” (d-1) paths of length about D) = Θ(Dd) (s = const).
• After Θ(Dsd) = Θ(Dd) iterations, the proc. w/ the best node produces the optimal solution. In this time, Θ((Dd)^2/2) non-essential (NE) work gets done in a group of d neighborhood paths of distance about D. This happens across Θ(P/(dD)) such path groups => total NE work across P procs. = Θ((P/(Dd))*(Dd)^2) => Θ(PDd) NE work or idling.
Θ((3tc + 3ts/2)/texp) to be precise
Θ((2tc + ts)/texp) to be precise
(texp is a constant wrt arch.)
Rationale: More global load balancing (=> smaller global rank difference betw. best and worst qualitatively loaded processors) w/o high commun. overhead
s
(costk(1) > costi(3)