Consistent Cuts and Un-coordinated Check-pointing

Cuts
[Figure: space-time diagram of a computation with events e0–e13 on three process timelines]
• Subset C of events in computation
– some definitions require at least one event from each process
• For each process P, events in C that executed on P form an
initial prefix of all events that executed on P
• Cut: {e0,e1,e2,e4,e7}
• Not a cut: {e0,e2,e4,e7} (dropping e1 breaks the prefix property on e1's process)
• Frontier of cut: subset of cut containing last events on each
process
– for our example, {e2,e4,e7}
Equivalent definition of cut
[Figure: the same space-time diagram of events e0–e13]
• Subset C of events in computation
• If e′ ∈ C, e → e′, and e and e′ executed on the same process, then e ∈ C.
• What happens if we remove the condition that e and e′ were executed on the same process?
Consistent cut
[Figure: space-time diagram of events e0–e13]
• Subset C of events in computation
• If e′ ∈ C and e → e′, then e ∈ C (checked in the sketch below)
– Consistent cut: {e0, e1, e2, e4, e5, e7}
• note e5 → e2, but the cut is still consistent by our definition because e5 is included
– Inconsistent cut: {e0, e1, e2, e4, e7} (e5 → e2, but e5 is not in the cut)
– Not a cut: {e0, e2, e4, e7}
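
To make the definition concrete, here is a minimal C++ sketch of a consistency check (the event encoding and names are hypothetical, not from the slides). By transitivity, it suffices to check the immediate happens-before edges, which include both process-order and message edges:

#include <set>
#include <utility>
#include <vector>

using Event = int;                      // events e0, e1, ... encoded as integers
using Edge  = std::pair<Event, Event>;  // (e, e'): e immediately happens-before e'

// C is a consistent cut iff for every edge (e, e') with e' in C, e is also in C.
bool isConsistentCut(const std::set<Event>& C,
                     const std::vector<Edge>& happensBefore) {
    for (const auto& [e, ePrime] : happensBefore)
        if (C.count(ePrime) && !C.count(e))
            return false;               // e' is in C but its predecessor e is not
    return true;
}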
Properties of consistent cuts (0)
[Figure: space-time diagram illustrating the proof, with e ∉ C and e′ ∈ C]
• If a cut is inconsistent, there must be a message whose receiving event is in C but whose sending event is not.
• Proof: there must be events e and e′ such that e → e′, e′ ∈ C, but e ∉ C. Consider the chain e → e0 → e1 → … → e′. There must be consecutive events ei → ej in this chain such that e, e0, …, ei are not in C, but ej is in C. Clearly, ei and ej must be executed by different processes (otherwise the prefix property of cuts would put ei in C). Therefore, ei is a send and ej is the corresponding receive.
Properties of consistent cuts (I)
[Figure: space-time diagram illustrating Property (I)]
• Let eP be a computational event on the frontier of a consistent cut C. If eP → e′Q, then e′Q cannot be in C.
• Proof: Consider the causal chain eP → e1 → … → e′Q. Event e1 must execute on process P, because eP is a computational event (not a send), so the only event it immediately precedes is the next event on P. If eP is on the frontier, e1 is not in C. By the definition of consistent cut, e′Q cannot be in C (otherwise e1, which happens before it, would have to be in C).
Properties (II)
[Figure: space-time diagram illustrating Property (II)]
• Let F = {e0, e1, …} be a set of computational events, one from each process. F is the frontier of a consistent cut iff the events in F are pairwise concurrent.
• Proof: from Property (I) and Property (0).
Properties of consistent cuts (III):
Lattice of consistent cuts
[Figure: space-time diagram of events e0–e13 with two consistent cuts C1 and C2]
C1 ≤ C2 iff C1 ⊆ C2
Join(C1, C2) = C1 ∪ C2
Meet(C1, C2) = C1 ∩ C2
Un-coordinated check-pointing
[Figure: timelines of processes p, q, and r with local check-points; * marks the failure]
• Each process saves its local state at start, and then
whenever it wants.
• Events: compute, send, receive, take check-point
• Recovery line: frontier of a consistent cut whose frontier events
are all check-points
• Is there an optimum recovery line? How do we find it?
Check-point Dependency Graph
[Figure: an execution of p, q, r with check-points, and the corresponding check-point dependency graph]
• Nodes
– One for each local check-point
– One for current state of each surviving process
• Edges: one for each message (e,e′) from some P to Q
– source: the node for the last check-point on P that happened before e
– destination: the node n on Q for the first check-point/current state such
that e′ happened before n (a construction sketch follows)
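
A sketch of the edge construction (the record layout is hypothetical; the slides define the graph only abstractly). Events on each process are identified by their local index, and checkpoints[p] holds the sorted indices at which process p took check-points:

#include <vector>

struct Message   { int srcProc, sendIdx, dstProc, recvIdx; };  // local event indices
struct GraphEdge { int srcProc, srcCkpt, dstProc, dstCkpt; };

// checkpoints[p]: sorted local indices of p's check-points (index 0 = initial one);
// dstCkpt == checkpoints[dstProc].size() denotes the current-state node.
std::vector<GraphEdge> buildDependencyGraph(
        const std::vector<std::vector<int>>& checkpoints,
        const std::vector<Message>& messages) {
    std::vector<GraphEdge> edges;
    for (const Message& m : messages) {
        const auto& sc = checkpoints[m.srcProc];
        const auto& rc = checkpoints[m.dstProc];
        int s = 0;                       // last check-point before the send
        while (s + 1 < (int)sc.size() && sc[s + 1] < m.sendIdx) ++s;
        int d = 0;                       // first check-point after the receive
        while (d < (int)rc.size() && rc[d] <= m.recvIdx) ++d;
        edges.push_back({m.srcProc, s, m.dstProc, d});
    }
    return edges;
}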
Properties of check-point
dependency graph
• Node c2 is reachable from node c1 in the graph iff the
check-point corresponding to c1 happens before the
check-point corresponding to c2.
Finding optimum recovery line
[Figure: dependency graph of p, q, r with candidate recovery lines RL0–RL3; * marks the failure]
• RL0 = { last node on each process }
• While (there exist u, v in RLi such that v is reachable from u)
– RLi+1 = RLi – {v} + {node before v on the same process as v}
• The final RL when the loop terminates is the optimum recovery line.
• A direct code sketch follows; see later for an efficient algorithm.
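
The loop above translates almost directly into code. A minimal sketch, assuming one node index per process (check-points counted from 0) and a caller-supplied reachability predicate over the dependency graph (e.g., a DFS):

#include <vector>

// rl[p] = index of the node currently chosen on process p, initially the last
// node of each process. reachable(p, i, q, j) answers: is node j on process q
// reachable from node i on process p in the check-point dependency graph?
template <typename Reach>
std::vector<int> optimumRecoveryLine(std::vector<int> rl, Reach reachable) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int p = 0; p < (int)rl.size() && !changed; ++p)
            for (int q = 0; q < (int)rl.size() && !changed; ++q)
                if (p != q && reachable(p, rl[p], q, rl[q])) {
                    --rl[q];           // drop v, take the node before it;
                    changed = true;    // terminates: the initial check-points
                }                      // always form a consistent recovery line
    }
    return rl;
}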
Correctness
• The algorithm clearly computes a set of concurrent check-points, one from each process: the loop exits only when no chosen node is reachable from another, i.e., when no chosen check-point happens before another.
• From Property (II), it follows that these check-points are the frontier of a consistent cut.
Optimality
• Suppose O is a better (later) recovery line.
• O cannot be RL0; otherwise our algorithm succeeds immediately. So RL0 is better than O.
• Consider the iteration at which RLi is better than O but RLi+1 is not. There exist u, v in RLi such that v is reachable from u, and RLi+1 is obtained from RLi by dropping v and taking the check-point prior to v. Therefore, v must be in O. Let x in O be the check-point on the same process as u; since RLi is better than O, x is at or before u. We then have x → v (via u), so x and v are not concurrent, which contradicts Property (II).
Finding recovery line efficiently
• Node colors
– Yellow: on current recovery line
– Red: beyond current recovery line
– Green: behind current recovery line
• Bad edge:
– Source is red/yellow
– Destination is yellow/green
• Algorithm: propagate redness forward from the destinations of
bad edges
Algorithm
• Mark all nodes green
• For each node l that is the last node of a process
– Mark l yellow
– Add each edge (l,d) to the worklist
• While the worklist is nonempty do
– Get edge (s,d) from the worklist;
– If color(d) is red, continue;
– L = node to the left of d;
– Mark L yellow; add all bad edges (L,d′) to the worklist;
– R = first red node to the right of d;
– For each node t in the interval [d,R)
• Mark t red;
• Add all bad edges of the form (t,d′) to the worklist;
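
A C++ sketch of this coloring algorithm (hypothetical data layout, not fixed by the slides: one vector of nodes per process in timeline order, with per-node out-edge lists; all colors initialized GREEN by the caller). It relies on the invariants that red nodes always form a suffix of each process's timeline and that a process's initial check-point is never the destination of a dependency edge, so e.di - 1 is well defined:

#include <deque>
#include <vector>

enum Color { GREEN, YELLOW, RED };
struct Edge { int sp, si, dp, di; };   // (source process, node) -> (dest process, node)

// color[p][i]: color of node i on process p; adj[p][i]: out-edges of that node.
// Returns, per process, the index of its yellow node: the recovery line.
std::vector<int> findRecoveryLine(std::vector<std::vector<Color>>& color,
                                  const std::vector<std::vector<std::vector<Edge>>>& adj) {
    std::deque<Edge> work;
    int P = (int)color.size();
    std::vector<int> line(P);
    for (int p = 0; p < P; ++p) {                // last node of each process
        int last = (int)color[p].size() - 1;
        color[p][last] = YELLOW;
        line[p] = last;
        for (const Edge& e : adj[p][last]) work.push_back(e);
    }
    while (!work.empty()) {
        Edge e = work.front(); work.pop_front();
        if (color[e.dp][e.di] == RED) continue;  // already beyond the line
        // propagate redness forward: d up to the first red node goes red
        for (int i = e.di; i < (int)color[e.dp].size() && color[e.dp][i] != RED; ++i) {
            color[e.dp][i] = RED;
            for (const Edge& out : adj[e.dp][i])          // newly bad out-edges
                if (color[out.dp][out.di] != RED) work.push_back(out);
        }
        line[e.dp] = e.di - 1;                   // node left of d becomes the
        color[e.dp][e.di - 1] = YELLOW;          // new yellow node here
        for (const Edge& out : adj[e.dp][e.di - 1])
            if (color[out.dp][out.di] != RED) work.push_back(out);
    }
    return line;
}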
Remarks
• Complexity of algorithm: O(|E|+|V|)
– Each node is touched at most 3 times, to mark it
green, yellow, red
– Each edge is examined at most twice
• Once when its source goes green → yellow
• Once when its source goes yellow → red
• Another approach: use rollback dependency
graph (see Alvisi et al)
Practical details
• Each process numbers its check-points starting at 0.
• When a message is sent from S to R, the number of the sender's last check-point is piggybacked on the message.
• The receiver saves the message + piggyback in a log.
• When a check-point is taken, the message log is also saved on disk.
• In-flight messages can be recovered from this log after the recovery line has been established (see the sketch below).
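
A sketch of this bookkeeping (the message format and the transport stub are hypothetical; the slides do not fix a wire format):

#include <fstream>
#include <string>
#include <vector>

struct LoggedMessage { int senderCkpt; std::string payload; };

struct Process {
    int ckptNum = -1;                  // first check-point (taken at start) is 0
    std::vector<LoggedMessage> log;    // received messages + their piggybacks

    void send(const std::string& payload) {
        send_raw(ckptNum, payload);    // piggyback the last check-point number
    }
    void receive(int piggyback, const std::string& payload) {
        log.push_back({piggyback, payload});      // save message + piggyback
    }
    void checkpoint(const std::string& path) {
        ++ckptNum;
        std::ofstream out(path);
        // ... save local state here ...
        for (const auto& m : log)                 // the message log goes to disk too
            out << m.senderCkpt << ' ' << m.payload << '\n';
    }
    void send_raw(int, const std::string&) { /* transport stub */ }
};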
Garbage collection of saved
states
• Garbage collection of old states is a key
problem.
• One solution: run the recovery-line
algorithm once in a while, even if there is no
failure, and GC all states behind the
recovery line.
Application-level
Check-pointing
Recall
• We have seen system-level check-pointing.
• Trouble with system-level check-pointing:
– a lot of data is saved at each check-point
• PC, registers, stack, heap, some O/S state, network state, …
• thin pipe to disk problem
– lack of portability
• processor/OS state is very implementation-specific
• cannot restart check-point on different platform
• cannot restart check-point on different number of processors
• One alternative: application-level check-pointing
Application-level check-pointing
• Key idea: permit user to specify
– what variables should be saved at a check-point
– program point where check-point should be taken
• Example: protein-folding
– save only positions and velocities of bases
– check-point at end of time-step
• Advantages:
– less data saved
• only live data needs to be saved
• check-point at program points where live data is small and there are no
in-flight messages
– data can be saved in implementation-independent
manner
Warning
• This is more complex than it appears!
• We must restore
– PC: need to save where check-point was taken
– registers
– stack
• In general, there are many active procedure invocations when the check-point is taken.
• How do we restore the stack so procedure returns etc. happen
correctly?
• Heap: restored heap data will be in different
locations than at check-point
Right intuition
• In application-level check-pointing, we must use
the saved variables to recompute the system state
we would have saved in system-level checkpointing, modulo relocation of heap variables.
• Recovery script:
– code that is executed to accomplish this
– distinct from user code, but obviously derived from it
– however, it needs to be woven into the user code to simplify
problems such as register restoration
Example: DOME
(Beguelin et al., CMU)
• Distributed Object Migration Environment
(DOME)
• C++ library of data-parallel objects, automatically
distributed over networks of heterogeneous workstations
• Application-level check-pointing and restart
supported
– User-level
– Pre-processor based
Simple case
• Most computation occurs in a loop in main
• Solution:
– put one check-point at the bottom of the loop
– live variables at the bottom of the loop are globals
– write a script to save and restore the globals
– weave the script into main
Dome example
main(int argc, char *argv[])
{
  // * statements are introduced for failure recovery
  // prefix d on a variable type says "save me at check-point"
  dome_init(argc, argv);                      // *
  dScalar<int>   integer_variable;
  dScalar<float> float_variable;
  dVector<int>   int_vector;
  if (!is_dome_restarting())                  // *
    execute_user_initialization_code(…);      // *
  while (!loop_done(…)) {                     // loop_done uses only saved variables
    do_computation(…);
    dome_check_point();                       // *
  }
}
Analysis
• Let us understand how this code restores processor
state
– PC: we drop into the loop after restoring the globals
– registers: by making the recovery script part of main, we
ensure that register contents at the top of the loop are the same
for normal execution and for restart
– stack: we re-execute main, so its frame is restored
– heap: restored from the saved check-point, but may be
relocated
• Think: this works even if we restart on a different
machine!
Remarks
• The loop body is allowed to make function calls
– the real restriction is that there is one check-point and it
must be in main
• A command-line parameter is used to determine
whether execution is normal or a restart
• The user must write some code to restore variables
from the check-point
– perhaps library code can help (a sketch follows)
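
For instance, the library could register every d-prefixed variable when it is constructed and stream the registry to disk at each check-point. A plausible sketch, not the actual DOME implementation (a real library would serialize in an implementation-independent format, as noted earlier, rather than dump raw bytes):

#include <cstddef>
#include <cstdio>
#include <vector>

struct Saveable { void* addr; std::size_t size; };
static std::vector<Saveable> registry;      // all check-pointable variables

template <typename T>
struct dScalar {                            // simplified stand-in for DOME's dScalar
    T value{};
    dScalar() { registry.push_back({&value, sizeof(T)}); }
    operator T&() { return value; }
};

void dome_check_point() {                   // dump every registered variable
    std::FILE* f = std::fopen("ckpt.dat", "wb");
    for (const auto& s : registry) std::fwrite(s.addr, s.size, 1, f);
    std::fclose(f);
}

void dome_restore() {                       // restore in registration order
    std::FILE* f = std::fopen("ckpt.dat", "rb");
    for (const auto& s : registry) std::fread(s.addr, s.size, 1, f);
    std::fclose(f);
}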
More complex example
f() {
  dScalar<int> i;
  do_f_stuff;
  g(i);
  next_statement;
  …;
}

g(dScalar<int> &i) {
  do_g_stuff_1;
  dome_checkpoint();
  do_g_stuff_2;
}
General scenario
• A check-point could happen deep inside a
nest of procedure calls.
• On restart, we need to restore the stack so
procedure returns etc. can happen normally.
• Solution: save information about which
procedure invocations are live at the check-point
Example with Dome constructs
f() {
  dScalar<int> i;
  if (is_dome_restarting()) {
    next_call = dome_get_next_call();
    …                        // dispatch to the recorded label (elided)
  }
  do_f_stuff;
  dome_push("g1");
g1:
  g(i);
  dome_pop();
  next_statement;
  …;
}

g(dScalar<int> &i) {
  if (is_dome_restarting())
    goto restart_done;
  do_g_stuff_1;
  dome_checkpoint();
restart_done:
  do_g_stuff_2;
}
Challenge
• Do this for MPI code.
• Can the compiler determine
– where to check-point?
– what data to check-point?
• Need not save all data live at check-point
– if some variables can be easily recomputed from saved
data and program constants, we can re-compute those
values in the recovery script.
– we can modify program to make this easier.
• Measure of success: beat hand-written recovery
code