CS 717: Programming for Fault-tolerance Keshav Pingali Cornell University

Background for this talk
• Performance still important
• But ability of software to adapt to faults is
becoming crucial
• Most existing work by OS/distributed
systems people
• Program analysis/compiling perspective:
– new ways of looking at problems
– new questions
Computing Model
• Message-passing (MPI)
• Fixed number of processes
• Abstraction: process actions
– compute
– send(m,r) //send message m to process r
– receive(s) //receive message from process s
– receive_any() //receive message from any process
• FIFO channels between processes
Fault Model
• Fail-stop processes
(cf. Byzantine failure)
• Multiple failures possible
• Number of processes before crash
= Number of processes after crash recovery
(cf. N → M)
Goals of fault recovery protocol
• Resilience to failure of any number of processes
• Efficient recovery from failure of small number of
processes
• Avoid rolling back processes that have not failed
• Do not modify application code if possible
• Use application-level information to reduce
overheads
• Reduce disk accesses
Mechanisms
• Stable storage:
– disk
– survives process crashes
– accessible to surviving processes
• Volatile log:
– RAM associated with process
– evaporates when process fails
• Piggybacking:
– protocol information hitched onto application messages
Key idea: causality (Lamport)
[Figure: timelines of processes P, Q, R with execution events a, b, c, d, e, f, z]
• Execution events: compute, send, receive
• Happened-before relation on execution events: e1 < e2
– e1,e2 done by same process, and e1 was done before e2
– e1 is send and e2 is matching receive
– transitivity: there exists ek such that e1 < ek and ek < e2
• Intuition: like coarse-grain dependence information
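The three rules above can be turned into a small computation. The following is a minimal sketch, assuming events are given as per-process lists in program order plus a list of (send, receive) pairs; the function name `happened_before` and the input layout are illustrative choices, not part of the talk.

```python
from itertools import product

def happened_before(process_events, messages):
    """Return the set of pairs (e1, e2) such that e1 < e2.

    process_events: dict process -> list of event names in program order
    messages: list of (send_event, matching_receive_event) pairs
    """
    events = [e for evs in process_events.values() for e in evs]
    hb = set()
    # Rule 1: same process, e1 done before e2.
    for evs in process_events.values():
        for i in range(len(evs)):
            for j in range(i + 1, len(evs)):
                hb.add((evs[i], evs[j]))
    # Rule 2: a send happens before its matching receive.
    hb.update(messages)
    # Rule 3: transitivity (Warshall-style closure, k is the outermost loop).
    for k, i, j in product(events, repeat=3):
        if (i, k) in hb and (k, j) in hb:
            hb.add((i, j))
    return hb

hb = happened_before(
    {"P": ["p1", "p2"], "Q": ["q1", "q2"], "R": ["r1"]},
    [("p1", "q1"), ("q2", "r1")],
)
```

In this toy input, `("p1", "r1")` is in the relation through the chain p1, q1, q2, r1, while p2 and q1 are unordered (concurrent).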
Key idea: consistent cut
[Figure: four candidate cuts (I), (II), (III), (IV) across the timelines of processes P and Q]
• Set containing one state for each process (“timeline”)
• Event e behind timeline =>
events that “happened-before” e are also behind timeline
• Intuitively, every message that has been received by a
process has already been sent by some other process
(as in I,II,IV)
• There may be messages “in-flight” (as in II)
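The definition can be checked mechanically. Below is a minimal sketch, assuming a cut is described by how many events of each process lie behind the timeline and each message by the positions of its send and receive events; these encodings and the function names are assumptions made for illustration.

```python
def is_consistent_cut(cut, messages):
    """cut: dict process -> number of that process's events behind the timeline.
    messages: list of (sender, send_index, receiver, recv_index), where the
    indices are 0-based positions in each process's event sequence.
    """
    for sender, s_idx, receiver, r_idx in messages:
        received = r_idx < cut[receiver]
        sent = s_idx < cut[sender]
        if received and not sent:
            return False  # received behind the timeline, but never sent
    return True

def in_flight(cut, messages):
    """Messages sent behind the timeline but not yet received behind it."""
    return [m for m in messages
            if m[1] < cut[m[0]] and not m[3] < cut[m[2]]]

msgs = [("P", 1, "Q", 2)]
ok_cut = is_consistent_cut({"P": 2, "Q": 3}, msgs)
bad_cut = is_consistent_cut({"P": 1, "Q": 3}, msgs)
pending = in_flight({"P": 2, "Q": 1}, msgs)
```

The second cut fails because Q's receive lies behind the timeline while P's send does not; the third cut is consistent but leaves the message in flight.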
Classification of recovery protocols
• Check-pointing: save state on stable storage
  – Uncoordinated: each process saves its state independently of others
  – Coordinated: processes co-operatively save distributed state
    • Blocking: hardware/software barrier
    • Non-blocking: distributed snap-shot
• Message-logging: log messages and replay
Uncoordinated Checkpointing
• Each process saves its state independently of other
processes
• Each process numbers its checkpoints starting at 0
• Upon failure of any process, all processes cooperate to
find “recovery line” (consistent cut + in-flight messages)
[Figure: timelines of P and Q with checkpoints m, m+1 and n, n+1]
Consistent cuts: {m,n}, {m,n+1}, {m+1,n+1}
Not consistent cuts: {m+1,n}
Rollback dependency graph
• Nodes: for each process
– one node per checkpoint
– one node for current state
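• Edges: (Sn → Rm) under the condition shown in the figure
[Figure: timelines of S and R with checkpoints n and m and a message between the two processes]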
Intuition: if Sn cannot be on recovery line, neither can Rm
• Algorithm: propagate badness starting from
current state nodes of failed processes
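The propagation step can be sketched as a graph reachability pass. This is an illustration under stated assumptions, not the talk's exact algorithm: the caller supplies the cross-process edges, and the sketch additionally assumes an edge from each node to the next node of the same process, so that badness spreads forward in time. The name `recovery_line` and the data layout are hypothetical.

```python
def recovery_line(nodes, edges, failed):
    """nodes: dict process -> list of node ids in time order
              (checkpoints first, current state last).
    edges: iterable of (src, dst) rollback-dependency edges between processes.
    failed: set of failed processes.
    Returns dict process -> latest node not marked bad (None if all are bad).
    """
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
    # Assumption: each node also points to the next node of its own process.
    for ns in nodes.values():
        for a, b in zip(ns, ns[1:]):
            succ.setdefault(a, []).append(b)
    # Propagate badness from the current-state nodes of failed processes.
    bad, stack = set(), [nodes[p][-1] for p in failed]
    while stack:
        n = stack.pop()
        if n not in bad:
            bad.add(n)
            stack.extend(succ.get(n, []))
    line = {}
    for p, ns in nodes.items():
        good = [n for n in ns if n not in bad]
        line[p] = good[-1] if good else None
    return line

line = recovery_line(
    {"P": ["P0", "P1", "Pcur"], "Q": ["Q0", "Q1", "Qcur"]},
    [("Pcur", "Q1")],
    {"P"},
)
```

Here the failure of P marks Pcur bad, which marks Q1 and then Qcur bad, so the computed line is P1 for P and Q0 for Q.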
Example
[Figure: (a) Time-line of processes P0-P3 with states 00-03, 10-13, 20-23, 30-33 and failures marked X; (b) Roll-back dependence graph; (c) Propagation of badness, with the states on the recovery line marked]
Protocol
[Figure: message between P and Q carrying the piggybacked checkpoint number n]
• Each process maintains “next-checkpoint-#”
– Incremented when checkpoint is taken
• Send: piggyback “next-checkpoint-#” on message
• Receive/receive_any: save (Q,data,n) in log
• At checkpoint:
– save local state and log on stable storage
– empty log
• SOS from a process:
– Send current log to recovery manager
– Wait to be informed about where to rollback to
– Rollback
• In-flight messages: omitted from talk
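The per-process bookkeeping in this protocol is small enough to sketch directly. The class below is a minimal illustration, assuming stable storage is modelled by an in-memory list and that messages are plain tuples; the class and method names are hypothetical, and in-flight messages are ignored, as in the talk.

```python
class CheckpointingProcess:
    def __init__(self, name):
        self.name = name
        self.next_checkpoint = 0  # "next-checkpoint-#", checkpoints start at 0
        self.log = []             # volatile log of (sender, data, n)
        self.stable = []          # stand-in for stable storage (assumption)

    def send(self, data):
        # Piggyback next-checkpoint-# on the application message.
        return (self.name, data, self.next_checkpoint)

    def receive(self, message):
        sender, data, n = message
        self.log.append((sender, data, n))  # save (Q, data, n) in log
        return data

    def checkpoint(self, state):
        # Save local state and log on stable storage, then empty the log.
        self.stable.append((self.next_checkpoint, state, list(self.log)))
        self.log.clear()
        self.next_checkpoint += 1

    def on_sos(self):
        # The current log is what would be sent to the recovery manager.
        return list(self.log)

p, q = CheckpointingProcess("P"), CheckpointingProcess("Q")
p.receive(q.send("x"))
p.checkpoint("state-0")
```

The recovery manager, the rollback itself, and the way the logs are turned into a dependency graph are left out of this sketch.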
Discussion
• Easy to modify our algorithm to find recovery line
with no in-flight messages
• No messages or coordination required to take local
checkpoints
• Protocol can boot-strap on any algorithm for
saving uniprocessor state
• Cascading rollback possible
• Discarding local checkpoints: requires finding
current recovery line (global coordination…)
• One process fails =>
all processes may be rolled back
Non-blocking Coordinated
Checkpointing
• Distributed snapshot algorithms
– Chandy and Lamport, Dijkstra, etc.
• Key features: (cf. uncoordinated checkpointing)
– Processes do not necessarily save local state at
same time or same point in program
– Coordination ensures saved states form
consistent cut
Chandy/Lamport algorithm
• Process graph
– Static
– Forms strongly connected component
• Some process is “checkpoint coordinator”
– Initiates taking of snapshot
– Detects when all processes have completed local checkpoints
– Advances global snapshot number
• Coordination is done using marker tokens sent along
process graph channels
Protocol (simplified)
• Coordinator
– Saves its local state
– Sends marker tokens on all outgoing edges
• Other processes
– When first marker is received, save local state and send
marker tokens on all outgoing channels.
• All processes
– Subsequent markers are simply eaten up.
– Once markers have been received on all input channels,
inform coordinator that local checkpoint is done.
• Coordinator
– once all processes are done, advances snapshot number
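The marker flow in the simplified protocol can be simulated in a few lines. This is a sketch under simplifying assumptions: channels are in-memory FIFO queues, only marker tokens travel (no application messages), and the process graph is strongly connected so every process eventually sees a marker. The function name `snapshot` and its return values are illustrative.

```python
from collections import deque

MARKER = "MARKER"

def snapshot(states, edges, coordinator):
    """states: dict process -> current local state.
    edges: list of (src, dst) channels of the static process graph.
    Returns (saved states, processes that reported done, markers sent).
    """
    channels = {e: deque() for e in edges}
    out_edges = {p: [e for e in edges if e[0] == p] for p in states}
    in_count = {p: sum(1 for e in edges if e[1] == p) for p in states}
    saved, seen, done = {}, {p: 0 for p in states}, []
    markers_sent = 0

    def take_checkpoint(p):
        nonlocal markers_sent
        saved[p] = states[p]                 # save local state
        for e in out_edges[p]:               # markers on all outgoing edges
            channels[e].append(MARKER)
            markers_sent += 1

    take_checkpoint(coordinator)
    progress = True
    while progress:
        progress = False
        for (src, dst), chan in channels.items():
            if chan:
                chan.popleft()
                progress = True
                if dst not in saved:         # first marker: checkpoint now
                    take_checkpoint(dst)
                seen[dst] += 1               # later markers are just consumed
                if seen[dst] == in_count[dst]:
                    done.append(dst)         # inform coordinator
    return saved, done, markers_sent

saved, done, sent = snapshot(
    {"P": 1, "Q": 2, "R": 3},
    [("P", "Q"), ("Q", "R"), ("R", "P")],
    "P",
)
```

Each process sends exactly one marker per outgoing channel, so the sketch sends |E| markers; the |N| completion notices are modelled by the `done` list.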
Example
Sketch of correctness proof
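[Figure: timelines of P and Q; message d is sent by P after P's checkpoint and received by Q before Q's checkpoint]
Can anti-causal message d exist?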
• P must have sent marker on channel PQ before it sent
application message d
• Q must have received marker before it received d
• So Q must have taken checkpoint before receiving d
• => anti-causal message like d cannot exist
Discussion
• Easy to modify protocol to save in-flight
messages with local check-point
• No cascading roll-back
• Number of coordination messages
= |E| + |N|
• Discarding snapshots is easy
• One process fails => all processes roll back
Message Logging
• When process P fails, it is restarted from the
beginning.
• To redo computations, P needs messages
sent to it by other processes before it failed.
• Other processes help process P by replaying
messages they had sent it, but are not
themselves rolled back.
• In principle, no stable storage required.
Example
[Figure: processes P, Q, R; Q receives d1 and d3 from P and d2 from R, then fails (X); on restart Q sends SOS to P and R, P replays <d1,d3> and R replays <d2>]
• Data structures: each process p maintains
– SENDLOG[q]: messages sent to q by p
– REPLAYLOG[q]: messages from q that are being
replayed
How about messages sent by failed process?
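[Figure: processes P, Q, R; before failing (X), Q received d1 and d3 from P and d2 from R, and sent d4 to P; after Q's SOS, P responds with <d1,d3>,1 and R responds with <d2>,0]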
• Each process p maintains
– RC[q]: number of messages received from q
– SS[q]: number of messages to q that must be
suppressed during recovery
Protocol
• Each process p maintains
– SENDLOG[q]: messages sent to q
– RC[q]: # of messages received from q
– REPLAYLOG[q]: messages from q that are
being replayed during recovery of p
– SS[q]:# of messages to q that must be
suppressed during recovery of p
Protocol (contd)
• Send(q,d):
– Append d to SENDLOG[q]
– If (SS[q] > 0) then SS[q]--;
else MPI_SEND(…);
• Receive(q):
– If (REPLAYLOG[q] is empty)
then {MPI_RECEIVE(…); RC[q]++;}
else getNext(REPLAYLOG[q]);
Protocol (contd)
• SOS(q):
– MPI_SEND(…,q,<SENDLOG[q],RC[q]>)
– SS[q] = 0;
Protocol (contd)
• Fail:
– for each other process q do
{REPLAYLOG[q] = SENDLOG[q] = empty;
SS[q] = RC[q] = 0;
MPI_SEND(..,q,SOS);
}
for each other process q do
{discard application messages till you get SOS response;
update REPLAYLOG[q],SS[q],RC[q] from response;
}
start execution from initial state;
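The Send, Receive, SOS and Fail steps can be exercised end to end in a small simulation. The sketch below makes several assumptions that are not in the talk: an in-memory `Network` class stands in for MPI channels, the SOS exchange is a direct method call rather than a message, no application messages are in flight at the time of failure, and after recovery RC[q] is set to the length of the replayed log.

```python
from collections import defaultdict, deque

class Network:
    """In-memory stand-in for FIFO point-to-point channels (assumption)."""
    def __init__(self):
        self.chan = defaultdict(deque)   # (src, dst) -> queued messages
    def send(self, src, dst, msg):
        self.chan[(src, dst)].append(msg)
    def recv(self, src, dst):
        return self.chan[(src, dst)].popleft()

class Process:
    def __init__(self, name, peers, net):
        self.name, self.peers, self.net = name, peers, net
        self.reset()

    def reset(self):
        self.sendlog = {q: [] for q in self.peers}
        self.rc = {q: 0 for q in self.peers}
        self.replaylog = {q: deque() for q in self.peers}
        self.ss = {q: 0 for q in self.peers}

    def send(self, q, d):
        self.sendlog[q].append(d)
        if self.ss[q] > 0:
            self.ss[q] -= 1              # suppress: q already has this message
        else:
            self.net.send(self.name, q, d)

    def receive(self, q):
        if self.replaylog[q]:
            return self.replaylog[q].popleft()
        self.rc[q] += 1
        return self.net.recv(q, self.name)

    def sos_response(self, q):
        self.ss[q] = 0
        return list(self.sendlog[q]), self.rc[q]

    def recover(self, others):
        """Fail step: clear state, collect SOS responses, restart from scratch."""
        self.reset()
        for q, proc in others.items():
            log, received = proc.sos_response(self.name)
            self.replaylog[q] = deque(log)
            self.ss[q] = received
            self.rc[q] = len(log)        # assumption, see lead-in

def demo():
    net = Network()
    p, q = Process("P", ["Q"], net), Process("Q", ["P"], net)
    p.send("Q", "d1"); p.send("Q", "d3")
    first = [q.receive("P"), q.receive("P")]
    q.send("P", "d4")
    got = p.receive("Q")
    q.recover({"P": p})                  # Q fails and restarts
    replayed = [q.receive("P"), q.receive("P")]
    q.send("P", "d4")                    # re-sent during recovery: suppressed
    return first, got, replayed, len(net.chan[("Q", "P")])
```

In `demo`, Q's re-execution reads d1 and d3 from the replay log instead of the network, and its second send of d4 is suppressed because P reported having received one message from Q.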
Problem with receive_any
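[Figure: processes P, Q, R, T; Q receives d1 from P and d2 from R using receive_any, sends a message to T, then fails (X); Q's SOS is answered with <d1>,0 by P, <d2>,0 by R, and <>,1 by T]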
• Process Q uses receive_any’s to receive d1 from P first and
d2 from R next
• Then it sends message to T containing data that might
depend on receipt order of these messages
• During recovery, Q does not know what choices it made
before failure
Discussion
• Resilient to any number of failures
• Only failed processes are rolled back
• SENDLOG keeps growing as messages are
exchanged
– Do coordinated check-pointing once in a while to
discard logs
• “Deterministic” protocol: does not work if
program has receive_any’s
• Orphan process: state of T depends on lost choices
Solutions
• Pessimistic protocol: no orphans
– process saves non-deterministic choices on stable
storage before sending any message
• Optimistic protocol: (Bacon/Strom/Yemini)
– during recovery, find orphans and kill them off
• Causal logging: no orphans
– piggyback choices on outgoing messages
– ensures receiving process knows all choices it is
causally dependent on
Example
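[Figure: processes P, Q, R; P receives from A and B using receive_any; P's message to Q carries <P:A,B>; Q's message to R carries <P:A,B, Q:P>]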
• Message carries all choices it is causally
dependent on
• Optimization: avoid resending same information
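The piggybacking rule can be sketched with a per-process table of known choices. This is a minimal illustration without the resend-avoidance optimization; the class name, the dictionary encoding of the piggyback, and the merge rule (keep the longer list, which assumes each process's choice list only grows by appending) are assumptions made for the example.

```python
class CausalProcess:
    def __init__(self, name):
        self.name = name
        # process name -> senders it chose, in receive_any order
        self.known = {}

    def receive_any(self, sender, piggyback):
        # Store the choices carried by the incoming message.
        for proc, choices in piggyback.items():
            mine = self.known.setdefault(proc, [])
            if len(choices) > len(mine):
                self.known[proc] = list(choices)
        # Record this process's own non-deterministic choice.
        self.known.setdefault(self.name, []).append(sender)

    def send(self):
        # Piggyback every choice this process is causally dependent on.
        return {proc: list(c) for proc, c in self.known.items()}

p, q = CausalProcess("P"), CausalProcess("Q")
p.receive_any("A", {})
p.receive_any("B", {})
to_q = p.send()            # carries P's choices A, B
q.receive_any("P", to_q)
to_r = q.send()            # carries P's choices and Q's choice P
```

The two piggybacks correspond to the labels <P:A,B> and <P:A,B, Q:P> on the slide.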
Discussion
• Piggybacked choices on incoming messages are
stored in log
• Log also stores choices made by process
• Optimized protocol sends incremental choice log
on outgoing messages
• Resilient to any number of failures
– any process affected by my choices knows my choices
and sends them to me if I fail
– if no process knows what choices I made, I am free to
choose differently when I recover
Trade-off between resilience and overhead
• Suppose resilience is needed only for f (< N) failures
– stop propagating choice once choice has been
logged in f+1 processes
• Hard problem
– how do we know in a distributed system that
some piece of information has been sent to at
least f+1 processes?
• FBL protocols (Alvisi/Marzullo)
Special case of f < 3 is easy
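[Figure: processes P, Q, R, S; P receives from A and B and Q receives from C using receive_any; the piggybacked choice logs shown are <P:A,B>, <Q:C>, and <P:A,B; Q:C>]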
• When process sends messages, it piggybacks
– choices it made
– choices made by processes that have communicated
directly with it
Discussion
• Check-pointing + logging:
– check-pointing gives resilience to any number
of failures but rolls back all processes
– logging gives optimized resilience to small
number of failures
– check-pointing reduces size of logs
• Overhead in tracking non-deterministic
choices in receive_any’s may be substantial
Research Questions
(1) How much non-determinism needs to be
tracked in scientific programs?
• Many uses of receive_any are actually
deterministic in a “global” sense.
(see next slide)
• These choices need not be tracked.
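Deterministic uses of receive_any
• Implementation of reduction operations
– No need to track choices
[Figure: d1 and d2 arrive via receive_any in either order; the result is d1+d2]
• Stateless servers
– compute server
– read-only data look-up
[Figure: requests arrive at a stateless server via receive_any]
• Other patterns?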
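The reduction case can be checked with a toy example: the arrival order chosen by receive_any varies, but the reduced value does not. The sketch assumes integer addition as the reduction operator; the function name and the (sender, value) encoding are illustrative.

```python
from itertools import permutations

def reduce_any_order(arrivals):
    """Sum values in whatever order receive_any happened to deliver them.

    arrivals: list of (sender, value) pairs in arrival order.
    Integer addition is used so the result is exactly order-independent.
    """
    total = 0
    for _sender, value in arrivals:
        total += value
    return total

contributions = [("P", 3), ("Q", 4), ("R", 5)]
results = {reduce_any_order(list(order)) for order in permutations(contributions)}
```

All six arrival orders produce the same total, so `results` contains a single value and no choice needs to be logged for this pattern.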
(2) Happened-before is an approximation of
“causality” (dependence). How to exploit this?
[Figure: a stateless server receives d1 and d2 via receive_any and replies f(d1) and f(d2)]
• Post hoc ergo propter hoc
• In general, an event is dependent only on a subset of events
that happened-before it.
– e.g., think of dependence analysis of sequential programs
• Can we reduce piggybacking overheads by using program
analysis to compute dependence more precisely?
(3) Recovery program need not be same as
source program.
• Can we compile an optimized “recovery script”
for the case of single process failure?
[Figure: a stateless server receives d1 and d2 via receive_any and replies f(d1) and f(d2)]
• During recovery, suppress not only messages that
were already sent by failed process but also the
associated computations if possible
(4) Recovery with different number of processes
• Requires application intervention
• Some connection with load-balancing
(“load-imbalancing”)
• Virtualization of names
(5) How do we extend this to active-message
and shared-memory models?
• Active-messages: one-sided communication
(as in Blue Gene)
• Shared-memory:
– connection with memory consistency models
– acquire/release etc. have no direct analog in
message-passing recovery model
(6) How do we handle Byzantine failures?
– Fail-stop behavior is idealization
• In reality, processes may corrupt state, send bad data
etc.
– How do we detect and correct such problems?
– Redundancy is key, but TMR is too blunt an
instrument.
– Generalize approaches like check-sums,
Hamming codes, etc.
• Fault-tolerant BLAS: Dongarra et al, Van de Geijn
[Figure: d is encoded to E(d); applying f to d gives f(d), and applying f to E(d) gives f(E(d)), which decodes to f(d)]
Simple integrity test tells you if f(E(d)) is OK
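A checksum-encoded matrix-vector product gives one concrete instance of the encode / compute / test / decode pattern. This is a sketch chosen for illustration, not the scheme of any particular fault-tolerant BLAS: E appends a row of column sums to the matrix, f multiplies by a vector, and the integrity test compares the last entry of the result with the sum of the others. Integer data is assumed so the comparison is exact.

```python
def encode(matrix):
    """E(d): append a checksum row holding the column sums."""
    return matrix + [[sum(col) for col in zip(*matrix)]]

def apply_f(matrix, x):
    """f: matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def check(result):
    """Integrity test on f(E(d)): the last entry must equal the sum of the rest."""
    return result[-1] == sum(result[:-1])

def decode(result):
    """Drop the checksum entry to recover f(d)."""
    return result[:-1]

A, x = [[1, 2], [3, 4]], [5, 6]
y = apply_f(encode(A), x)        # f(E(d))
ok = check(y)
corrupted = [y[0] + 1] + y[1:]   # simulate a bad value produced by a faulty process
still_ok = check(corrupted)
```

The test works because multiplication by x is linear, so the checksum row of the input becomes the checksum of the output; a single corrupted entry breaks the equality.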