CS 717: Programming for Fault-tolerance
Keshav Pingali
Cornell University

Background for this talk
• Performance is still important
• But the ability of software to adapt to faults is becoming crucial
• Most existing work is by OS/distributed-systems people
• A program-analysis/compiling perspective brings:
  – new ways of looking at problems
  – new questions

Computing Model
• Message-passing (MPI)
• Fixed number of processes
• Abstraction: process actions
  – compute
  – send(m,r) // send message m to process r
  – receive(s) // receive message from process s
  – receive_any() // receive message from any process
• FIFO channels between processes

Fault Model
• Fail-stop processes (cf. Byzantine failure)
• Multiple failures possible
• Number of processes before crash = number of processes after crash recovery (cf. recovery with N ≠ M)

Goals of fault recovery protocol
• Resilience to failure of any number of processes
• Efficient recovery from failure of a small number of processes
• Avoid rolling back processes that have not failed
• Do not modify application code if possible
• Use application-level information to reduce overheads
• Reduce disk accesses

Mechanisms
• Stable storage:
  – disk
  – survives process crashes
  – accessible to surviving processes
• Volatile log:
  – RAM associated with a process
  – evaporates when the process fails
• Piggybacking:
  – protocol information hitched onto application messages

Key idea: causality (Lamport)
[Figure: space-time diagram of processes P, Q, R with events a–f and z]
• Execution events: compute, send, receive
• Happened-before relation on execution events, e1 < e2:
  – e1 and e2 are done by the same process, and e1 was done before e2
  – e1 is a send and e2 is the matching receive
  – transitivity: if e1 < ek and ek < e2 for some event ek, then e1 < e2
• Intuition: like coarse-grain dependence information

Key idea: consistent cut
[Figure: timelines of processes P and Q crossed by four candidate cuts (I)–(IV)]
• A set containing one state for each process (a “timeline”)
• If event e is behind the timeline, then events that “happened-before” e are also behind the timeline
• Intuitively: every message that has been received by a process has already been sent by some other process (as in I, II, IV)
• There may be messages “in-flight” (as in II)

Classification of recovery protocols
• Check-pointing: save state on stable storage
  – Uncoordinated: each process saves its state independently of the others
  – Coordinated: processes co-operatively save a distributed state
    – Blocking: hardware/software barrier
    – Non-blocking: distributed snap-shot
• Message-logging: log messages and replay

Uncoordinated Checkpointing
• Each process saves its state independently of other processes
• Each process numbers its checkpoints starting at 0
• Upon failure of any process, all processes cooperate to find the “recovery line” (consistent cut + in-flight messages)
[Figure: processes P and Q with checkpoints m, m+1 and n, n+1]
Consistent cuts: {m,n}, {m,n+1}, {m+1,n+1}
Not a consistent cut: {m+1,n}

Rollback dependency graph
• Nodes: for each process
  – one node per checkpoint
  – one node for the current state
• Edges: Sn → Rm if process S sent a message in its checkpoint interval n that R received in its checkpoint interval m
  – Intuition: if Sn cannot be on the recovery line, neither can Rm
• Algorithm: propagate badness starting from the current-state nodes of the failed processes (sketched in code after the example below)

Example
[Figure: (a) time-line of four processes P0–P3 with checkpoints 00–33; (b) the roll-back dependence graph; (c) propagation of badness, with X marking states that cannot be on the recovery line]
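The badness propagation above is a simple fixed-point computation over the graph. The following C sketch shows one way to code it; the edge-list encoding, the node numbering, and the tiny two-process example are invented for illustration (the four-process example figure is not reproduced).

```c
/* Recovery-line computation by badness propagation -- a minimal sketch. */
#include <stdio.h>

#define MAXN 16
#define MAXE 64

/* Edge u -> v encodes: if node u cannot be on the recovery line,
   neither can node v (Sn -> Rm in the slides' notation). */
static int nedges, esrc[MAXE], edst[MAXE];
static int bad[MAXN];            /* 1 = cannot be on the recovery line */

static void edge(int u, int v) { esrc[nedges] = u; edst[nedges++] = v; }

static void propagate(void) {
    int changed = 1;
    while (changed) {            /* iterate to a fixed point */
        changed = 0;
        for (int e = 0; e < nedges; e++)
            if (bad[esrc[e]] && !bad[edst[e]]) { bad[edst[e]] = 1; changed = 1; }
    }
}

int main(void) {
    /* Hypothetical layout: nodes 0,1,2 = P's checkpoints 0,1 and current
       state; nodes 3,4,5 = Q's checkpoints 0,1 and current state. */
    edge(2, 5);   /* P sent Q a message after P's last checkpoint */
    edge(4, 2);   /* Q sent P a message after Q's checkpoint 1 */
    bad[2] = 1;   /* P failed: its current state is lost */
    propagate();
    /* Recovery line = the latest non-bad node of each process: here
       P's checkpoint 1 (node 1) and Q's checkpoint 1 (node 4). */
    for (int n = 0; n < 6; n++)
        printf("node %d: %s\n", n, bad[n] ? "rolled back" : "usable");
    return 0;
}
```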
Protocol
[Figure: P sends to Q with the piggybacked checkpoint number n]
• Each process maintains a “next-checkpoint-#”
  – incremented when a checkpoint is taken
• Send: piggyback “next-checkpoint-#” on the message
• Receive/receive_any: save (Q, data, n) in the log
• At a checkpoint:
  – save local state and log on stable storage
  – empty the log
• SOS from a process:
  – send the current log to the recovery manager
  – wait to be informed where to roll back to
  – roll back
• In-flight messages: omitted from this talk

Discussion
• Easy to modify our algorithm to find a recovery line with no in-flight messages
• No messages or coordination required to take local checkpoints
• The protocol can bootstrap on any algorithm for saving uniprocessor state
• Cascading rollback is possible
• Discarding local checkpoints requires finding the current recovery line (global coordination…)
• One process fails => all processes may be rolled back

Non-blocking Coordinated Checkpointing
• Distributed snapshot algorithms
  – Chandy and Lamport, Dijkstra, etc.
• Key features (cf. uncoordinated checkpointing):
  – processes do not necessarily save local state at the same time or at the same point in the program
  – coordination ensures the saved states form a consistent cut

Chandy/Lamport algorithm
• Process graph
  – static
  – forms a strongly connected component
• Some process is the “checkpoint coordinator”
  – initiates the taking of a snapshot
  – detects when all processes have completed local checkpoints
  – advances the global snapshot number
• Coordination is done using marker tokens sent along the process-graph channels

Protocol (simplified)
(a code sketch follows the discussion below)
• Coordinator:
  – saves its local state
  – sends marker tokens on all outgoing edges
• Other processes:
  – when the first marker is received, save local state and send marker tokens on all outgoing channels
• All processes:
  – subsequent markers are simply eaten up
  – once markers have been received on all input channels, inform the coordinator that the local checkpoint is done
• Coordinator: once all processes are done, advances the snapshot number

Example
[Figure: markers flooding through the process graph]

Sketch of correctness proof
[Figure: processes P and Q; can an anti-causal message d exist?]
• P must have sent a marker on channel PQ before it sent application message d
• Q must have received the marker before it received d
• So Q must have taken its checkpoint before receiving d
• Hence an anti-causal message like d cannot exist

Discussion
• Easy to modify the protocol to save in-flight messages with the local checkpoint
• No cascading roll-back
• Number of coordination messages = |E| + |N|
• Discarding snapshots is easy
• One process fails => all processes roll back
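As a concrete rendering of the marker rules above, here is a C sketch against the MPI point-to-point API. It assumes a fully connected process graph; MARKER_TAG, DONE_TAG, and save_local_state() are illustrative names, not part of the slides, and recording of in-flight channel state is omitted, as in the simplified protocol.

```c
/* Sketch of the simplified Chandy/Lamport marker protocol over MPI. */
#include <mpi.h>
#include <stdio.h>

#define MARKER_TAG 100
#define DONE_TAG   101

static int saved = 0;         /* local checkpoint taken this snapshot? */
static int markers_seen = 0;  /* input channels flushed so far */

static void save_local_state(int rank) {
    printf("process %d: local checkpoint saved\n", rank);  /* stand-in */
}

static void flood_markers(int rank, int nprocs) {
    for (int q = 0; q < nprocs; q++)
        if (q != rank)
            MPI_Send(NULL, 0, MPI_BYTE, q, MARKER_TAG, MPI_COMM_WORLD);
}

/* Coordinator: initiate the snapshot. */
void initiate_snapshot(int rank, int nprocs) {
    saved = 1;
    save_local_state(rank);
    flood_markers(rank, nprocs);
}

/* Any process: a marker arrived on the channel from process `from`. */
void on_marker(int rank, int from, int nprocs, int coordinator) {
    (void)from;                       /* FIFO channels identify the sender */
    if (!saved) {                     /* first marker: checkpoint and flood */
        saved = 1;
        save_local_state(rank);
        flood_markers(rank, nprocs);
    }
    /* subsequent markers are simply eaten up */
    if (++markers_seen == nprocs - 1) /* every input channel has a marker */
        MPI_Send(NULL, 0, MPI_BYTE, coordinator, DONE_TAG, MPI_COMM_WORLD);
    /* coordinator: on DONE from all, advance the global snapshot number */
}
```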
Message Logging
• When process P fails, it is restarted from the beginning.
• To redo its computations, P needs the messages sent to it by other processes before it failed.
• Other processes help P by replaying the messages they had sent it, but are not themselves rolled back.
• In principle, no stable storage is required.

Example
[Figure: P fails and sends SOS; Q replays <d1,d3>, R replays <d2>]
• Data structures: each process p maintains
  – SENDLOG[q]: messages sent to q by p
  – REPLAYLOG[q]: messages from q that are being replayed

How about messages sent by the failed process?
[Figure: P fails; Q responds with <d1,d3>,1 and R with <d2>,0 — each count is the number of messages already received from P, which P must suppress on replay]
• Each process p also maintains
  – RC[q]: number of messages received from q
  – SS[q]: number of messages to q that must be suppressed during recovery

Protocol
• Each process p maintains
  – SENDLOG[q]: messages sent to q
  – RC[q]: # of messages received from q
  – REPLAYLOG[q]: messages from q that are being replayed during recovery of p
  – SS[q]: # of messages to q that must be suppressed during recovery of p

Protocol (contd)
• Send(q,d):
  – append d to SENDLOG[q]
  – if (SS[q] > 0) then SS[q]--; else MPI_SEND(…);
• Receive(q):
  – if (REPLAYLOG[q] is empty) then { MPI_RECEIVE(…); RC[q]++; } else getNext(REPLAYLOG[q]);

Protocol (contd)
• SOS(q):
  – MPI_SEND(…, q, <SENDLOG[q], RC[q]>)
  – SS[q] = 0;

Protocol (contd)
• Fail:
  – for each other process q do
    { REPLAYLOG[q] = SENDLOG[q] = empty; SS[q] = RC[q] = 0; MPI_SEND(…, q, SOS); }
  – for each other process q do
    { discard application messages until you get the SOS response; update REPLAYLOG[q], SS[q], RC[q] from the response; }
  – start execution from the initial state
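The failure-free half of this protocol is small enough to sketch directly in C. The fixed-size logs, int-valued messages, and helper names below are simplifying assumptions; a real implementation would log arbitrary buffers.

```c
/* Failure-free side of the message-logging protocol -- a sketch. */
#include <mpi.h>

#define NPROCS   16
#define LOGLEN   1024
#define DATA_TAG 1

typedef struct { int msg[LOGLEN]; int head, len; } Log;

Log SENDLOG[NPROCS];    /* messages this process sent to q              */
Log REPLAYLOG[NPROCS];  /* messages from q being replayed, if any       */
int RC[NPROCS];         /* # messages received from q                   */
int SS[NPROCS];         /* # sends to q to suppress during our recovery
                           (q already received them before we failed)   */

void proto_send(int q, int d) {
    SENDLOG[q].msg[SENDLOG[q].len++] = d;   /* always log the send */
    if (SS[q] > 0)
        SS[q]--;                            /* q already saw this message */
    else
        MPI_Send(&d, 1, MPI_INT, q, DATA_TAG, MPI_COMM_WORLD);
}

int proto_receive(int q) {
    int d;
    if (REPLAYLOG[q].head == REPLAYLOG[q].len) {    /* nothing to replay */
        MPI_Recv(&d, 1, MPI_INT, q, DATA_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        RC[q]++;
    } else {
        d = REPLAYLOG[q].msg[REPLAYLOG[q].head++];  /* getNext(REPLAYLOG[q]) */
    }
    return d;
}
```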
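Continuing the same sketch, the recovery half: answering another process's SOS, and restarting after one's own failure. The message framing (the log followed by the count, both under a dedicated SOS_TAG that stands in for "discard application messages until the SOS response") is an assumption made for brevity.

```c
/* Recovery side of the same sketch. */
#define SOS_TAG 2

/* q has failed and asked for help: replay what we sent it, and tell it
   how many of its messages we already received (so it suppresses them). */
void on_sos(int q) {
    MPI_Send(SENDLOG[q].msg, SENDLOG[q].len, MPI_INT, q, SOS_TAG,
             MPI_COMM_WORLD);
    MPI_Send(&RC[q], 1, MPI_INT, q, SOS_TAG, MPI_COMM_WORLD);
    SS[q] = 0;
}

/* We failed: reset all protocol state, broadcast SOS, collect the
   responses, then restart execution from the initial state. */
void on_fail(int me, int nprocs) {
    int sos = 1;
    for (int q = 0; q < nprocs; q++) {
        if (q == me) continue;
        SENDLOG[q].len = REPLAYLOG[q].len = REPLAYLOG[q].head = 0;
        SS[q] = RC[q] = 0;
        MPI_Send(&sos, 1, MPI_INT, q, SOS_TAG, MPI_COMM_WORLD);
    }
    for (int q = 0; q < nprocs; q++) {
        if (q == me) continue;
        MPI_Status st;
        MPI_Recv(REPLAYLOG[q].msg, LOGLEN, MPI_INT, q, SOS_TAG,
                 MPI_COMM_WORLD, &st);       /* the SOS_TAG filters out */
        MPI_Get_count(&st, MPI_INT, &REPLAYLOG[q].len); /* app messages */
        MPI_Recv(&SS[q], 1, MPI_INT, q, SOS_TAG, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        RC[q] = REPLAYLOG[q].len;  /* assumption: RC counts replayed msgs */
    }
    /* now re-execute from the initial state, consuming REPLAYLOG and SS */
}
```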
Problem with receive_any
[Figure: Q receives d1 from P and d2 from R via receive_any, sends to T, then fails; the replay responses <d1>,0 and <d2>,0 do not record the receipt order]
• Process Q uses receive_any’s to receive d1 from P first and d2 from R next
• It then sends a message to T containing data that might depend on the receipt order of these messages
• During recovery, Q does not know what choices it made before the failure

Discussion
• Resilient to any number of failures
• Only failed processes are rolled back
• SENDLOG keeps growing as messages are exchanged
  – do coordinated check-pointing once in a while to discard the logs
• “Deterministic” protocol: does not work if the program has receive_any’s
• Orphan process: the state of T depends on lost choices

Solutions
• Pessimistic protocol: no orphans
  – a process saves its non-deterministic choices on stable storage before sending any message
• Optimistic protocol (Bacon/Strom/Yemini):
  – during recovery, find orphans and kill them off
• Causal logging: no orphans
  – piggyback choices on outgoing messages
  – ensures the receiving process knows all the choices it is causally dependent on

Example
[Figure: P makes choices A and B; its message to Q piggybacks <P:A,B>; Q’s later message to R piggybacks <P:A,B, Q:P>]
• A message carries all the choices it is causally dependent on
• Optimization: avoid resending the same information

Discussion
• Piggybacked choices on incoming messages are stored in the log
• The log also stores the choices made by the process itself
• The optimized protocol sends an incremental choice log on outgoing messages
• Resilient to any number of failures
  – any process affected by my choices knows my choices and sends them to me if I fail
  – if no process knows what choices I made, I am free to choose differently when I recover

Trade-off between resilience and overhead
• Suppose resilience is needed only for f (< N) failures
  – stop propagating a choice once it has been logged in f+1 processes
• Hard problem
  – how do we know, in a distributed system, that some piece of information has been sent to at least f+1 processes…
• FBL protocols (Alvisi/Marzullo)

Special case of f < 3 is easy
[Figure: P makes choices A, B and Q makes choice C; messages piggyback <P:A,B>, <P:A,B; Q:C>, and <Q:C> as they travel to R and S]
• When a process sends messages, it piggybacks
  – the choices it made
  – the choices made by processes that have communicated directly with it

Discussion
• Check-pointing + logging:
  – check-pointing gives resilience to any number of failures but rolls back all processes
  – logging gives optimized resilience to a small number of failures
  – check-pointing reduces the size of the logs
• The overhead of tracking non-deterministic choices in receive_any’s may be substantial

Research Questions
(1) How much non-determinism needs to be tracked in scientific programs?
• Many uses of receive_any are actually deterministic in a “global” sense (see next slide)
• These choices need not be tracked

Deterministic uses of receive_any
• Implementation of reduction operations
  – no need to track choices
  [Figure: d1 and d2 arrive in either order; the result d1+d2 is the same]
• Stateless servers
  – compute server
  – read-only data look-up
  [Figure: a stateless server maps d1 to f(d1) and d2 to f(d2) regardless of arrival order]
• Other patterns?

(2) Happened-before is an approximation of “causality” (dependence). How can we exploit this?
• Post hoc ergo propter hoc
• In general, an event is dependent only on a subset of the events that happened-before it
  – (e.g.) think of dependence analysis of sequential programs
• Can we reduce piggybacking overheads by using program analysis to compute dependence more precisely?

(3) The recovery program need not be the same as the source program.
• Can we compile an optimized “recovery script” for the case of a single process failure?
• During recovery, suppress not only the messages that were already sent by the failed process but also the associated computations, if possible

(4) Recovery with a different number of processes
• Requires application intervention
• Some connection with load-balancing (“load-imbalancing”)
• Virtualization of names

(5) How do we extend this to active-message and shared-memory models?
• Active messages: one-sided communication (as in Blue Gene)
• Shared memory:
  – connection with memory consistency models
  – acquire/release etc. have no direct analog in the message-passing recovery model

(6) How do we handle Byzantine failures?
• Fail-stop behavior is an idealization
  – in reality, processes may corrupt state, send bad data, etc.
• How do we detect and correct such problems?
  – redundancy is key, but TMR (triple modular redundancy) is too blunt an instrument
  – generalize approaches like check-sums, Hamming codes, etc.
• Fault-tolerant BLAS: Dongarra et al., van de Geijn
[Figure: d → Encode → E(d) → f → f(E(d)); Decode recovers f(d); a simple integrity test tells you if f(E(d)) is OK]
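The encode/compute/check pattern on the last slide can be made concrete with the classic checksum scheme for matrix-vector product, in the style of the fault-tolerant BLAS work cited above. The 3×3 example data and the tolerance are made up for illustration.

```c
/* Checksum-encoded matrix-vector product: encode, compute on the
   encoded data, then run a simple integrity test on the result. */
#include <stdio.h>
#include <math.h>

#define N 3

/* y = Ac * x, where Ac is A extended with a checksum row whose entries
   are the column sums of A.  Then y[N] must equal y[0]+...+y[N-1];
   a mismatch flags a corrupted computation. */
int checked_matvec(const double A[N][N], const double x[N], double y[N + 1]) {
    double Ac[N + 1][N];
    for (int j = 0; j < N; j++) {              /* encode: checksum row */
        Ac[N][j] = 0.0;
        for (int i = 0; i < N; i++) { Ac[i][j] = A[i][j]; Ac[N][j] += A[i][j]; }
    }
    for (int i = 0; i <= N; i++) {             /* compute on encoded data */
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += Ac[i][j] * x[j];
    }
    double s = 0.0;                            /* integrity test */
    for (int i = 0; i < N; i++) s += y[i];
    return fabs(s - y[N]) < 1e-9 * (fabs(s) + 1.0);   /* 1 = looks OK */
}

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,10}};
    double x[N] = {1, -2, 0.5}, y[N + 1];
    printf("clean run:   %s\n", checked_matvec(A, x, y) ? "OK" : "corrupted");
    y[1] += 0.001;                             /* simulate a fault in one entry */
    double s = y[0] + y[1] + y[2];
    printf("after fault: %s\n",
           fabs(s - y[N]) < 1e-9 * (fabs(s) + 1.0) ? "OK" : "corrupted");
    return 0;
}
```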