Failure Detectors
CS 717
Ashish Motivala
Dec 6th 2001

Relevant Papers
• Unreliable Failure Detectors for Reliable Distributed Systems. Tushar Deepak Chandra and Sam Toueg. Journal of the ACM.
• A Gossip-Style Failure Detection Service. R. van Renesse, Y. Minsky, and M. Hayden. Middleware '98.
• Scalable Weakly-consistent Infection-style Process Group Membership Protocol. Ashish Motivala, Abhinandan Das, Indranil Gupta. To be submitted to DSN 2002 tomorrow. http://www.cs.cornell.edu/gupta/swim
• On the Quality of Service of Failure Detectors. Wei Chen, Cornell University (with Sam Toueg, advisor, and Marcos Aguilera, contributing author). DSN 2000.
• Fail-Aware Failure Detectors. C. Fetzer and F. Cristian. Proceedings of the 15th Symposium on Reliable Distributed Systems.

Asynchronous vs Synchronous Model
• Asynchronous model:
  – No value to assumptions about process speed
  – The network can arbitrarily delay a message
  – But we assume that messages are sequenced and retransmitted (an arbitrary number of times), so they eventually get through
• Synchronous model:
  – Assume that every process runs within a bounded delay
  – Assume that every link has a bounded delay
  – Usually described as “synchronous rounds”

Failures in Asynchronous and Synchronous Systems
• Failures in the asynchronous model?
  – Usually limited to process “crash” faults
  – If detectable, we call this “fail-stop” – but how do we detect it?
• Failures in the synchronous model?
  – Can talk about message “omission” failures: failure to send is the usual approach
  – But the network is assumed reliable (loss is “charged” to the sender)
  – Process crash failures, as in the asynchronous setting
  – “Byzantine” failures: arbitrary misbehavior by processes

Realistic???
• The asynchronous model is too weak: it has no clocks (real systems have clocks, and “most” timing meets expectations… but with heavy tails)
• The synchronous model is too strong (real systems lack a way to implement synchronized rounds)
• Partially Synchronous Model: an asynchronous network with a reliable channel
• Timed Asynchronous Model: time bounds on clock drift rates and message delays [Fetzer]

Impossibility Results
• Consensus: all processes need to agree on a value
• FLP impossibility of consensus
  – A single faulty process can prevent consensus
  – Realistic, because a slow process is indistinguishable from a crashed one
• Chandra/Toueg showed that the FLP impossibility applies to many problems, not just consensus
  – In particular, they show that FLP applies to group membership and reliable multicast
  – So these practical problems are impossible in asynchronous systems
• They also look at the weakest condition under which consensus can be solved

Byzantine Consensus
• Example: 3 processes A, B, C; one of them (C) is faulty
• The non-faulty processes A and B start with inputs 0 and 1, respectively
• They exchange messages: each now has a set of inputs {0, 1, x}, where x comes from C
• C sends 0 to A and 1 to B
• A has {0, 1, 0} and wants to pick 0; B has {0, 1, 1} and wants to pick 1
• By definition, impossibility in this model means “xxx can’t always be done”
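The scenario on this slide can be checked mechanically. Below is a minimal sketch (not part of the original slides) in which A and B decide by simple majority over the values they received; because the faulty C reports different values to each of them, the two correct processes decide differently.

```python
# Minimal sketch of the 3-process Byzantine example above (illustrative
# only): A and B decide by majority vote over the inputs they received,
# and the faulty process C tells A one thing and B another.
from collections import Counter

def majority(values):
    """Return the most common value among the received inputs."""
    return Counter(values).most_common(1)[0][0]

# A's input is 0, B's input is 1; C (Byzantine) claims 0 to A and 1 to B.
view_of_A = [0, 1, 0]   # [A's own input, B's input, what C told A]
view_of_B = [0, 1, 1]   # [A's input, B's own input, what C told B]

print("A decides", majority(view_of_A))   # -> 0
print("B decides", majority(view_of_B))   # -> 1: agreement is violated
```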
Chandra/Toueg Idea
• A theoretical idea
• Separate the problem into
  – The consensus algorithm itself
  – A “failure detector”: a form of oracle that announces suspected failures
  – But it can change its decision: a suspicion may later be revised
• Question: what is the weakest oracle for which consensus is always solvable?

Sample properties
• Completeness: detection of every crash
  – Strong completeness: eventually, every process that crashes is permanently suspected by every correct process
  – Weak completeness: eventually, every process that crashes is permanently suspected by some correct process

Sample properties
• Accuracy: does it make mistakes?
  – Strong accuracy: no process is suspected before it crashes
  – Weak accuracy: some correct process is never suspected
  – Eventual {strong/weak} accuracy: there is a time after which {strong/weak} accuracy is satisfied

A sampling of failure detectors

  Completeness   Accuracy: Strong   Weak         Eventually Strong          Eventually Weak
  Strong         Perfect (P)        Strong (S)   Eventually Perfect (◇P)    Eventually Strong (◇S)
  Weak           D                  Weak (W)     ◇D                         Eventually Weak (◇W)

Perfect Detector?
• Named Perfect, written P
• Strong completeness and strong accuracy
• Immediately detects all failures
• Never makes mistakes

Example of a failure detector
• The detector they call ◇W: “eventually weak”
• More commonly read as “diamond-W”
• Defined by two properties:
  – There is a time after which every process that crashes is suspected by some correct process {weak completeness}
  – There is a time after which some correct process is never suspected by any correct process {weak accuracy}
• E.g., we can eventually agree upon a leader; if it crashes, we eventually, accurately detect the crash

◇W: Weakest failure detector
• They show that ◇W is the weakest failure detector for which consensus is guaranteed to be achievable
• The algorithm is pretty simple
  – Rotate a token around a ring of processes
  – A decision can occur once the token makes it around the ring once without a change in the failure-suspicion status of any process
  – Subsequently, as the token is passed, each recipient learns the decision outcome

Building systems with ◇W
• Unfortunately, this failure detector is not implementable
• This is the weakest failure detector that solves consensus
• Using timeouts, we can make mistakes at arbitrary times

Group Membership Service
[Figure: a process group (pi, pj, …) over an asynchronous, lossy network; join, leave, and failure events update pj’s membership list]

Data Dissemination using Epidemic Protocols
• We want efficiency, robustness, speed and scale
• Tree distribution is efficient, but fragile and hard to configure
• Gossip is efficient and robust but has higher latency: network load is almost linear, and detection time scales as O(n log n) with the number of processes

State Monotonic Property
• A gossip message contains the state of the sender of the gossip
• The receiver uses a merge function to combine the received state with its own state
• Need some kind of monotonicity in the state and in the gossip

Simple Epidemic
• Assume a fixed population of size n
• For simplicity, assume homogeneous spreading
  – Simple epidemic: anyone can infect anyone with equal probability
• Assume that k members are already infected
• And that the infection occurs in rounds

Probability of Infection
• What is the probability Pinfect(k, n) that a particular uninfected member is infected in a round, if k members are already infected?
• Pinfect(k, n) = 1 − P(nobody infects the member) = 1 − (1 − 1/n)^k
• E[# newly infected members] = (n − k) × Pinfect(k, n)
• Basically, the number of new infections per round follows a binomial distribution
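To make the formula concrete, here is a small simulation sketch (not from the slides): each round, every uninfected member becomes infected with probability Pinfect(k, n), and we count the rounds until the whole group is infected. The group sizes below are arbitrary choices; the round counts grow roughly like log n, as the next slides discuss.

```python
# Simulation sketch of the simple epidemic (illustrative assumptions):
# each round, every uninfected member is infected independently with
# probability Pinfect(k, n) = 1 - (1 - 1/n)^k, where k is the current
# number of infected members.
import random

def rounds_to_infect_all(n, rng):
    infected = 1          # start with a single infected member
    rounds = 0
    while infected < n:
        p = 1.0 - (1.0 - 1.0 / n) ** infected        # Pinfect(k, n)
        newly = sum(1 for _ in range(n - infected) if rng.random() < p)
        infected += newly                             # binomial growth
        rounds += 1
    return rounds

rng = random.Random(1)
for n in (16, 64, 256, 1024):
    print(n, rounds_to_infect_all(n, rng))            # grows roughly like log n
```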
2 Phases
• Intuition: the spread happens in two phases
  – Phase 1 (first half): 1 → n/2 members infected
  – Phase 2 (second half): n/2 → n members infected
• For large n, Pinfect(n/2, n) ≈ 1 − (1/e)^0.5 ≈ 0.4

Infection and Uninfection
• Infection
  – The initial growth factor is very high, about 2
  – At the halfway mark it is about 1.4
  – Exponential growth
• Uninfection
  – The uninfected population dies off slowly at the start
  – At the halfway mark it is about 0.4
  – Exponential decline

Rounds
• The number of rounds necessary to infect the entire population is O(log n)
• Robbert uses a base of 1.585 for his experiments

How the Protocol Works
• Each member maintains a list of (address, heartbeat) pairs
• Periodically, each member gossips:
  – It increments its own heartbeat
  – It sends (part of) its list to a randomly chosen member
• On receipt of a gossip message, the receiver merges the lists (see the sketch at the end of these notes)
• Each member keeps track of the last heartbeat of each list member

SWIM Group Membership Service
[Figure: a process group (pi, pj, …) over an asynchronous, lossy network; join, leave, and failure events update pj’s membership list]

System Design
• Join, Leave, Failure: broadcast to all processes
• Need to detect a process failure at some process quickly (to be able to broadcast it)
• Failure detector protocol specifications – detection time, accuracy, load – are specified by the application designer to SWIM and optimized by SWIM

SWIM Failure Detector Protocol
[Figure: in each protocol period of T time units, pi pings a randomly chosen pj; if no ack arrives, pi asks K randomly chosen processes to ping pj on its behalf] (see the sketch at the end of these notes)

Properties
• Expected detection time: e/(e−1) protocol periods
• Load: O(K) per process
  – Inaccuracy probability falls exponentially in K
• Process failures are detected
  – in O(log N) protocol periods w.h.p.
  – in O(N) protocol periods deterministically

Why not Heartbeating?
• Centralized: single point of failure
• All-to-all: O(N) load per process
• Logical ring: unpredictable under multiple failures

LAN Scalability
[Figure: mean time to failure detection (in units of RTT) vs. number of processes, experimental and expected curves; Win2000, 100 Base-T Ethernet LAN, protocol period = 3 × RTT, RTT = 10 ms, K = 1]

Deployment
• Broadcast a ‘suspicion’ before ‘declaring’ a process failure
• Piggyback broadcasts on ping messages
  – Epidemic-style broadcast
• WAN
  – Load on core routers
  – No representatives per subnet/domain
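For the gossip-style heartbeat protocol summarized on the “How the Protocol Works” slide, here is a rough sketch of the per-member state and the merge step. The class and field names, and the timeout-based suspicion check, are my own illustrative choices, not taken from the van Renesse et al. paper.

```python
# Sketch (illustrative, not the paper's code) of gossip-style heartbeating:
# each member keeps (address -> heartbeat) and, on receiving a gossip
# message, keeps the maximum heartbeat per member, so state only grows.
import random
import time

class GossipMember:
    def __init__(self, address, peers):
        self.address = address
        self.peers = peers                            # other members' addresses
        self.heartbeats = {address: 0}                # address -> latest heartbeat
        self.last_increase = {address: time.time()}   # for failure timeouts

    def gossip_round(self, send):
        """Increment own heartbeat and send the list to a random member."""
        self.heartbeats[self.address] += 1
        self.last_increase[self.address] = time.time()
        target = random.choice(self.peers)
        send(target, dict(self.heartbeats))           # (part of) the list

    def on_gossip(self, received):
        """Merge: keep the highest heartbeat seen for each member."""
        now = time.time()
        for addr, hb in received.items():
            if hb > self.heartbeats.get(addr, -1):
                self.heartbeats[addr] = hb
                self.last_increase[addr] = now

    def suspected_failed(self, timeout):
        """Members whose heartbeat has not increased within `timeout` seconds."""
        now = time.time()
        return [a for a, t in self.last_increase.items()
                if a != self.address and now - t > timeout]
```

Taking the maximum heartbeat per member is what gives the merge the state-monotonic property mentioned earlier.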
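Finally, here is a rough sketch of one SWIM protocol period as described on the “SWIM Failure Detector Protocol” slide: a direct ping to a random target, followed by indirect pings through K randomly chosen members if no ack arrives. The helper functions (`ping`, `ping_req`, `declare_or_suspect`) are placeholders for the real messaging layer; they are assumptions for illustration, not the paper’s code.

```python
# Rough sketch of one SWIM protocol period at process pi (helper names and
# timeout handling are illustrative assumptions, not the paper's code).
import random

def protocol_period(pi, members, ping, ping_req, K, declare_or_suspect):
    """One period of T time units: direct ping, then K indirect pings."""
    others = [m for m in members if m != pi]
    pj = random.choice(others)                 # pick a random target

    if ping(pi, pj):                           # direct ping acked in time?
        return                                 # pj looks alive

    # No ack: ask K other randomly chosen members to ping pj on our behalf.
    helpers = random.sample([m for m in others if m != pj],
                            min(K, len(others) - 1))
    if any(ping_req(helper, pj) for helper in helpers):
        return                                 # an indirect ack arrived

    # Neither direct nor indirect ack within the period: mark pj.
    declare_or_suspect(pj)                     # e.g. broadcast "suspect pj"
```

Per the Deployment slide, `declare_or_suspect` would typically disseminate a suspicion first (piggybacked on later pings) rather than immediately declaring the failure.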