DCS 13. Membership Protocols Wei Yuan 2014-1-10 Outline • Basics • Gossip-style failure detection service (Middleware,1998) • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol (DSN,2002) 2 Membership • In dynamic distributed system, a node needs knowledge of the states(alive/failed) of other nodes • Process group-based systems • Membership – Failure detection protocol • Failure Detector (FD) • Focus on fail-stop failures – A process halts and does not execute any further operations • Two styles of failure detectors – Ping-Ack Protocol – Heartbeating Protocol Ping ๐๐ ๐๐ ack heartbeat ๐๐ ๐๐ 3 Membership Protocol • Provides each process (“member”) of the group with a locallymaintained list of other non-faulty processes in the group. • Ensures that the membership list is updated with changes (members joining, or dropping out). • Membership list is available to the application. Application queries, updates, etc. Failure Detector Dissemination Membership Protocol Joins, leaves, fails of members Group Membership List Unreliable Network 4 Evaluation Metrics for FD • Completeness = every process failure is eventually detected (no misses) • Accuracy = every detected failure corresponds to a crashed process (no mistaken detection) • Speed – First detection time – Dissemination time Failure First detection Detection by all nodes Time axis Detection time = First detection time + Dissemination time 5 • Scalability – Equal Load on each member – Network Message Load (in bytes per second generated by the protocol) • In practice – Completeness : guaranteed – Accuracy : probabilistic guarantee 6 • Distributed FD through heartbeating – Centralized – Ring – All-to-all 7 Centralized Heartbeating pi ๏Hotspot pi, Heartbeat Seq. l++ pj • Heartbeats sent periodically • If heartbeat not received from pi within timeout, mark pi as failed 8 Ring Heartbeating pi ๏Unpredictable on simultaneous multiple failures pi, Heartbeat Seq. l++ pj 9 All-to-All Heartbeating pi, Heartbeat Seq. l++ ๏Scalability Issue Unscalable: network load O(N^2) pi … pj 10 Outline • Basics • Gossip-style failure detection service (Middleware,1998) • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol (DSN,2002) 11 Assumptions about the network • No bound on message delivery time – Most message delivered within reasonable time • Failure Model – fail-stop (no Byzantine, processes do not lie) • Clock rate – close to accurate (low drift), not synchronized 12 Basic protocol • Each member maintains a list <๐๐ , ๐ป๐ ,๐๐๐๐ ๐ก,๐๐ > – ๐๐ : member address, ๐ป๐ : heartbeat counter, ๐๐๐๐ ๐ก,๐๐ : last time of heartbeat increase • Every ๐๐๐๐ ๐ ๐๐ , each member – Increments its heartbeat – Select a random member and send a list of <๐๐ , ๐ป๐ > • A member, on receiving gossip message, – Merge the list (maximum heartbeat for each member) • Each member occasionally broadcasts its list – To be located, recover from network partitions • A few gossip servers, – Lack of broadcast capability, well known addresses 13 Basic protocol 1 10118 64 2 10110 64 1 10120 66 3 10090 58 2 10103 62 4 10111 65 3 10098 63 4 10111 65 2 1 Address Time (local) Heartbeat Counter 4 Protocol: •Nodes periodically gossip their membership list 1 10120 70 2 10110 64 3 10098 70 4 10111 65 3 Current time : 70 at node 2 •On receipt, the local membership list is updated 14 Basic protocol • If the heartbeat has not increased for more than ๐๐๐๐๐ seconds, the member is considered failed • ๅฝไธไธช่็น่ขซ่ฎคไธบๆฏfaultyๆถ๏ผไธ่ฝ็ซๅป่ขซๅ ้คใ • After ๐๐๐๐๐๐๐ข๐ (๐๐๐๐๐๐๐ข๐ ≥ ๐๐๐๐๐ ) seconds, it will delete the member from the list • Only detect unreachable hosts, not detect link failures – A๏ปB, A๏กC &B๏กC, A will not suspect B nor the other way around. 15 Analysis (1) • Simplified Analysis • define a different concept of round, in which only one member is gossiping. • Parrival is the probability that a gossip successfully arrives at a chosen member before the start of the next round. 16 Analysis (2) As # members increases, the detection time increase 17 Analysis (3) • There is a trade-off between minimizing the probability of making a false detection, and the failure detection time. • little extra detection time is necessary for better quality As requirement is loosened, the detection time decrease. 18 Analysis (4) For three different probabilities of message loss q, the effect of failed members on detection time. ๏ผๅคฑๆ่ฟ็จๆฐ่ฟๅ, not resilient๏ผ As # failed members increases, the detection time increase significantly 19 Analysis(5) • The effect of message loss on detection time (Pmistake = 10-6) The algorithm is resilient to message loss 20 Problem Bottleneck : cross-subnet link 21 Outline • Concepts • Gossip-style failure detection service (Middleware,1998) • SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol (DSN,2002) 22 Why SWIM? • Traditional heartbeating protocols – Scalability problem • SWIM – Use a non heartbeat-based Failure Detection component. – Separate the failure detection component and membership update dissemination component of the membership protocol . – Achieve scalability in a large group. – Reduce false positive frequency. 23 The Basic SWIM Approach • Two components: A Failure Detector Component A Dissemination Component Mi Mj random Mj ping random K ping-req ๏ผMj๏ผ Protocol period = T’ time units K random processes ack X X ping ack ack 24 The Basic SWIM Approach • Two components: A Failure Detector Component A Dissemination Component – Upon detecting the failure of another group member, multicasts failed( ๐๐ ) msg to the rest of the group 25 Dissemination Options • Multicast – Only best-effort – unreliable • Point-to-point (TCP / UDP) – expensive • Zero extra messages: Piggyback on Failure Detector messages – Infection-style Dissemination 26 A Robust and Efficient SWIM • Infection-Style Dissemination Component • Suspicion Mechanism • Round-Robin Probe Target Selection 27 Infection-Style Dissemination Component – Piggyback on failure detector messages – No extra packets for dissemination – This epidemic process spreads exponentially fast in the group 28 Suspicion Mechanism • Goal – Reducing the frequency of false positive • False detection, due to – Network packet losses (e.g., from congestion) – Slow process • Indirect pinging may not solve the problem – e.g., correlated message losses near pinged host • Key: – Avoid to drop a healthy process at the very first instance that it is mistakenly detected as failed. – suspect a process before declaring it as failed in the group 29 Suspicion Mechanism • Mi pings Mj, if Mi receives no ack either directly or through indirect probing subgroup • Mi does not declare Mj as failed. • Mi marks Mj as a suspected member – {suspect Mj: Mi suspects Mj} msg is disseminated • If later on, Ml successfully pings Mj – {Alive Mj:Ml knows Mj is alive} un-marks suspicion • If Mi times-out Mj’s suspicion before receipt of an Alive msg – {Confirm Mj: Mi declares Mj as faulty} will be disseminated – Confirm msg overrides any previous Suspect or Alive msg. • Suspicion Time-out – tradeoff the rate of false failure detection to detection time. 30 Round-Robin Probe Target Selection • In the basic SWIM – Eventual Strong completeness. – However, the detection time could be unbounded. • Goal: providing time-bounded strong completeness • Modification – Select ping target in a round-robin fashion. – After each traversal, randomly reorder the list. – a newly joining member is inserted in the membership list at a position that is chosen at random. 31 Performance Evaluation Per-node overhead don’t vary with the group size. 32 Performance Evaluation T1 Time to First Detection of a process failure apparently uncorrelated to group size. 33 Thanks! Q&A 34 Experiments Set-Up • Current implementation – Win2K, uses Winsock 2 – Uses only UDP messaging • Experimental platform – Heterogeneous cluster of commodity PCs (>32) – 100 Mbps Ethernet • Default protocol settings – Protocol period=2 s; K=1; G.C. and Suspicion timeouts=3*ceil[log(N+1)] – Each infection was piggybacked on at most 3*ceil[log(N+1)] msg – Suspicion timeouts = 3*ceil[log(N+1)] • No partial membership lists observed in experiments 35