filename*=utf8''DCS_13_Membership_Wei_Yuan&response-cache

advertisement
DCS 13. Membership Protocols
Wei Yuan
2014-1-10
Outline
• Basics
• Gossip-style failure detection service
(Middleware,1998)
• SWIM: Scalable Weakly-consistent Infection-style
Process Group Membership Protocol (DSN,2002)
2
Membership
• In dynamic distributed system, a node needs knowledge of the
states(alive/failed) of other nodes
• Process group-based systems
• Membership
– Failure detection protocol
• Failure Detector (FD)
• Focus on fail-stop failures
– A process halts and does not execute any further operations
• Two styles of failure detectors
– Ping-Ack Protocol
– Heartbeating Protocol
Ping
๐‘ƒ๐‘–
๐‘ƒ๐‘–
ack
heartbeat
๐‘ƒ๐‘—
๐‘ƒ๐‘—
3
Membership Protocol
• Provides each process (“member”) of the group with a locallymaintained list of other non-faulty processes in the group.
• Ensures that the membership list is updated with changes
(members joining, or dropping out).
• Membership list is available to the application.
Application queries,
updates, etc.
Failure Detector
Dissemination
Membership
Protocol
Joins, leaves, fails
of members
Group Membership List
Unreliable Network
4
Evaluation Metrics for FD
• Completeness = every process failure is eventually detected (no
misses)
• Accuracy = every detected failure corresponds to a crashed
process (no mistaken detection)
• Speed
– First detection time
– Dissemination time
Failure
First detection
Detection by all nodes
Time axis
Detection time = First detection time
+ Dissemination time
5
• Scalability
– Equal Load on each member
– Network Message Load (in bytes per second generated by
the protocol)
• In practice
– Completeness : guaranteed
– Accuracy : probabilistic guarantee
6
• Distributed FD through heartbeating
– Centralized
– Ring
– All-to-all
7
Centralized Heartbeating
pi
๏ŒHotspot
pi, Heartbeat Seq. l++
pj
• Heartbeats sent periodically
• If heartbeat not received from pi
within timeout, mark pi as failed
8
Ring Heartbeating
pi
๏ŒUnpredictable on
simultaneous multiple
failures
pi, Heartbeat Seq. l++
pj
9
All-to-All Heartbeating
pi, Heartbeat Seq. l++
๏ŒScalability Issue
Unscalable: network
load O(N^2)
pi
…
pj
10
Outline
• Basics
• Gossip-style failure detection service (Middleware,1998)
• SWIM: Scalable Weakly-consistent Infection-style
Process Group Membership Protocol (DSN,2002)
11
Assumptions about the network
• No bound on message delivery time
– Most message delivered within reasonable time
• Failure Model
– fail-stop (no Byzantine, processes do not lie)
• Clock rate
– close to accurate (low drift), not synchronized
12
Basic protocol
• Each member maintains a list <๐‘€๐‘– , ๐ป๐‘– ,๐‘‡๐‘™๐‘Ž๐‘ ๐‘ก,๐‘€๐‘– >
– ๐‘€๐‘– : member address, ๐ป๐‘– : heartbeat counter, ๐‘‡๐‘™๐‘Ž๐‘ ๐‘ก,๐‘€๐‘– : last time of
heartbeat increase
• Every ๐‘‡๐‘”๐‘œ๐‘ ๐‘ ๐‘–๐‘ , each member
– Increments its heartbeat
– Select a random member and send a list of <๐‘€๐‘– , ๐ป๐‘– >
• A member, on receiving gossip message,
– Merge the list (maximum heartbeat for each member)
• Each member occasionally broadcasts its list
– To be located, recover from network partitions
• A few gossip servers,
– Lack of broadcast capability, well known addresses
13
Basic protocol
1
10118
64
2
10110
64
1
10120
66
3
10090
58
2
10103
62
4
10111
65
3
10098
63
4
10111
65
2
1
Address
Time (local)
Heartbeat Counter
4
Protocol:
•Nodes periodically gossip
their membership list
1
10120
70
2
10110
64
3
10098
70
4
10111
65
3
Current time : 70 at node 2
•On receipt, the local
membership list is updated
14
Basic protocol
• If the heartbeat has not increased for more than ๐‘‡๐‘“๐‘Ž๐‘–๐‘™ seconds,
the member is considered failed
• ๅฝ“ไธ€ไธช่Š‚็‚น่ขซ่ฎคไธบๆ˜ฏfaultyๆ—ถ๏ผŒไธ่ƒฝ็ซ‹ๅˆป่ขซๅˆ ้™คใ€‚
• After ๐‘‡๐‘๐‘™๐‘’๐‘Ž๐‘›๐‘ข๐‘ (๐‘‡๐‘๐‘™๐‘’๐‘Ž๐‘›๐‘ข๐‘ ≥ ๐‘‡๐‘“๐‘Ž๐‘–๐‘™ ) seconds, it will delete the
member from the list
• Only detect unreachable hosts, not detect link failures
– A๏ƒปB, A๏กC &B๏กC, A will not suspect B nor the other way around.
15
Analysis (1)
• Simplified Analysis
• define a different concept of round, in which only one member is
gossiping.
• Parrival is the probability that a gossip successfully arrives at a
chosen member before the start of the next round.
16
Analysis (2)
As # members increases, the detection
time increase
17
Analysis (3)
• There is a trade-off between minimizing the probability of making
a false detection, and the failure detection time.
• little extra detection time is necessary for better quality
As requirement is loosened, the detection time decrease.
18
Analysis (4)
For three different probabilities of message loss q, the effect of failed
members on detection time.
๏ผˆๅคฑๆ•ˆ่ฟ›็จ‹ๆ•ฐ่ฟ‡ๅŠ, not resilient๏ผ‰
As # failed members increases, the detection time increase significantly
19
Analysis(5)
• The effect of message loss on detection time (Pmistake = 10-6)
The algorithm is resilient to message loss
20
Problem
Bottleneck : cross-subnet link
21
Outline
• Concepts
• Gossip-style failure detection service
(Middleware,1998)
• SWIM: Scalable Weakly-consistent Infection-style
Process Group Membership Protocol (DSN,2002)
22
Why SWIM?
• Traditional heartbeating protocols
– Scalability problem
• SWIM
– Use a non heartbeat-based Failure Detection component.
– Separate the failure detection component and membership update
dissemination component of the membership protocol .
– Achieve scalability in a large group.
– Reduce false positive frequency.
23
The Basic SWIM Approach
• Two components:
A Failure Detector Component
A Dissemination Component
Mi
Mj
random Mj
ping
random K
ping-req
๏ผˆMj๏ผ‰
Protocol period
= T’ time units
K random
processes
ack
X
X
ping
ack
ack
24
The Basic SWIM Approach
• Two components:
A Failure Detector Component
A Dissemination Component
– Upon detecting the failure of another group member, multicasts failed( ๐‘€๐‘— )
msg to the rest of the group
25
Dissemination Options
• Multicast
– Only best-effort
– unreliable
• Point-to-point (TCP / UDP)
– expensive
• Zero extra messages: Piggyback on Failure Detector
messages
– Infection-style Dissemination
26
A Robust and Efficient SWIM
• Infection-Style Dissemination Component
• Suspicion Mechanism
• Round-Robin Probe Target Selection
27
Infection-Style Dissemination Component
– Piggyback on failure detector messages
– No extra packets for dissemination
– This epidemic process spreads exponentially fast in the group
28
Suspicion Mechanism
• Goal
– Reducing the frequency of false positive
• False detection, due to
– Network packet losses (e.g., from congestion)
– Slow process
• Indirect pinging may not solve the problem
– e.g., correlated message losses near pinged host
• Key:
– Avoid to drop a healthy process at the very first instance that it is
mistakenly detected as failed.
– suspect a process before declaring it as failed in the group
29
Suspicion Mechanism
• Mi pings Mj, if Mi receives no ack either directly or through
indirect probing subgroup
• Mi does not declare Mj as failed.
• Mi marks Mj as a suspected member
– {suspect Mj: Mi suspects Mj} msg is disseminated
• If later on, Ml successfully pings Mj
– {Alive Mj:Ml knows Mj is alive} un-marks suspicion
• If Mi times-out Mj’s suspicion before receipt of an Alive msg
– {Confirm Mj: Mi declares Mj as faulty} will be disseminated
– Confirm msg overrides any previous Suspect or Alive msg.
• Suspicion Time-out
– tradeoff the rate of false failure detection to detection time.
30
Round-Robin Probe Target Selection
• In the basic SWIM
– Eventual Strong completeness.
– However, the detection time could be unbounded.
• Goal: providing time-bounded strong completeness
• Modification
– Select ping target in a round-robin fashion.
– After each traversal, randomly reorder the list.
– a newly joining member is inserted in the membership list at a position
that is chosen at random.
31
Performance Evaluation
Per-node overhead
don’t vary with the
group size.
32
Performance Evaluation
T1
Time to First Detection of a process failure
apparently uncorrelated to group size.
33
Thanks!
Q&A
34
Experiments Set-Up
• Current implementation
– Win2K, uses Winsock 2
– Uses only UDP messaging
• Experimental platform
– Heterogeneous cluster of commodity PCs (>32)
– 100 Mbps Ethernet
• Default protocol settings
– Protocol period=2 s; K=1; G.C. and Suspicion timeouts=3*ceil[log(N+1)]
– Each infection was piggybacked on at most 3*ceil[log(N+1)] msg
– Suspicion timeouts = 3*ceil[log(N+1)]
• No partial membership lists observed in experiments
35
Download