CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 10: Sept. 26
Agreement on
Membership
• Recall our approach:
– Detecting failure is a lost cause.
• Too many things can mimic failure
• To be accurate would end up waiting for a
process to recover
– Substitute agreement on membership
• Now we can drop a process because it isn’t
fast enough
• This can seem “arbitrary”, e.g. A kills B…
• GMS implements this service for
everyone else
Architecture
[Layered diagram, top to bottom:]
• Applications use replicated data for high availability
• 3PC-like protocols use membership changes instead of failure notification
• Membership Agreement, “join/leave” and “P seems to be unresponsive”
Architecture
[Figure: application processes A, B, C, D join and leave while GMS processes X, Y, Z deliver membership views to them; the views {A}, {A,D}, {A,B,D}, {A,D,C} and {D,C} appear, along with the annotation “A seems to have failed”.]
Contrast dynamic with
static model
• Static model: fixed set of processes “tied”
to resources
– Processes may be unreachable (while failed or
partitioned away) but later recover
– Think: “cluster of PCs”
• Dynamic model: changing set of processes
launched while system runs, some
fail/terminate
– Failed processes never recover (partitioned
process may reconnect, but uses a new pid)
– And can still own a physical resource, allowing
us to emulate a static model
Consistency options
• Could require that system always be
consistent with actions taken at a
process even if that process fails
immediately after taking the action
– This property is needed in systems that
take external actions, like advising an
air traffic controller
– May not be needed in high availability
systems
• Alternative is to require that
operational part of system remain
continuously self-consistent
Obstacles to progress
• Fischer, Lynch and Paterson result:
proof that agreement protocols
cannot be both externally consistent
and live in asynchronous
environments
• Suggests that choice between
internal consistency and external
consistency is a fundamental one!
• Can show that this result also
applies to dynamic membership
problems
Usual response to FLP:
Chandra/Toueg
• Consider system as having a failure
detector that provides input to the
basic system itself
• Agreement protocols within system
are considered safe and live if they
satisfy their properties and are live
when the failure detector is live
• Babaoglu: expresses similar result in
terms of reachability of processes:
protocols are live during periods of
reachability
Towards an Alternative
• In this lecture, focus on
systems with self-defined
membership
• Idea is that if p can’t talk to q it
will initiate a membership
change that removes q from p’s
system “membership view”
• Illustrated on next slide
Commit protocol from last lecture
[Figure: the commit run from last lecture — the coordinator asks “ok to commit?”, some participants answer “ok”, others are left with “vote unknown!” and “decision unknown!”.]
Suppose this is a partitioning failure
[Figure: the same run, with the unreachable participants on the far side of a partition.]
Do these processes actually need to be consistent with the others?
Primary partition
concept
• Idea is to identify notion of “the
system” with a unique component of
the partitioned system
• Call this distinguished component
the “primary” partition of the system
as a whole.
– Primary partition can speak with
authority for the system as a whole
– Non-primary partitions have weaker
consistency guarantees and limited
ability to initiate new actions
Ricciardi: Group
Membership Protocol
• For use in a group membership service
(usually just a few processes that run on
behalf of whole system)
• Tracks own membership; own members
use this to maintain membership list for
the whole system
• All users of the service see subsequences
of a single system-wide group membership
history
• GMS also tracks the primary partition
GMP protocol itself
• Used only to track membership of the
“core” GMS
• Designates one GMS member as the
coordinator
• Switches between 2PC and 3PC
– 2PC if the coordinator didn’t fail and other
members failed or are joining
– 3PC if the coordinator failed and some other
member is taking over as new coordinator
• Question: how to avoid “logical
partitioning”?
GMS majority
requirement
• To move from system “view” i to
view i+1, GMS requires explicit
acknowledgement by a majority of
the processes in view i
• Can’t get a majority: causes GMS to
lose its primaryness information
• Dahlia Malkhi has extended GMP to
support partitioning and remerging;
similar idea used by Yair Amir and
others in Totem system
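A minimal sketch of the majority rule, in Python, with a hypothetical collect_ack callback standing in for the real acknowledgement round:

```python
# Illustrative sketch only: view i+1 may be installed only if a strict
# majority of the members of view i explicitly acknowledge the change.
def try_install_next_view(current_view, proposed_view, collect_ack):
    """current_view, proposed_view: lists of process ids.
    collect_ack(member, proposed_view) -> True if that member acks (hypothetical)."""
    acks = sum(1 for m in current_view if collect_ack(m, proposed_view))
    if 2 * acks > len(current_view):      # strict majority of view i
        return proposed_view              # this component remains the primary
    raise RuntimeError("no majority: this GMS component loses primaryness")
```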
GMS in Action
[Figure: processes p0 ... p5. p0 is the initial coordinator. p1 and p2 join, then p3...p5 join. But p0 fails during the join protocol, and later so does p3. Notice the use of majority consent to avoid partitioning!]
[Figure: the same run shown as protocol rounds — p0 runs 2-phase commits as coordinator; after p0 fails, p1 takes over as the new coordinator with a 3-phase round, then continues with 2-phase rounds.]
What if system has thousands
of processes?
• Idea is to build a GMS subsystem
that runs on just a few nodes
• GMS members track themselves
• Other processes ask to be admitted
to system or for faulty processes to
be excluded
• GMS treats overall system
membership as a form of replicated
data that it manages, reports to its
“listeners”
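A rough sketch of the listener idea, assuming a hypothetical callback-style interface (not any particular system’s API):

```python
# The GMS keeps the full system membership as replicated data and pushes
# every new view to its registered listeners.
class MembershipReporter:
    def __init__(self):
        self.view_id = 0
        self.members = set()
        self.listeners = []          # callbacks: f(view_id, frozenset_of_members)

    def register(self, callback):
        self.listeners.append(callback)
        callback(self.view_id, frozenset(self.members))   # hand over the current view

    def report_change(self, joins=(), leaves=()):
        self.members |= set(joins)
        self.members -= set(leaves)
        self.view_id += 1
        for cb in self.listeners:
            cb(self.view_id, frozenset(self.members))
```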
Uses of membership?
• If we rewire TCP and RPC to
use membership changes as
trigger for breaking
connections, can eliminate
split-brain problems!
– But nobody really does this
– Problem is that networks lack
standard GMS subsystems now!
• But we can still use it ourselves
Replicated data within
groups
• A very general requirement:
– Data actually managed by group
– Inputs and outputs, in a server
replicated for fault-tolerance
– Coordination and synchronization data
• Will see how to solve this, and then
will use solution to implement
“process groups” which are
subgroups of the overall system
membership
Replicated data
• Assume that we have a (dynamically
defined) group of processes G and
that its members manage a
replicated data item
• Goal: update by sending a multicast
to G
• Should be able to safely read any
copy “locally”
• Consider situation where members
of G may fail or recover
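As a toy illustration of the pattern (not a full protocol), assume some reliable multicast layer that invokes deliver_update() at every member of G:

```python
# Toy replica: updates arrive via the group multicast, reads are purely local.
class ReplicatedValue:
    def __init__(self, initial=0):
        self.value = initial

    def deliver_update(self, new_value):   # called by the multicast layer at each member
        self.value = new_value

    def read(self):                        # safe to read the local copy
        return self.value
```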
Some Initial
Assumptions
• For now, assume that we work directly on
the real network, not using Ricciardi’s GMS
• Later will need to put GMS in to solve a
problem this raises, but for now, the model
will be the very simple one: processes that
communicate using messages,
asynchronous network, crash failures
• We’ll also need our own implementation of
TCP-style reliable point-to-point channels
using GMS as input
Process group model
• Initially, we’ll assume we are simply
given the model
• Later will see that we can use
reliable multicast to implement the
model
• First approximation: a process group
is defined by a series of “views” of
its membership. All members see
the same sequence of view changes.
Failures, joins reported by changing
membership
Process groups with joins, failures
[Figure: timelines of p, q, r, s, t. View sequence: G0={p,q}; r, s request to join; G1={p,q,r,s} (r, s added, state transfer); p crashes; G2={q,r,s}; t requests to join; G3={q,r,s,t} (t added, state transfer).]
State transfer
• Method for passing information
about state of a group to a
joining member
• Looks instantaneous, at time
the member is added to the
view
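A sketch of the idea under simple assumptions (dictionary-valued state, a single donor chosen arbitrarily):

```python
# The joiner receives a snapshot of a current member's state in the same
# step that installs the new view, so the transfer appears instantaneous.
def install_view_with_state_transfer(view, state_by_member, new_member):
    """view: list of member ids; state_by_member: member id -> state dict."""
    donor = view[0]                                              # any current member can donate
    state_by_member[new_member] = dict(state_by_member[donor])   # checkpoint at the view boundary
    return view + [new_member]                                   # the new view includes the joiner
```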
Outline of treatment
• First, look at reliability and failure
atomicity
• Next, look at options for “ordering”
in group multicast
• Next, discuss implementation of the
group view mechanisms themselves
• Finally, return to state transfer
• Outcome: process groups, group
communication, state transfer, and
fault-tolerance properties
Atomic delivery
• Atomic or failure atomic delivery
– If any process receives the message and
remains operational, all operational
destinations receive it
[Figure: multicasts a and b to {p,q,r,s}, where p and q fail. All processes that receive a subsequently fail. All processes receive b.]
Additional properties
• A multicast is dynamically
uniform if:
– If any process delivers the
multicast, all group members that
don’t fail will deliver it (even if the
initial recipient fails immediately
after delivery).
• Otherwise we say that the
multicast is “not uniform”
Uniform and non-uniform delivery
[Figures: multicasts a and b to {p,q,r,s}, where p and q fail. One run shows uniform delivery of a and b; the other shows non-uniform delivery of a.]
Stronger properties cost
more
• Weaker ordering guarantees are
cheaper than stronger ones
• Non-uniform delivery is cheap
• Dynamic uniformity is costly
• Dynamic membership is cheap
• Static membership is more
costly
Conceptual cost graph
[Figure: cost plotted against ordering strength (less ordered → local total order → global total order) and against delivery guarantee (non-uniform, dynamic group → uniform → static group). Two reference points:]
• Uniform and globally total “abcast” in a static group — total, safe abcast in Totem or Transis: 600/second, 750 ms latency sender to dest
• Asynchronous and non-uniform “cbcast” to a dynamically defined group — cbcast in Horus: 85,000/second, 85 µs latency sender to dest
Implementing multicast
primitives
• Initially assume a static process
group
• Crash failures: permanent failures, a
process fails by crashing
undetectably. No GMS (at first).
• Unreliable communication:
messages can be lost in the
channels
• ... looks like the asynchronous model
of FLP
Failures?
• Message loss: overcome with
retransmission
• Process failures: assume they “crash”
silently
• Network failures: also called “partitioning”
• Can’t distinguish between these cases!
[Figure: the network partitions; p times out and decides “q failed!” while q times out and decides “p failed!”]
Multicast by “flooding”
• All recipients echo message to all other recipients, O(n²) messages exchanged
• Reject duplicates on basis of message id
[Figures: p multicasts a to {p,q,r,s}; each recipient echoes a to all others, so the surviving members still receive a even when some processes fail partway through.]
• When can we garbage collect the id?
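A sketch of the flooding rule with duplicate suppression; send(dest, msg) stands for a hypothetical unreliable point-to-point channel:

```python
# Flooding receiver: deliver once, remember the id, and echo to everyone else.
class FloodingReceiver:
    def __init__(self, my_id, group, send):
        self.my_id, self.group, self.send = my_id, set(group), send
        self.seen = set()                        # ids already delivered (when can we drop these?)

    def on_receive(self, msg_id, payload):
        if msg_id in self.seen:
            return                               # duplicate: rejected by message id
        self.seen.add(msg_id)
        self.deliver(payload)
        for dest in self.group - {self.my_id}:   # echo to all other recipients: O(n^2) overall
            self.send(dest, (msg_id, payload))

    def deliver(self, payload):
        print(f"{self.my_id} delivers {payload!r}")
```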
Garbage collection
issue
• Must remember id as long as might
still see a duplicate copy
• If no process fails: garbage collect
after echoed by all destinations
• Very similar to 3PC protocol
... correctness of this protocol
depends upon having an accurate
way to detect failure! Return to this
point in a few minutes.
“Lazy” flooding and garbage collection
• Idea is to delay “non urgent” messages
• Recipients delay the echo in hope that sender will confirm successful delivery: O(n) messages
[Figures: p multicasts a to {p,q,r,s}; the recipients ack to p; p then announces “all got it...”; a third phase garbage collects the id, even as some members fail.]
• Notice that garbage collection occurs in 3rd phase
“Lazy” flooding, delayed phases
• “Background” acknowledgements (not shown)
• Piggyback 2nd, 3rd phase on other multicasts
[Figures: p sends m1; m2 piggybacks “all got m1”; m3 piggybacks “gc m1”; m4 piggybacks “gc m2” — the 2nd and 3rd phases of each multicast ride on later multicasts, even when a group member fails.]
• Reliable multicasts now look cheap!
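A sender-side sketch of the lazy scheme, under the simplifying assumption that the phase-2 (“all got it”) and phase-3 (“gc”) notices simply ride on the next two outgoing multicasts:

```python
# Lazy sender: acks arrive in the background; the "all got m" and "gc m"
# notices are piggybacked on later multicasts instead of sent right away.
class LazySender:
    def __init__(self, group, send):
        self.group, self.send = set(group), send
        self.next_id = 0
        self.acks = {}                 # msg_id -> members that have acked
        self.pending_all_got = []      # phase-2 notices awaiting a ride
        self.pending_gc = []           # phase-3 notices awaiting a ride

    def multicast(self, payload):
        msg_id, self.next_id = self.next_id, self.next_id + 1
        self.acks[msg_id] = set()
        piggyback = {"all_got": self.pending_all_got, "gc": self.pending_gc}
        # the gc notice for an id follows one multicast behind its all_got notice
        self.pending_gc = list(piggyback["all_got"])
        self.pending_all_got = []
        for dest in self.group:
            self.send(dest, (msg_id, payload, piggyback))
        return msg_id

    def on_ack(self, msg_id, member):
        self.acks.setdefault(msg_id, set()).add(member)
        if self.acks[msg_id] >= self.group:      # every destination has a copy
            self.pending_all_got.append(msg_id)  # phase 2 will piggyback later
            del self.acks[msg_id]
```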
Lazy scheme continued
• If sender fails, recipients switch
to flood-style algorithm
... but now we have the same
garbage collection problem: if
sender fails we may never be able
to garbage collect the id!
• Problem is caused by lack of
failure detector
Garbage collection with
inaccurate failure detections
• ... we lack an accurate way to detect
failure
– If a process does seem to fail, but is really still
operational and merely partitioned
away, the connection might later be
fixed.
– That process might “wake up” and send
a duplicate
• Hence, if we are not sure a process
has failed, can’t garbage collect our
duplicate-suppression data yet!
Exploiting a failure
detector
• Suppose that we had a failstop
environment
• Process group membership managed by
oracle, perhaps the GMS we saw earlier
• Failures reported as “new group views”
• All see the same sequence of views:
– G = {p,q,r,s} {p,r,s} {r,s}
• Now can assume failures are accurately
detected
Now our lazy scheme
works!
• Garbage collect when all non-faulty
processes are known to have
received the message
• Use process ranking to pick a new
“coordinator” if the initial one fails
• Cost only reaches O(n²) if many fail
during the protocol
• Can delay 2nd, 3rd round if desired
• Also link GMS to point-to-point
channel implementation
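A sketch of the rank-based takeover, assuming views list members in rank order and a hypothetical finish_protocol callback that re-runs the outstanding 2nd/3rd phases:

```python
# If a view change removes the coordinator, the lowest-ranked survivor takes over.
def coordinator(view):
    return view[0]                              # rank 0 = current coordinator

def on_view_change(old_view, new_view, finish_protocol):
    survivors = [m for m in old_view if m in new_view]
    if coordinator(old_view) not in new_view:   # the old coordinator was dropped
        finish_protocol(new_coordinator=survivors[0])
    return new_view
```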
Failure Detectors
• Needed as “input” to GMS. For now,
just assume we have one, perhaps
Vogels’ investigator
• In practice many systems use
“timeout”, but timeout is not safe for
our purposes
• Feeding detections through group
membership service converts
inaccurate failure detections into
what look like failstop failures for
processes within the system
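A sketch of a timeout-based suspector that only reports suspicions to the GMS; gms_report and the heartbeat source are placeholders:

```python
import time

# The detector merely *suspects* a peer and hands the suspicion to the GMS;
# the GMS agreement protocol, not the detector, decides who has "failed".
class TimeoutSuspector:
    def __init__(self, gms_report, timeout=5.0):
        self.gms_report = gms_report        # e.g. gms_report("q seems to be unresponsive")
        self.timeout = timeout
        self.last_heard = {}                # peer id -> time of last heartbeat

    def heartbeat(self, peer):
        self.last_heard[peer] = time.monotonic()

    def check(self):
        now = time.monotonic()
        for peer, t in self.last_heard.items():
            if now - t > self.timeout:
                self.gms_report(peer)       # suspicion only; may be wrong
```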
Cutting Channels to
Failed Processes
• When a process is dropped from
the membership, break the
connection to it
• This will effectively eliminate
the risk of “late” delivery of
duplicate messages, etc.
• Makes a partitioning failure look
like a failstop failure.
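A sketch of the channel-cutting step, assuming socket-like channel objects with a close() method:

```python
# When the new view excludes a peer, sever the channel so no "late" or
# duplicate messages from it can be delivered; a partition now looks failstop.
def apply_view(new_view, channels):
    """channels: dict peer id -> socket-like object with close()."""
    for peer in list(channels):
        if peer not in new_view:
            channels[peer].close()          # further traffic from this peer is refused
            del channels[peer]
```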
Dynamic uniformity
• This property requires an extra
phase of communication
• Phase 1: distribute message
• Phase 2: can deliver if all non-faulty
processes received it in phase 1
• Insight: no process delivers a
message until all have received it
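A receiver-side sketch of the two phases (the sender’s phase-2 announcement and its failure handling are omitted):

```python
# Receivers buffer in phase 1 and deliver only after the sender announces
# that all non-faulty members have received the message (phase 2).
class UniformReceiver:
    def __init__(self):
        self.buffered = {}                  # msg_id -> payload, received but not yet delivered

    def on_phase1(self, msg_id, payload, ack):
        self.buffered[msg_id] = payload
        ack(msg_id)                         # tell the sender we have a copy

    def on_phase2(self, msg_id):            # "everyone has it" -> safe to deliver
        payload = self.buffered.pop(msg_id)
        self.deliver(payload)

    def deliver(self, payload):
        print("delivering", payload)
```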
Summary
• We know how to build a GMS that
tracks its own membership
• We know how to build an unordered
reliable multicast
– Actually, “sender-ordered”
– But messages from different senders can
be delivered in arbitrary order
• And we know how to support various
forms of uniformity
• Next: multicast ordering