CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 9: Sept. 21
Conclusion?
• We set out to replicate data for
increased availability
• And concluded that
– Quorum scheme works for updates
– But commit is required
– And represents a vulnerability
• Other options?
Other options
• We mentioned primary-backup
schemes
• These are a second way to
solve the problem
• Based on the log at the data
manager
Server replication
• Suppose the primary sends the
log to the backup server
• It replays the log and applies
committed transactions to its
replicated state
• If primary crashes, the backup
soon catches up and can take
over
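A minimal sketch of the log-replay idea just described, in Python (illustrative: the record format and class names are assumptions, not from the lecture):

    # Backup replica: receives the primary's log and applies only
    # committed transactions to its copy of the state.
    class Backup:
        def __init__(self):
            self.state = {}      # replicated data: key -> value
            self.pending = {}    # txid -> list of (key, value) updates

        def receive(self, kind, txid, update=None):
            if kind == "update":
                self.pending.setdefault(txid, []).append(update)
            elif kind == "commit":
                # durable at the primary, so apply it here too
                for key, value in self.pending.pop(txid, []):
                    self.state[key] = value
            elif kind == "abort":
                self.pending.pop(txid, None)

        def take_over(self):
            # Primary crashed: discard transactions with no commit record.
            # Some may in fact have committed at the primary; these are
            # the "missed last few updates" on the next slide.
            self.pending.clear()
            return self.state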
Primary/backup
[Diagram: clients connected to the primary, which streams its log to the backup]
Clients initially connected to primary, which keeps backup up to date. Backup tracks the log.
Primary/backup
[Diagram: the primary has crashed; the backup detects the broken channel]
Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!
Primary/backup
[Diagram: clients reconnect to the backup, which takes over as primary]
Clients detect the failure and reconnect to backup. But some clients may have “gone away”. Backup state could be slightly stale. New transactions might suffer from this.
Issues?
• Under what conditions should the backup take over?
– Revisits the consistency problem seen earlier with clients and servers
– Could end up with a “split brain”
• Also notice that this still needs 2PC to ensure that primary and backup stay in the same state!
Split brain: reminder
[Diagram: clients connected to the primary, which streams its log to the backup]
Clients initially connected to primary, which keeps backup up to date. Backup follows the log.
Split brain: reminder
[Diagram: some links break, but not all]
Transient problem causes some links to break but not all. Backup thinks it is now primary; primary thinks backup is down.
Split brain: reminder
[Diagram: clients split between the two servers]
Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both.
Implication?
• A strict interpretation of ACID leads to the conclusion that
– There are no ACID replication schemes that provide high availability
• Most real systems solve this by weakening ACID
Real systems
• They use primary-backup with
logging
• But they simply omit the 2PC
– Server might take over in the
wrong state (may lag state of
primary)
– Can use hardware to reduce or
eliminate split brain problem
How does hardware help?
• Idea is that primary and backup
share a disk
• Hardware is configured so only one
can write the disk
• If the backup takes over, it grabs the “token” (sketched below)
• Token loss causes the primary to shut down (if it hasn’t actually crashed)
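A sketch of the token idea in Python (illustrative: real systems implement the token in hardware, for example with disk reservations):

    import threading

    class DiskToken:
        """Models the single write-token on the shared disk."""
        def __init__(self):
            self._lock = threading.Lock()
            self.holder = None

        def grab(self, server):
            with self._lock:
                old, self.holder = self.holder, server
            if old is not None and old is not server:
                old.fenced = True    # the previous holder is fenced off
            return old

    class Server:
        def __init__(self, name, token):
            self.name, self.token, self.fenced = name, token, False

        def write(self, data):
            if self.fenced or self.token.holder is not self:
                # token loss: shut down rather than risk split brain
                raise RuntimeError(self.name + ": lost the token, shutting down")
            # ... perform the disk write ...

Even if the old primary is still alive during a partition, its next write fails and it shuts down, so at most one server ever writes the disk.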
Reconciliation
• This is the problem of fixing the
transactions impacted by lack of 2PC
• Usually just a handful of transactions
– They committed, but the backup doesn’t know because it never saw the commit record
– Later, the server recovers and we discover the problem
• Need to apply the missing ones
• Also causes cascaded rollback
• Worst case may require human intervention
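A sketch of the reconciliation step in Python (illustrative: the log record format is an assumption):

    # When the failed primary recovers, compare its log with what the
    # backup applied; commits the backup never saw must be re-applied.
    def find_missing_commits(primary_log, applied_txids):
        return [rec for rec in primary_log
                if rec["kind"] == "commit" and rec["txid"] not in applied_txids]

    # Transactions that read state written by a missing commit may then
    # need a cascaded rollback; in the worst case a human sorts it out.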
Summary
• Reliability can be understood in
terms of
– Availability: system keeps running
during a crash
– Recoverability: system can recover
automatically
• Transactions are best for the latter
• Some systems need both sorts of
mechanisms, but there are “deep”
tradeoffs involved
Replication and High Availability
• All is not lost!
• Suppose we move away from the
transactional model
• Can we replicate data at lower cost
and with high availability?
– Leads to “virtual synchrony” model
– Treats data as the “state” of a group of
participating processes
– Replicated update: done with multicast
Steps to a solution
• First look more closely at 2PC, 3PC,
failure detection
– 2PC and 3PC both “block” in real settings
– But we can replace failure detection by
consensus on membership
– Then these protocols become non-blocking (although solving a slightly different problem)
• Generalized approach leads to
ordered atomic multicast in dynamic
process groups
Non-blocking Commit
• Goal: a protocol that allows all
operational processes to
terminate the protocol even if
some subset crash
• Needed if we are to build high
availability transactional
systems (or systems that use
quorum replication)
Definition of problem
• Given a set of processes, one of
which wants to initiate an action
• Participants may vote for or against
the action
• Originator will perform the action only if all vote in favor; if any votes against (or doesn’t vote), we will “abort” the protocol and not take the action
• Goal is all-or-nothing outcome
Non-triviality
• Want to avoid solutions that do
nothing (trivial case of “all or none”)
• Would like to say that if all vote for
commit, protocol will commit
... but in distributed systems we can’t be
sure votes will reach the coordinator!
– any “live” protocol risks making a
mistake and counting a live process that
voted to commit as a failed process,
leading to an abort
• Hence, non-triviality condition is
hard to capture
Typical protocol
• Coordinator asks all processes
if they can take the action
• Processes decide if they can
and send back “ok” or “abort”
• Coordinator collects all the
answers (or times out)
• Coordinator computes outcome
and sends it back
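The coordinator’s side of this protocol as a Python sketch (illustrative: vote() and decide() stand in for the messages to a participant; they are not a real API):

    def two_phase_commit(participants, timeout):
        votes = []
        for p in participants:
            try:
                votes.append(p.vote(timeout=timeout))  # "ok" or "abort"
            except TimeoutError:
                votes.append("abort")                  # no vote counts as abort
        outcome = "commit" if all(v == "ok" for v in votes) else "abort"
        for p in participants:
            p.decide(outcome)   # participants are blocked until this arrives
        return outcome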
Commit protocol illustrated
[Diagram, three steps: the coordinator asks the participants “ok to commit?”; the participants reply “ok with us”; the coordinator sends “commit”]
Note: garbage collection protocol not shown here.
Failure issues
• So far, have implicitly assumed that
processes fail by halting (and hence
not voting)
• In real systems a process could fail
in arbitrary ways, even maliciously
• This has led to work on the “Byzantine generals” problem, a variation on commit set in a “synchronous” model with malicious failures
Failure model impacts costs!
• Byzantine model is very costly: 3t+1
processes needed to overcome t
failures, protocol runs in t+1 rounds
• This cost is unacceptable for most real systems, hence these protocols are rarely used
• Main area of application: hardware
fault-tolerance, security systems
• For these reasons, we won’t study
such protocols
Commit with simpler failure model
• Assume processes fail by halting
• Coordinator detects failures (unreliably) using timeouts. It can make mistakes!
• Now the challenge is to terminate
the protocol if the coordinator fails
instead of, or in addition to, a
participant!
Commit protocol illustrated
[Diagram: the coordinator asks “ok to commit?”; one participant replies “ok with us”, but another has crashed; the coordinator times out and sends “abort!”]
Note: garbage collection protocol not shown here.
Example of a hard scenario
• Coordinator starts the protocol
• One participant votes to abort,
all others to commit
• Coordinator and one participant
now fail
... we now lack the information to
correctly terminate the
protocol!
Commit protocol illustrated
[Diagram: the coordinator asks “ok to commit?” and then crashes, its decision unknown; two participants have voted “ok”, and the vote of a third, also crashed, is unknown]
Example of a hard scenario
• Problem is that if coordinator told
the failed participant to abort, all
must abort
• If it voted for commit and was told to
commit, all must commit
• Surviving participants can’t deduce
the outcome without knowing how
failed participant voted
• Thus protocol “blocks” until
recovery occurs
Skeen: Three-phase commit
• Seeks to increase availability
• Makes an unrealistic
assumption that failures are
accurately detectable
• With this, can terminate the
protocol even if a failure does
occur
Skeen: Three-phase commit
• Coordinator starts protocol by sending
request
• Participants vote to commit or to abort
• Coordinator collects votes, decides on
outcome
• Coordinator can abort immediately
• To commit, coordinator first sends a
“prepare to commit” message
• Participants acknowledge, commit occurs
during a final round of “commit” messages
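The same coordinator logic with the extra round, again as an illustrative Python sketch (prepare_to_commit() is an assumed participant method):

    def three_phase_commit(participants, timeout):
        try:
            votes = [p.vote(timeout=timeout) for p in participants]
        except TimeoutError:
            votes = ["abort"]          # a missing vote forces an abort
        if not all(v == "ok" for v in votes):
            for p in participants:
                p.decide("abort")      # abort can be sent immediately
            return "abort"
        for p in participants:
            p.prepare_to_commit()      # phase 2: every survivor learns the
                                       # outcome will be commit, and acks
        for p in participants:
            p.decide("commit")         # phase 3: actually commit
        return "commit"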
Three phase commit protocol illustrated
[Diagram, three rounds: the coordinator asks “ok to commit?”; participants answer “ok ....”; the coordinator sends “prepare to commit”; participants answer “prepared...”; the coordinator sends “commit”]
Note: garbage collection protocol not shown here.
Observations about 3PC
• If any process is in “prepare to commit”, all voted for commit
• Protocol commits only when all
surviving processes have
acknowledged prepare to commit
• After coordinator fails, it is easy to
run the protocol forward to commit
state (or back to abort state)
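The resulting termination rule, as a small sketch (it assumes failures are detected accurately, per the next slide):

    def terminate(survivor_states):
        # each state is one of "voted-ok", "prepared", "committed", "aborted"
        if "aborted" in survivor_states:
            return "abort"
        if "committed" in survivor_states or "prepared" in survivor_states:
            return "commit"  # a prepared process implies all voted to commit
        return "abort"       # nobody prepared, so nobody can have committed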
Assumptions about failures
• If the coordinator suspects a failure,
the failure is “real” and the faulty
process, if it later recovers, will
know it was faulty
• Failures are detectable with
bounded delay
• On recovery, process must go
through a reconnection protocol to
rejoin the system! (Find out status
of pending protocols that terminated
while it was not operational)
Problems with 3PC
• With realistic failure detectors (that can
make mistakes), protocol still blocks!
• Bad case arises during “network
partitioning” when the network splits the
participating processes into two or more
sets of operational processes
• Can prove that this problem is not
avoidable: there are no non-blocking
commit protocols for asynchronous
networks
Situation in practical systems?
• Most use protocols based on 2PC: 3PC is
more costly and ultimately, still subject to
blocking!
• Need to extend with a form of garbage
collection mechanism to avoid
accumulation of protocol state information
(can solve in the background)
• Some systems simply accept the risk of
blocking when a failure occurs
• Others weaken the consistency property to make progress, at the risk of inconsistency with a failed process
Process groups
• To overcome the cost of replication, we will introduce the dynamic process group model (processes that join and leave while the system is running)
– Will also relax our consistency goal:
seek only consistency within a set of
processes that all remain operational
and members of the system
– In this model, 3PC is non-blocking!
• Yields an extremely cheap
replication scheme!
Failure detection
• Basic question: how to detect a failure
– Wait until the process recovers. If it
was dead, it tells you
• I died, but I feel much better now
• Could be a long wait
– Use some form of probe
• But might make mistakes
– Substitute agreement on membership
• Now, failure is a “soft” concept
• Rather than “up” or “down” we think about
whether a process is behaving acceptably in
the eyes of peer processes
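A probe sketched in Python (illustrative): note that all it can ever report is a suspicion.

    import socket

    def probe(host, port, timeout=1.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return "responsive"
        except OSError:
            return "suspected"  # maybe crashed, maybe just slow; the probe
                                # cannot tell, which is why mistakes happen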
Architecture
[Layered diagram, top to bottom:]
• Applications use replicated data for high availability
• 3PC-like protocols use membership changes instead of failure notification
• Membership agreement, “join/leave” and “P seems to be unresponsive”
Issues?
• How to “detect” failures
– Can use timeout
– Or could use other system monitoring
tools and interfaces
– Sometimes can exploit hardware
• Tracking membership
– Basically, need a new replicated service
– System membership “lists” are the data
it manages
– We’ll say it takes join/leave requests as
input and produces “views” as output
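That interface, as an illustrative Python sketch (names are assumptions; a real GMS replicates this state across its members):

    class MembershipService:
        def __init__(self, initial_members):
            self.view_id = 0
            self.members = set(initial_members)

        def _emit_view(self):
            # each output is a "view": a numbered snapshot of the membership
            self.view_id += 1
            return (self.view_id, frozenset(self.members))

        def join(self, p):
            self.members.add(p)
            return self._emit_view()

        def leave(self, p):
            # a voluntary leave, or "p seems to be unresponsive"
            self.members.discard(p)
            return self._emit_view()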
Architecture
[Diagram: application processes A, B, C, D send “join” and “leave” requests, and suspicions such as “A seems to have failed”, to the GMS processes X, Y, Z; the GMS outputs a sequence of membership views, e.g. {A}, {A,D}, {A,B,D}, {A,D,C}, {D,C}]
Issues
• Group membership service (GMS)
has just a small number of members
– This core set tracks membership for a large number of system processes
– Internally it runs a group membership
protocol (GMP)
• Full system membership list is just
replicated data managed by GMS
members, updated using multicast
GMP design
• What protocol should we use to track the membership of the GMS itself?
– Must avoid split-brain problem
– Desire continuous availability
• We’ll see that a version of 3PC
can be used
• But can’t “always” guarantee
liveness
Reading ahead?
• Read chapters 12, 13
• Thought problem: how important is
external consistency (called
dynamic uniformity in the text)?
• Homework: Read about FLP. Identify
other “impossibility results” for
distributed systems. What is the
simplest case of an impossibility
result that you can identify?