Fault Tolerance
(Recoverability and High Availability)
Transactions are well matched to the database model and to recoverability goals
Transactions don’t work well for non-database applications (general-purpose O/S applications) or for availability goals (systems that must keep running if applications fail)
When building high availability systems, we encounter the replication issue
Recoverability
Server can restart without intervention in a sensible state
Transactions do give us this
High availability
System remains operational during failure
Challenge is to replicate critical data needed for continued operation
Two broad approaches
Just use distributed transactions to update multiple copies of each replicated data item
We already know how to do this, with 2PC
Each server has “equal status”
Somehow treat replication as a special situation
Leads to a primary server approach with a “warm standby”
Our goal will be “1-copy serializability”
Defined to mean that the multi-copy system behaves indistinguishably from a single-copy system
Considerable formal and theoretical work has been done on this
As a practical matter
Replicate each data item
Transaction manager
Reads any single copy
Updates all copies
Notice that transaction manager must know where the copies reside
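A minimal sketch of this read-one/write-all idea in Python (the Replica class and helper names are invented for illustration, not part of any real transaction manager):

```python
class Replica:
    """Illustrative in-memory copy of a replicated data item (hypothetical class)."""
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        self.store[key] = value


def read_any(replicas, key):
    # Read any single copy (here, simply the first one).
    return replicas[0].read(key)


def update_all(replicas, key, value):
    # The transaction manager must know where every copy resides,
    # since an update has to reach all of them.
    for r in replicas:
        r.write(key, value)


copies_of_x = [Replica(), Replica(), Replica()]
update_all(copies_of_x, "x", 42)
print(read_any(copies_of_x, "x"))  # 42
```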
In fact there are two models
Static replication set: basically, the set is fixed, although some members may be down
Dynamic: the set changes while the system runs, but only has operational members listed within it
Today we stick to the static case
A series of potential issues
How can we update an object during periods when one of its replicas may be inaccessible?
How can the 2PC protocol be made fault-tolerant?
A topic we’ll study in more depth
But the bottom line is: we can’t!
Quorum methods:
Each replicated object has an update and a read quorum
Designed so that Qu + Qr > # replicas and Qu + Qu > # replicas
Idea is that any read or update will overlap with the last update
X is replicated at {a,b,c,d,e}
Possible values?
Qu = 1, Qr = 5 (violates Qu + Qu > 5)
Qu = 2, Qr = 4 (same issue)
Qu = 3, Qr = 3
Qu = 4, Qr = 2
Qu = 5, Qr = 1 (violates availability)
Probably prefer Qu = 4, Qr = 2
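As a sanity check on these numbers, here is a small Python sketch (names invented) that tests the two quorum constraints and walks through the same five choices:

```python
def quorums_ok(n, q_u, q_r):
    # Any read quorum must overlap the last update quorum (Qu + Qr > n),
    # and any two update quorums must overlap each other (Qu + Qu > n).
    return (q_u + q_r > n) and (q_u + q_u > n)


n = 5  # X is replicated at {a, b, c, d, e}
for q_u, q_r in [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]:
    status = "ok" if quorums_ok(n, q_u, q_r) else "violates Qu + Qu > 5"
    print(f"Qu={q_u}, Qr={q_r}: {status}")

# Note: Qu = 5 passes the arithmetic but requires every replica to be up
# for an update, which is what "violates availability" means above.
```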
Even reading a data item requires that multiple copies be accessed!
This could be much slower than normal local access performance
Also, notice that we won’t know if we succeeded in reaching the update quorum until we get responses
Implies that any quorum replication scheme needs a 2PC protocol to commit
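A hedged sketch of a quorum read under these rules, assuming each replica stamps its copy with a version number (class and function names are hypothetical); the update path is omitted because, as noted above, it would also need a 2PC round to commit:

```python
import random


class VersionedReplica:
    """Hypothetical replica that stamps its copy with a version number."""
    def __init__(self, version=0, value=None, up=True):
        self.version = version
        self.value = value
        self.up = up  # a replica may be inaccessible

    def read(self):
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self.version, self.value


def quorum_read(replicas, q_r):
    # Contact replicas until q_r of them respond; we only learn whether the
    # quorum was reached once the responses (or failures) come back.
    replies = []
    for r in random.sample(replicas, len(replicas)):
        try:
            replies.append(r.read())
        except ConnectionError:
            continue
        if len(replies) == q_r:
            # The read quorum overlaps the last update quorum, so the
            # highest version among the replies is the current value.
            return max(replies, key=lambda vv: vv[0])[1]
    raise RuntimeError("read quorum not reached")


replicas = [VersionedReplica(2, "new"), VersionedReplica(2, "new"),
            VersionedReplica(2, "new"), VersionedReplica(2, "new"),
            VersionedReplica(1, "old", up=False)]
print(quorum_read(replicas, q_r=2))  # "new"
```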
Now we know that we can solve the availability problem for reads and updates if we have enough copies
What about for 2PC?
Need to tolerate crashes before or during runs of the protocol
A well-known problem
It is easy to see that 2PC is not able to guarantee availability
Suppose that manager talks to 3 processes
And suppose 1 process and manager fail
The other 2 are “stuck” and can’t terminate the protocol
We’ll revisit this issue soon
Basically,
Can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them
But network problems can be indistinguishable from failures
Hence there is no commit protocol that can tolerate failures
Anyhow, cost of 3PC is very high
We set out to replicate data for increased availability
And concluded that
Quorum scheme works for updates
But commit is required
And represents a vulnerability
Other options?
We mentioned primary-backup schemes
These are a second way to solve the problem
Based on the log at the data manager
Suppose the primary sends the log to the backup server
It replays the log and applies committed transactions to its replicated state
If primary crashes, the backup soon catches up and can take over
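A toy version of this log-shipping idea (record format and class names invented; no real networking or logging): the backup applies a transaction's writes only when it sees the commit record, so on takeover it can lag by whatever records never arrived:

```python
# Log records are (txn_id, op, key, value); a "commit" op marks the transaction durable.
class Backup:
    def __init__(self):
        self.state = {}    # replicated data
        self.pending = {}  # txn_id -> list of uncommitted writes

    def apply_record(self, record):
        txn, op, key, value = record
        if op == "write":
            self.pending.setdefault(txn, []).append((key, value))
        elif op == "commit":
            # Apply a transaction's writes only once it is known to have committed.
            for k, v in self.pending.pop(txn, []):
                self.state[k] = v

    def take_over(self):
        # Discard uncommitted work; state may still lag the primary by the
        # last few committed transactions whose records never arrived.
        self.pending.clear()
        return self.state


backup = Backup()
for rec in [(1, "write", "x", 10), (1, "commit", None, None),
            (2, "write", "x", 11)]:        # commit record for txn 2 was lost
    backup.apply_record(rec)
print(backup.take_over())                  # {'x': 10}: txn 2 is missing
```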
[Figure: primary keeps a backup up to date by sending it the log]
Clients initially connected to primary, which keeps backup up to date. Backup tracks the log.
Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!
Clients detect the failure and reconnect to backup. But some clients may have “gone away”. Backup state could be slightly stale.
New transactions might suffer from this
Under what conditions should the backup take over?
Revisits the consistency problem seen earlier with clients and servers
Could end up with a “split brain”
Also notice that this still needs 2PC to ensure that primary and backup stay in the same state!
[Figure: the same primary/backup pair during a partial network failure]
Clients initially connected to primary, which keeps backup up to date. Backup follows the log.
Transient problem causes some links to break but not all. Backup thinks it is now primary; primary thinks backup is down.
Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both.
A strict interpretation of ACID leads to the conclusion that
There are no ACID replication schemes that provide high availability
Most real systems solve this by weakening ACID
They use primary-backup with logging
But they simply omit the 2PC
Server might take over in the wrong state
(may lag state of primary)
Can use hardware to reduce or eliminate split brain problem
Idea is that primary and backup share a disk
Hardware is configured so only one can write the disk
If server takes over it grabs the “token”
Token loss causes primary to shut down (if it hasn’t actually crashed)
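A rough sketch of the token idea, using a POSIX exclusive file lock as a stand-in for the shared-disk hardware (the path and all names here are illustrative, not any real API for such hardware):

```python
import fcntl
import os
import sys

TOKEN_PATH = "/shared/disk/primary.token"  # hypothetical file on the shared disk


def grab_token(path=TOKEN_PATH):
    # Only one process can hold the exclusive lock: the one allowed to write the disk.
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd      # we hold the token and may act as primary
    except BlockingIOError:
        os.close(fd)
        return None    # someone else holds the token


token = grab_token()
if token is None:
    # A server that cannot acquire (or loses) the token must shut down,
    # which is what prevents a split brain.
    sys.exit("another server holds the token; shutting down")
```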
This is the problem of fixing the transactions impacted by lack of 2PC
Usually just a handful of transactions
They committed, but the backup doesn’t know because it never saw the commit record
Later, the server recovers and we discover the problem
Need to apply the missing ones
Also causes cascaded rollback
Worst case may require human intervention
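One way to picture the repair step, as a hedged sketch with invented names: compare the transactions the recovered primary knows committed against those the backup actually applied, then reapply the difference (cascaded rollback is not shown):

```python
def find_missing(primary_committed, backup_applied):
    # Transactions that committed at the primary but whose commit records
    # never reached the backup before the crash.
    applied = set(backup_applied)
    return [t for t in primary_committed if t not in applied]


# Usually just a handful of transactions are affected.
primary_committed = ["t1", "t2", "t3"]
backup_applied = ["t1", "t2"]
print(find_missing(primary_committed, backup_applied))  # ['t3'] must be reapplied
```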
Reliability can be understood in terms of
Availability: system keeps running during a crash
Recoverability: system can recover automatically
Transactions are best for the latter
Some systems need both sorts of mechanisms, but there are “deep” tradeoffs involved
All is not lost!
Suppose we move away from the transactional model
Can we replicate data at lower cost and with high availability?
Leads to “virtual synchrony” model
Treats data as the “state” of a group of participating processes
Replicated update: done with multicast
First look more closely at 2PC, 3PC, failure detection
2PC and 3PC both “block” in real settings
But we can replace failure detection by consensus on membership
Then these protocols become non-blocking
(although solving a slightly different problem)
Generalized approach leads to ordered atomic multicast in dynamic process groups
Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crash
Needed if we are to build high availability transactional systems (or systems that use quorum replication)
Given a set of processes, one of which wants to initiate an action
Participants may vote for or against the action
Originator will perform the action only if all vote in favor; if any votes against (or don’t vote), we will “abort” the protocol and not take the action
Goal is all-or-nothing outcome
Want to avoid solutions that do nothing
(trivial case of “all or none”)
Would like to say that if all vote for commit, protocol will commit
... but in distributed systems we can’t be sure votes will reach the coordinator!
Any “live” protocol risks making a mistake and counting a live process that voted to commit as a failed process, leading to an abort
Hence, non-triviality condition is hard to capture
Coordinator asks all processes if they can take the action
Processes decide if they can and send back “ok” or “abort”
Coordinator collects all the answers (or times out)
Coordinator computes outcome and sends it back
[Figure: coordinator sends “ok to commit?” to the participants, they answer “ok with us”, and the coordinator sends “commit”. Note: garbage collection protocol not shown here]
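A bare-bones Python sketch of these four steps (all names invented, no real messaging or logging), with a missed vote treated as an abort:

```python
class Participant:
    """Illustrative stand-in: votes 'ok' or 'abort'; a real one could time out."""
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.outcome = None

    def vote(self):
        return "ok" if self.will_commit else "abort"

    def decide(self, outcome):
        self.outcome = outcome


def two_phase_commit(participants):
    # Phase 1: "ok to commit?" -- collect votes, treating a timeout as an abort vote.
    votes = []
    for p in participants:
        try:
            votes.append(p.vote())
        except TimeoutError:
            votes.append("abort")
    # Phase 2: compute the outcome and send it back.
    outcome = "commit" if all(v == "ok" for v in votes) else "abort"
    for p in participants:
        try:
            p.decide(outcome)
        except TimeoutError:
            pass  # a crashed participant must learn the outcome when it recovers
    return outcome


print(two_phase_commit([Participant(), Participant(), Participant()]))  # commit
```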
So far, we have implicitly assumed that processes fail by halting (and hence not voting)
In real systems a process could fail in arbitrary ways, even maliciously
This has led to work on the “Byzantine generals” problem, which is a variation on commit set in a “synchronous” model with malicious failures
Byzantine model is very costly: 3t+1 processes needed to overcome t failures, protocol runs in t+1 rounds
This cost is unacceptable for most real systems, hence protocols are rarely used
Main area of application: hardware fault-tolerance, security systems
For these reasons, we won’t study such protocols
Assume processes fail by halting
Coordinator detects failures (unreliably) using timeouts. It can make mistakes!
Now the challenge is to terminate the protocol if the coordinator fails instead of, or in addition to, a participant!
[Figure: coordinator sends “ok to commit?”; one participant has crashed, the others answer “ok with us”; the coordinator times out and decides “abort!”. Note: garbage collection protocol not shown here]
Coordinator starts the protocol
One participant votes to abort, all others to commit
Coordinator and one participant now fail
[Figure: coordinator sends “ok to commit?”, two participants answer “ok”, then the coordinator and one participant fail; for the survivors both the decision and the failed participant’s vote are unknown]
Problem is that if coordinator told the failed participant to abort, all must abort
If it voted for commit and was told to commit, all must commit
Surviving participants can’t deduce the outcome without knowing how failed participant voted
Thus protocol “blocks” until recovery occurs
Three-phase commit (3PC) seeks to increase availability
It makes an unrealistic assumption that failures are accurately detectable
With this assumption, the protocol can terminate even if a failure does occur
Coordinator starts protocol by sending request
Participants vote to commit or to abort
Coordinator collects votes, decides on outcome
Coordinator can abort immediately
To commit, coordinator first sends a “prepare to commit” message
Participants acknowledge, commit occurs during a final round of “commit” messages
[Figure: coordinator sends “ok to commit?”, participants answer “ok”; coordinator sends “prepare to commit”, participants answer “prepared”; coordinator then sends “commit”. Note: garbage collection protocol not shown here]
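A self-contained sketch of the extra round (hypothetical classes, no real messaging or failure handling): the coordinator sends “prepare to commit” only after a unanimous vote, and commits only after every surviving participant has acknowledged it:

```python
class Participant3PC:
    """Illustrative participant: votes, acknowledges a prepare, records the outcome."""
    def __init__(self, will_commit=True):
        self.will_commit = will_commit
        self.prepared = False
        self.outcome = None

    def vote(self):
        return "ok" if self.will_commit else "abort"

    def prepare(self):
        # Once any process reaches this state, everyone is known to have voted commit.
        self.prepared = True

    def decide(self, outcome):
        self.outcome = outcome


def three_phase_commit(participants):
    # Assumes failures are accurately detectable (the 3PC assumption), so
    # 'participants' is the set of processes currently believed operational.
    votes = [p.vote() for p in participants]
    if not all(v == "ok" for v in votes):
        for p in participants:
            p.decide("abort")      # an abort can be sent immediately
        return "abort"
    for p in participants:         # extra round: "prepare to commit"
        p.prepare()
    for p in participants:         # commit only after all survivors acknowledged prepare
        p.decide("commit")
    return "commit"


print(three_phase_commit([Participant3PC(), Participant3PC(), Participant3PC()]))  # commit
```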
If any process is in “prepare to commit” all voted for commit
Protocol commits only when all surviving processes have acknowledged prepare to commit
After coordinator fails, it is easy to run the protocol forward to commit state (or back to abort state)
If the coordinator suspects a failure, the failure is “real” and the faulty process, if it later recovers, will know it was faulty
Failures are detectable with bounded delay
On recovery, process must go through a reconnection protocol to rejoin the system!
(Find out status of pending protocols that terminated while it was not operational)
With realistic failure detectors (that can make mistakes), protocol still blocks!
Bad case arises during “network partitioning” when the network splits the participating processes into two or more sets of operational processes
Can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks
Most use protocols based on 2PC: 3PC is more costly and ultimately, still subject to blocking!
Need to extend with a form of garbage collection mechanism to avoid accumulation of protocol state information (can solve in the background)
Some systems simply accept the risk of blocking when a failure occurs
Others weaken the consistency property to make progress, at the risk of inconsistency with failed processes
To overcome the cost of replication, we will introduce a dynamic process group model (processes that join and leave while the system is running)
Will also relax our consistency goal: seek only consistency within a set of processes that all remain operational and members of the system
In this model, 3PC is non-blocking!
Yields an extremely cheap replication scheme!
Basic question: how to detect a failure
Wait until the process recovers. If it was dead, it tells you
“I died, but I feel much better now”
Could be a long wait
Use some form of probe
But might make mistakes
Substitute agreement on membership
Now, failure is a “soft” concept
Rather than “up” or “down” we think about whether a process is behaving acceptably in the eyes of peer processes
Applications use replicated data for high availability
3PC-like protocols use membership changes instead of failure notification
Membership Agreement, “join/leave” and “P seems to be unresponsive”
How to “detect” failures
Can use timeout
Or could use other system monitoring tools and interfaces
Sometimes can exploit hardware
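A timeout-based detector in miniature (threshold and names invented); the point is precisely that it can be wrong, since a slow or partitioned process looks the same as a crashed one:

```python
import time


class HeartbeatDetector:
    """Toy timeout-based failure detector (threshold chosen arbitrarily)."""
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, pid):
        self.last_seen[pid] = time.monotonic()

    def suspects(self):
        # Processes silent longer than the timeout are only *suspected* to
        # have failed -- the detector can make mistakes.
        now = time.monotonic()
        return [pid for pid, t in self.last_seen.items() if now - t > self.timeout]


d = HeartbeatDetector(timeout=0.1)
d.heartbeat("p1")
time.sleep(0.2)
print(d.suspects())  # ['p1'] -- suspected, though p1 may merely be slow
```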
Tracking membership
Basically, need a new replicated service
System membership “lists” are the data it manages
We’ll say it takes join/leave requests as input and produces “views” as output
[Figure: application processes A, B, C, D send join/leave requests (and “A seems to have failed”) to the GMS processes X, Y, Z, which output the membership views {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}]
Group membership service (GMS) has just a small number of members
This core set will track membership for a large number of system processes
Internally it runs a group membership protocol (GMP)
Full system membership list is just replicated data managed by GMS members, updated using multicast
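A toy model of the data the GMS manages, with invented names: each join/leave produces a new numbered view, much like the sequence of views in the figure above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    view_id: int
    members: frozenset


class GMS:
    """Toy group membership service: join/leave requests in, views out."""
    def __init__(self):
        self.current = View(0, frozenset())
        self.history = [self.current]

    def _install(self, members):
        self.current = View(self.current.view_id + 1, frozenset(members))
        self.history.append(self.current)
        return self.current

    def join(self, pid):
        return self._install(self.current.members | {pid})

    def leave(self, pid):
        # Also used when pid "seems to have failed".
        return self._install(self.current.members - {pid})


gms = GMS()
for op, pid in [("join", "A"), ("join", "B"), ("join", "D"),
                ("leave", "B"), ("join", "C"), ("leave", "A")]:
    view = getattr(gms, op)(pid)
    print(view.view_id, sorted(view.members))
```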
What protocol should we use to track the membership of the GMS?
Must avoid split-brain problem
Desire continuous availability
We’ll see that a version of 3PC can be used
But can’t “always” guarantee liveness
Read chapters 12, 13
Thought problem: how important is external consistency (called dynamic uniformity in the text)?
Homework: Read about FLP. Identify other “impossibility results” for distributed systems.
What is the simplest case of an impossibility result that you can identify?