Reliable Distributed Systems: Fault Tolerance (Recoverability and High Availability)


Reliability and transactions

Transactions are well matched to database model and recoverability goals

Transactions don’t work well for non-database applications (general purpose O/S applications) or availability goals (systems that must keep running if applications fail)

When building high availability systems, we encounter the replication issue

Types of reliability

Recoverability

Server can restart without intervention in a sensible state

Transactions do give us this

High availability

System remains operational during failure

Challenge is to replicate critical data needed for continued operation

Replicating a transactional server

Two broad approaches

Just use distributed transactions to update multiple copies of each replicated data item

We already know how to do this, with 2PC

Each server has “equal status”

Somehow treat replication as a special situation

Leads to a primary server approach with a “warm standby”

Replication with 2PC

Our goal will be “1-copy serializability”

Defined to mean that the multi-copy system behaves indistinguishably from a single-copy system

Considerable formal and theoretical work has been done on this

As a practical matter

Replicate each data item

Transaction manager

Reads any single copy

Updates all copies
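A minimal sketch of this read-one/write-all rule, assuming hypothetical Replica objects with read and write methods (the names are illustrative only, not from the text):

# Sketch: transaction manager for one replicated item (illustrative names only).
class ReadOneWriteAllTM:
    def __init__(self, replicas):
        self.replicas = replicas              # all copies of the data item

    def read(self, key):
        return self.replicas[0].read(key)     # any single copy will do

    def write(self, key, value):
        for r in self.replicas:               # every copy must be updated
            r.write(key, value)
        # Committing the update across the copies would use 2PC (not shown).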

Observation

Notice that transaction manager must know where the copies reside

In fact there are two models

Static replication set: basically, the set is fixed, although some members may be down

Dynamic: the set changes while the system runs, but only has operational members listed within it

Today stick to the static case

Replication and Availability

A series of potential issues

How can we update an object during periods when one of its replicas may be inaccessible?

How can the 2PC protocol be made fault-tolerant?

A topic we’ll study in more depth

But the bottom line is: we can’t!

Usual responses?

Quorum methods:

Each replicated object has an update and a read quorum

Designed so that Qu + Qr > # replicas and Qu + Qu > # replicas

Idea is that any read or update will overlap with the last update

Quorum example

X is replicated at {a,b,c,d,e}

Possible values?

Qu = 1, Qr = 5 (violates Qu + Qu > 5)

Qu = 2, Qr = 4 (same issue)

Qu = 3, Qr = 3

Qu = 4, Qr = 2

Qu = 5, Qr = 1 (violates availability)

Probably prefer Qu = 4, Qr = 2
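A small sketch that checks candidate quorum sizes against the two overlap rules; quorum_ok is a hypothetical helper written for this example, not part of any described system:

# Check the quorum overlap rules for N replicas (here N = 5, as in the example).
def quorum_ok(q_update, q_read, n_replicas):
    # Two updates must overlap, and a read must overlap the last update.
    return (q_update + q_update > n_replicas and
            q_update + q_read > n_replicas)

N = 5
for qu in range(1, N + 1):
    qr = N + 1 - qu                           # smallest read quorum that overlaps
    print(qu, qr, quorum_ok(qu, qr, N))       # only Qu >= 3 satisfies both rules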

Things to notice

Even reading a data item requires that multiple copies be accessed!

This could be much slower than normal local access performance

Also, notice that we won’t know if we succeeded in reaching the update quorum until we get responses

Implies that any quorum replication scheme needs a 2PC protocol to commit

Next issue?

Now we know that we can solve the availability problem for reads and updates if we have enough copies

What about for 2PC?

Need to tolerate crashes before or during runs of the protocol

A well-known problem

Availability of 2PC

It is easy to see that 2PC is not able to guarantee availability

Suppose that manager talks to 3 processes

And suppose 1 process and manager fail

The other 2 are “stuck” and can’t terminate the protocol

What can be done?

We’ll revisit this issue soon

Basically,

Can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them

But network problems can be indistinguishable from failures

Hence there is no commit protocol that can tolerate failures

Anyhow, cost of 3PC is very high

A quandary?

We set out to replicate data for increased availability

And concluded that

Quorum scheme works for updates

But commit is required

And represents a vulnerability

Other options?

Other options

We mentioned primary-backup schemes

These are a second way to solve the problem

Based on the log at the data manager

Server replication

Suppose the primary sends the log to the backup server

It replays the log and applies committed transactions to its replicated state

If primary crashes, the backup soon catches up and can take over
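A minimal sketch of the backup side of this log-shipping scheme, assuming hypothetical log records of the form (txid, key, value, committed); the shipping transport and failover logic are not shown:

# Backup replays the primary's shipped log, applying only committed updates.
class Backup:
    def __init__(self):
        self.state = {}                 # replicated state
        self.applied = set()            # transaction ids already applied

    def replay(self, log_records):
        for txid, key, value, committed in log_records:
            if committed and txid not in self.applied:
                self.state[key] = value
                self.applied.add(txid)

    def take_over(self):
        # May lag the primary: the last few committed transactions might
        # never have reached the shipped log before the crash.
        return self.state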

Primary/backup


Clients initially connected to primary, which keeps backup up to date. Backup tracks log

Primary/backup


Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!


Primary/backup


Clients detect the failure and reconnect to backup. But some clients may have “gone away”. Backup state could be slightly stale.

New transactions might suffer from this

Issues?

Under what conditions should the backup take over?

Revisits the consistency problem seen earlier with clients and servers

Could end up with a “split brain”

Also notice that this still needs 2PC to ensure that primary and backup stay in the same state!

Split brain: reminder


Clients initially connected to primary, which keeps backup up to date. Backup follows log

Split brain: reminder


Transient problem causes some links to break but not all.

Backup thinks it is now primary, primary thinks backup is down

Split brain: reminder


Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both

Implication?

A strict interpretation of ACID leads to the conclusion that

There are no ACID replication schemes that provide high availability

Most real systems solve this by weakening ACID

Real systems

They use primary-backup with logging

But they simply omit the 2PC

Server might take over in the wrong state

(may lag state of primary)

Can use hardware to reduce or eliminate split brain problem

How does hardware help?

Idea is that primary and backup share a disk

Hardware is configured so only one can write the disk

If server takes over it grabs the “token”

Token loss causes primary to shut down (if it hasn’t actually crashed)
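A software analogy of the token idea, sketched with an exclusive lock on a hypothetical token file standing in for the hardware-arbitrated shared disk (real systems rely on the disk hardware itself, which this does not model); Unix-only:

# Whoever holds the exclusive lock on the token file may act as primary.
import fcntl

def try_acquire_token(path="/shared/disk.token"):      # hypothetical shared path
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)   # non-blocking grab
        return f            # keep the handle: closing it releases the token
    except OSError:
        f.close()
        return None         # someone else holds the token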

Reconciliation

This is the problem of fixing the transactions impacted by lack of 2PC

Usually just a handful of transactions

They committed but backup doesn’t know because never saw commit record

Later, the server recovers and we discover the problem

Need to apply the missing ones

Also causes cascaded rollback

Worst case may require human intervention

Summary

Reliability can be understood in terms of

Availability: system keeps running during a crash

Recoverability: system can recover automatically

Transactions are best for the latter

Some systems need both sorts of mechanisms, but there are “deep” tradeoffs involved

Replication and High Availability

All is not lost!

Suppose we move away from the transactional model

Can we replicate data at lower cost and with high availability?

Leads to “virtual synchrony” model

Treats data as the “state” of a group of participating processes

Replicated update: done with multicast

Steps to a solution

First look more closely at 2PC, 3PC, failure detection

2PC and 3PC both “block” in real settings

But we can replace failure detection by consensus on membership

Then these protocols become non-blocking

(although solving a slightly different problem)

Generalized approach leads to ordered atomic multicast in dynamic process groups

Non-blocking Commit

Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crash

Needed if we are to build high availability transactional systems (or systems that use quorum replication)

Definition of problem

Given a set of processes, one of which wants to initiate an action

Participants may vote for or against the action

Originator will perform the action only if all vote in favor; if any vote against (or don’t vote), we will “abort” the protocol and not take the action

Goal is all-or-nothing outcome

Non-triviality

Want to avoid solutions that do nothing

(trivial case of “all or none”)

Would like to say that if all vote for commit, protocol will commit

... but in distributed systems we can’t be sure votes will reach the coordinator!

Any “live” protocol risks making a mistake and counting a live process that voted to commit as a failed process, leading to an abort

Hence, non-triviality condition is hard to capture

Typical protocol

Coordinator asks all processes if they can take the action

Processes decide if they can and send back “ok” or “abort”

Coordinator collects all the answers (or times out)

Coordinator computes outcome and sends it back
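A toy version of this round, assuming hypothetical vote(action) and deliver(outcome) calls on each participant; the timeout handling is crude and the garbage-collection phase is omitted:

# Toy 2PC coordinator: collect votes, compute the outcome, send it back.
def two_phase_commit(participants, action):
    votes = []
    for p in participants:
        try:
            votes.append(p.vote(action))     # "ok" or "abort"
        except TimeoutError:
            votes.append("abort")            # a missing vote counts as "abort"
    outcome = "commit" if all(v == "ok" for v in votes) else "abort"
    for p in participants:
        p.deliver(outcome)                   # this is where 2PC can block
    return outcome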

Commit protocol illustrated

[Figure, shown in three steps: the coordinator asks “ok to commit?”, the participants reply “ok with us”, and the coordinator then sends “commit”. Garbage collection protocol not shown here.]

Failure issues

So far, have implicitly assumed that processes fail by halting (and hence not voting)

In real systems a process could fail in arbitrary ways, even maliciously

This has led to work on the “Byzantine generals” problem, which is a variation on commit set in a “synchronous” model with malicious failures

Failure model impacts costs!

Byzantine model is very costly: 3t+1 processes needed to overcome t failures, protocol runs in t+1 rounds

This cost is unacceptable for most real systems, hence protocols are rarely used

Main area of application: hardware fault-tolerance, security systems

For these reasons, we won’t study such protocols

Commit with simpler failure model

Assume processes fail by halting

Coordinator detects failures (unreliably) using timeouts. It can make mistakes!

Now the challenge is to terminate the protocol if the coordinator fails instead of, or in addition to, a participant!

Commit protocol illustrated

[Figure: the coordinator asks “ok to commit?”; one participant has crashed, so the coordinator times out and sends “abort” even though the surviving participants answered “ok with us”. Garbage collection protocol not shown here.]

Example of a hard scenario

Coordinator starts the protocol

One participant votes to abort, all others to commit

Coordinator and one participant now fail

...

we now lack the information to correctly terminate the protocol!

Commit protocol illustrated

[Figure: the coordinator asks “ok to commit?” and then fails; two participants answered “ok”, but the failed participant’s vote is unknown and the coordinator’s decision is unknown.]

Example of a hard scenario

Problem is that if coordinator told the failed participant to abort, all must abort

If it voted for commit and was told to commit, all must commit

Surviving participants can’t deduce the outcome without knowing how failed participant voted

Thus protocol “blocks” until recovery occurs

Skeen: Three-phase commit

Seeks to increase availability

Makes an unrealistic assumption that failures are accurately detectable

With this, can terminate the protocol even if a failure does occur

Skeen: Three-phase commit

Coordinator starts protocol by sending request

Participants vote to commit or to abort

Coordinator collects votes, decides on outcome

Coordinator can abort immediately

To commit, coordinator first sends a “prepare to commit” message

Participants acknowledge, commit occurs during a final round of “commit” messages
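A toy outline of these rounds, again with hypothetical vote(), prepare(), and deliver() calls on each participant; it assumes the accurate failure detection discussed below and omits the recovery paths entirely:

# Toy 3PC coordinator: vote, then "prepare to commit", then commit.
def three_phase_commit(participants, action):
    # Phase 1: collect votes; any "abort" ends the protocol immediately.
    if not all(p.vote(action) == "ok" for p in participants):
        for p in participants:
            p.deliver("abort")
        return "abort"
    # Phase 2: every surviving participant learns that commit is intended.
    for p in participants:
        p.prepare()                          # participant acks "prepare to commit"
    # Phase 3: once every ack is in, send the final "commit" round.
    for p in participants:
        p.deliver("commit")
    return "commit"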

Three phase commit protocol illustrated

[Figure: the coordinator asks “ok to commit?”, participants answer “ok”, the coordinator sends “prepare to commit”, participants answer “prepared”, and the coordinator then sends “commit”. Garbage collection protocol not shown here.]

Observations about 3PC

If any process is in “prepare to commit” all voted for commit

Protocol commits only when all surviving processes have acknowledged prepare to commit

After coordinator fails, it is easy to run the protocol forward to commit state (or back to abort state)

Assumptions about failures

If the coordinator suspects a failure, the failure is “real” and the faulty process, if it later recovers, will know it was faulty

Failures are detectable with bounded delay

On recovery, process must go through a reconnection protocol to rejoin the system!

(Find out status of pending protocols that terminated while it was not operational)

Problems with 3PC

With realistic failure detectors (that can make mistakes), protocol still blocks!

Bad case arises during “network partitioning” when the network splits the participating processes into two or more sets of operational processes

Can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks

Situation in practical systems?

Most use protocols based on 2PC: 3PC is more costly and ultimately, still subject to blocking!

Need to extend with a form of garbage collection mechanism to avoid accumulation of protocol state information (can solve in the background)

Some systems simply accept the risk of blocking when a failure occurs

Others reduce the consistency property to make progress, at the risk of inconsistency with failed processes

Process groups

To overcome the cost of replication we will introduce the dynamic process group model (processes that join and leave while the system is running)

Will also relax our consistency goal: seek only consistency within a set of processes that all remain operational and members of the system

In this model, 3PC is non-blocking!

Yields an extremely cheap replication scheme!

Failure detection

Basic question: how to detect a failure

Wait until the process recovers. If it was dead, it tells you

I died, but I feel much better now

Could be a long wait

Use some form of probe

But might make mistakes

Substitute agreement on membership

Now, failure is a “soft” concept

Rather than “up” or “down” we think about whether a process is behaving acceptably in the eyes of peer processes

Architecture

Applications use replicated data for high availability

3PC-like protocols use membership changes instead of failure notification

Membership Agreement, “join/leave” and “P seems to be unresponsive”

Issues?

How to “detect” failures

Can use timeout

Or could use other system monitoring tools and interfaces

Sometimes can exploit hardware
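A minimal timeout-based detector sketch; the heartbeats could come from probes, monitoring tools, or hardware, and its verdicts can be wrong, which is exactly why membership agreement is substituted for raw detection:

# Unreliable failure detector: "suspected" is not the same as "crashed".
import time

class TimeoutDetector:
    def __init__(self, timeout_s=2.0):
        self.timeout_s = timeout_s
        self.last_heard = {}                 # process id -> time of last heartbeat

    def heard_from(self, pid):
        self.last_heard[pid] = time.monotonic()

    def suspects(self, pid):
        last = self.last_heard.get(pid)
        # A slow link or a paused process looks exactly like a crash here.
        return last is None or time.monotonic() - last > self.timeout_s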

Tracking membership

Basically, need a new replicated service

System membership “lists” are the data it manages

We’ll say it takes join/leave requests as input and produces “views” as output

Architecture

[Figure: application processes (A, B, C, D) send join and leave requests to the GMS processes (X, Y, Z), which output the sequence of membership views {A}, {A,B,D}, {A,D}, {A,D,C}, {D,C}; the final view change occurs because A seems to have failed.]
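A toy sketch of the service’s input/output behavior, treating the GMS as a single object that turns join/leave/failure events into numbered views; in the real service this membership list is itself replicated among the GMS members:

# Toy GMS: join/leave events in, numbered membership views out.
class GMS:
    def __init__(self):
        self.members = []
        self.view_id = 0

    def _new_view(self):
        self.view_id += 1
        return (self.view_id, list(self.members))

    def join(self, pid):
        self.members.append(pid)
        return self._new_view()

    def leave(self, pid):            # voluntary leave or "seems to have failed"
        self.members.remove(pid)
        return self._new_view()

# Successive calls yield a sequence of views like the one in the figure above.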

Issues

Group membership service (GMS) has just a small number of members

This core set tracks membership for a large number of system processes

Internally it runs a group membership protocol (GMP)

Full system membership list is just replicated data managed by GMS members, updated using multicast

GMP design

What protocol should we use to track the membership of the GMS?

Must avoid split-brain problem

Desire continuous availability

We’ll see that a version of 3PC can be used

But can’t “always” guarantee liveness

Reading ahead?

Read chapters 12, 13

Thought problem: how important is external consistency (called dynamic uniformity in the text)?

Homework: Read about FLP. Identify other “impossibility results” for distributed systems.

What is the simplest case of an impossibility result that you can identify?
