A SHORT DETOUR: THE THEORY OF CONSENSUS
Cornell University
Ken Birman

Consensus Problem
Systems like Isis2 are solving a hard problem
We will briefly look at the associated theory

State machine replication and consensus
The state machine idea is
 Very elegant and simple – obvious in many ways
 Hopefully everyone understands it!
But notice that “a group of N processes” assumes agreement on who is in the group!
 Without agreement on membership, different processes might have different views of group membership
 In this case some members might miss some updates!

Consensus Problem
When Isis2 is running, it automatically adjusts group membership to handle failures
It also places concurrent multicasts into a fixed order
In doing this it creates a powerful tool for solving the so-called “consensus” or “agreement” problem

Consensus Problem
We imagine a set of processes, already running
Each process has some input from a sensor device
 For simplicity, assume input values {0,1}
Our goal: agree on a single value such that
 All group members use the same value
 The value was one of the actual inputs (e.g. if all propose 1, we must use 1, if all propose 0, we use 0)

Consensus problem
Without failures consensus is an easy task!
 With Isis2: The process with rank 0 simply sends its value to every other process: “The answer is 1”
 Without Isis2: Every process broadcasts its value to every other process.
  All collect N values, take the majority vote
  If N is even, break ties in some standard way
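The failure-free "broadcast and vote" approach can be sketched in a few lines. This is a minimal Python simulation, not Isis2 code: the function names are hypothetical, and the tie-break rule (pick the smaller value) is just one assumed "standard way".

```python
from collections import Counter

def majority_decide(inputs):
    """Pick the majority value among the N collected inputs.

    Ties (possible when N is even) are broken by choosing the smaller
    value: an arbitrary but fixed rule that every process applies alike.
    """
    counts = Counter(inputs)
    best = max(counts.values())
    return min(value for value, count in counts.items() if count == best)

def run_broadcast_consensus(inputs):
    # With no failures, all-to-all broadcast leaves every process holding
    # the same N values, so every process computes the same decision.
    return [majority_decide(list(inputs)) for _ in inputs]

decisions = run_broadcast_consensus([1, 0, 1, 1])
```

Because the rule is deterministic and every process sees the same inputs, agreement follows directly, and the decided value is always one of the actual inputs.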
Consensus with failures
Must ask: “What sort of failures”?
 Simple case: “Crash” failures (process stops)
 Hardest case: “Byzantine” failures (process may lie, cheat, run the protocol incorrectly)
Both have been studied extensively but we will focus on the crash failure case in our lectures
 Assume a very simple version: there are either NO failures, or exactly ONE failure
 In words “Consensus even if one member crashes”

Consensus with [at most] one crash
With Isis2 this problem is easy
 Have the rank-0 process multicast its input
 If a new view becomes defined “instead” of the decision being reported, have the new rank-0 member multicast its input
 Clearly this solves consensus with zero or one crashes
With Isis2 it is easy to solve consensus
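The rank-0 rule can be illustrated with a small Python simulation. It assumes, as the slide does, that membership views are already agreed on by everyone; the function and member names are hypothetical and none of this is the Isis2 API.

```python
def decide_with_views(inputs, crashed=None):
    """Simulate the rank-0 rule with at most one crash.

    inputs:  dict mapping member name -> proposed value, in rank order.
    crashed: name of the single member that fails before any decision
             is reported, or None for the failure-free case.
    Returns a dict mapping each surviving member to the decided value.
    """
    # A crash installs a new view that drops the failed member and
    # renumbers the ranks; with no crash the first view is used as is.
    view = [member for member in inputs if member != crashed]
    leader = view[0]              # rank 0 in the current view
    decision = inputs[leader]     # the value the leader multicasts
    return {member: decision for member in view}

proposals = {"p": 0, "q": 1, "r": 1}
no_crash = decide_with_views(proposals)
leader_crash = decide_with_views(proposals, crashed="p")
```

In both runs every surviving member ends up with the same value, and that value is the input of whichever member holds rank 0 in the final view.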
Consensus without Isis2
In a famous theory paper, Fischer, Lynch and Paterson formalized the fault-tolerant consensus problem and studied it
 They showed that the problem cannot be solved!
 Any correct solution must be at risk of endless delays caused by mistaken detections of failures!
But if this is true, how can Isis2 solve consensus?

How Isis2 solves consensus
First we build a subsystem (the Oracle) to track membership in the system, and in process groups
 To join, or handle failure, send a message to the Oracle
 It will report the event to the whole system
The system has an “easier task” because every component sees the identical membership events
But the Oracle itself solves consensus!

FLP says it is impossible to solve consensus...
... but by “impossible” FLP means “not always feasible”
In effect: There are situations in which progress cannot be guaranteed
Indeed, Isis2 is not always able to make progress
 It could hang, unable to advance
 But this is impractical to trigger in real life because the situations are so complex to create

Is Isis2 correct?
Isis2 has an unavoidable problem
 Certain rare failure sequences can confuse the system
 It will be unable to form new group membership views and for this reason, won’t deliver extra multicasts
 Thus in fact consensus will not be “solved”
But these sequences are very rare
 They require that specific messages be delayed, at specific steps in the protocols, and then released after specific amounts of time
 It is “not possible” to force such an exception to occur in practice – instead, you are more likely to cause Isis2 to crash

This limitation is shared
You might read about other systems
 Paxos, a protocol by Lamport (Isis2 SafeSend is a version of Paxos)
 Zookeeper, used at Yahoo!
 Chubby, used at Google...
... all of them share this same issue that we see with Isis2! None is able to “always solve consensus”
 FLP always applies
 But because the problem is so unlikely, we accept it as a kind of inevitable limitation

The CAP Theorem: Brewer
You can have just 2 of {Consistency, Availability and Partition or Fault Tolerance}

CAP theorem
A second important cloud computing question, posed by Berkeley Professor Eric Brewer
He considered the many attempts to make cloud computing systems behave like big databases
 Databases use the ACID model: Atomicity, Consistency, Isolation, Durability (also called Serializability)
 Brewer’s company, Inktomi, found that ACID was slow!

CAP definitions
C stands for Consistency
A stands for Availability: the system always responds to every request without long delays
P stands for partitionability: even if some nodes crash or are inaccessible due to network failure, the system still remains available

CAP theorem
In any wide area cloud computing service that has two or more “places” where services are offered, there will be a tradeoff
 You can have at most 2 of C, A and P
 For example, to guarantee A+P, we lose Consistency!
How does CAP apply to systems like Isis2?
 Answer is complicated

CAP and Isis2
Isis2 is normally used within a cluster or a single data center, but in fact it CAN be configured in a WAN setting if there are no firewalls
So the scenario for CAP arises
If it arises, Isis2 limits availability during a partitioning fault: it offers C+A but limits P.
Isis2 has a strong consistency model: a new form of virtual synchrony.

Virtual synchrony is a “consistency” model:
Membership epochs: begin when a new configuration is installed and reported by delivery of a new “view” and associated state
Protocols run “during” a single epoch: rather than overcome failure, we reconfigure when a failure occurs
THIS IS WHERE CAP APPLIES!
[Figure: three timelines for processes p, q, r, s, t, each with a time axis from 0 to 70: a non-replicated reference execution (A=3, B=7, B = B-A, A=A+1), a synchronous execution, and a virtually synchronous execution]

The “optimistic early delivery” issue
In the previous lecture we mentioned that in Isis2 SafeSend is always safe, but that protocols such as OrderedSend, Send, etc are “optimistic”
This relates to CAP
 SafeSend will never deliver a message that could later be lost, as in our example in the picture
 But it is slower
The picture illustrates an optimistic early delivery

Optimistic early delivery
Allows messages to be delivered before it is certain that all group members will do the delivery
 Failures can disrupt such a delivery
 In that case, perhaps the message is “lost”
Yet optimistic early delivery is much faster!

What about other group systems
We mentioned that Isis2 isn’t the only cloud computing solution based on process groups
 Others we mentioned: Spread, JGroups, Paxos
All of them encounter the same issues with FLP+CAP
 These systems all provide strong guarantees
 They can solve consensus, so FLP applies
 The guarantees involve costs. CAP is ultimately about those costs

Isis2: Send vs. in-memory SafeSend
Send scales best, but SafeSend with in-memory (rather than disk) logging and small numbers of acceptors isn’t terrible.

g.Flush
When using optimistic early delivery, the application should call g.Flush() before talking to external users or external databases
This protocol delays until any optimistic early delivery multicasts have been completed
g.OrderedSend+g.Flush ≈ g.SafeSend!

Version 2: Send+Flush (Optimistic delivery virtually synchronous FIFO multicast with a delay for stability)
[Figure: execution timeline for an individual first-tier replica in a soft-state first-tier service with members A, B, C, D. A request (“Update the monitoring and alarms criteria for Mrs. Marsh as follows…”) triggers Send multicasts followed by a flush, and is then Confirmed. The local response delay is marked; the response delay seen by the end-user would also include Internet latencies.]
In this example we use g.Send + g.Flush instead of g.SafeSend

SafeSend versus Send: looks “similar”
[Figure: two timelines for a soft-state first-tier service with members A, B, C, D. One uses SafeSend and the other uses Send followed by a flush; both end with the request being Confirmed.]
Bottom line: Which is fastest?

Flush delay as function of shard size
Flush is fairly fast if we only wait for acks from 3-5 members, but slow if we wait for all members.
Isis2 lets the developer set the threshold.
Cornell (Birman): No distribution restrictions.

What does Flush do?
When an optimistic multicast is running, we track how many deliveries have occurred
g.Flush(k) delays until any pending multicasts have been delivered to at least k members
 For example g.Flush(2) means at least 2 members have copies of the multicast
 Unless both fail, simultaneously, right now, the multicast will be fully delivered.
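The bookkeeping behind a k-member flush can be pictured as follows. This is a Python sketch under stated assumptions, not the Isis2 implementation: the class and method names are hypothetical, and a real flush would block the caller, whereas this version only reports whether the flush condition currently holds.

```python
class OptimisticSender:
    """Tracks which members have delivered each pending multicast."""

    def __init__(self):
        self.delivered_to = {}      # message id -> set of member names

    def send(self, msg_id):
        # Optimistic send: the message is pending with no known deliveries.
        self.delivered_to[msg_id] = set()

    def on_delivered(self, msg_id, member):
        # Called when we learn that `member` has a copy of the message.
        self.delivered_to[msg_id].add(member)

    def flush_ready(self, k):
        """True once every pending multicast has reached at least k members."""
        return all(len(members) >= k for members in self.delivered_to.values())

sender = OptimisticSender()
sender.send("m1")
sender.send("m2")
sender.on_delivered("m1", "A")
sender.on_delivered("m1", "B")
sender.on_delivered("m2", "A")
before = sender.flush_ready(2)      # m2 has reached only one member so far
sender.on_delivered("m2", "C")
after = sender.flush_ready(2)
```

A smaller k is satisfied sooner, which matches the observation that waiting for a few acknowledgements is fast while waiting for all members is slow.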
Summary
To build a simple replication solution, use g.SafeSend
But to get the best performance, it is better to use g.Send+g.Flush, with a small Flush reliability level
 As a smart designer, you won’t make things complex: start by using g.SafeSend
 But when tuning performance, consider shifting to g.Send+g.Flush, if your application can be correct with the weaker optimistic guarantees of g.Send!

Multicast ordering options
How “much” order do we need?

Other “flavors” of multicast
g.SafeSend() – Paxos
g.OrderedSend() – Like SafeSend but for in-memory data. “Optimistic delivery”
g.CausalSend() – Rarely used, implements causal ordering. If x → y then delivers x before y
g.Send() – FIFO: if some process sends x, then y, delivers x before y
g.RawSend() – like UDP multicast. Very weak guarantees.

How we think about ordering
Always start by visualizing a step by step total ordering
 As if every program only used g.SafeSend
 Even if multicasts X and Y are sent concurrently, either X will be delivered before Y everywhere, or Y before X
 Every member sees the identical ordering
But SafeSend is slow. So... let’s try and use a weaker form of ordering

Using g.OrderedSend+g.Flush
This is just like SafeSend but it employs optimistic early delivery
 Then the g.Flush pauses until the optimism is finished
A relatively simple and safe change for most cases
Speeds things up a lot anyhow!

Using g.Send+g.Flush
This is a similar idea but gives even faster code
g.Send only promises total ordering for messages sent by the same sender
 Just like TCP: a “FIFO” property
 If member P sends X, then Y, all receive X before Y
 But if X and Y are sent concurrently by different members, order may vary!
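The FIFO property can be stated as a check on what a single receiver saw. The sketch below is plain Python with hypothetical names, not Isis2 code; it assumes no messages are lost, so each receiver must see every sender's messages in exactly the order they were sent.

```python
def respects_fifo(sent, delivered):
    """Check one receiver's delivery sequence against per-sender FIFO order.

    sent:      dict mapping sender -> list of messages in send order.
    delivered: list of (sender, message) pairs in delivery order.
    """
    for sender, messages in sent.items():
        seen = [msg for who, msg in delivered if who == sender]
        if seen != messages:
            return False
    return True

sent = {"P": ["X", "Y"], "Q": ["Z"]}
# Two receivers interleave P's and Q's messages differently.
receiver_1 = [("P", "X"), ("Q", "Z"), ("P", "Y")]
receiver_2 = [("Q", "Z"), ("P", "X"), ("P", "Y")]
# This sequence reorders P's own messages, which FIFO forbids.
reordered = [("P", "Y"), ("P", "X"), ("Q", "Z")]
```

Both receivers satisfy FIFO even though their overall sequences differ, which is exactly the freedom that a total ordering removes.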
When is g.Send “good enough”?
We can use g.Send if all the updates to a given variable are sent by just a single member in the view
For example, suppose our group maintains object O and only member P ever updates O
 Then all the updates will be sent in some order by P
 ... and applied in the order they were sent
 ... and so O will remain consistent!

What if our group has many objects?
As long as each object is updated by just one member – the “owner” for that object
 We have many concurrent updates to distinct objects
 But each object receives updates in the exact order the owner sent them
 And if a failure occurs, the multicast will still reach every member
So the objects remain in consistent states! g.Send is all we need
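The single-owner argument can be checked with a toy replica. This Python sketch uses hypothetical object names and models each update as an overwrite; it is an illustration of the reasoning, not Isis2 code.

```python
def apply_updates(deliveries):
    """Apply (object_name, new_value) updates in delivery order."""
    state = {}
    for name, value in deliveries:
        state[name] = value     # overwrite, so per-object order matters
    return state

# Owner P updates O1 (1 then 2); owner Q updates O2 (10 then 20).
# The replicas interleave the two owners differently, but each owner's
# updates stay in FIFO order.
replica_a = apply_updates([("O1", 1), ("O2", 10), ("O1", 2), ("O2", 20)])
replica_b = apply_updates([("O2", 10), ("O2", 20), ("O1", 1), ("O1", 2)])

# Counterexample: two members update the SAME object concurrently,
# and the replicas see the two updates in different orders.
shared_a = apply_updates([("O", 1), ("O", 2)])
shared_b = apply_updates([("O", 2), ("O", 1)])
```

With one owner per object the replicas converge; with two concurrent writers to one object they can diverge, which is when a stronger ordering is needed.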
g.CausalSend and g.RawSend
g.CausalSend has the property that if process P receives message X (from anyone) and then sends message Y, message X will be delivered before Y
 This is a very interesting concept and can be useful, but because we have limited time, we will not look closely
 Book discusses it in more detail
g.RawSend doesn’t guarantee reliability
 If X is sent before Y, by process P, then X is delivered before Y or X is lost and not delivered at all
 RawSend is like the Internet IP multicast protocol

Point to point protocols
We have only discussed multicast protocols
In Isis2 you can also send point-to-point messages
 g.P2PSend(dest, REQUEST, args...);
 g.P2PQuery(dest, REQUEST, args... EOL, results...);
The “dest” argument would be an Isis.Address for some member. Access as v.member[i] in View v.

g.RawP2PSend
With RawSend, you can send an unreliable multicast
With RawP2PSend, you can send an unreliable datagram
 Useful for sending “unimportant” information or data that needs to be very timely
 If lost, no effort is made to recover it

Subset multicast
In Isis2 we usually multicast to the entire group all at once, but there are cases in which data is sharded
 This means we have a large group but it holds data at subsets of its members
 For example, a group with 100 members could hold 50 shards of size 2
 Typically arises for data that has some form of “key”
 (key,value) pairs are common in distributed systems
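One way to picture the 100-member, 50-shard example is to hash each key to a shard and take a contiguous slice of the membership list as that shard's members. This is an assumed mapping for illustration only: the slides do not say how Isis2 assigns shards, and the key and member names below are made up.

```python
import hashlib

def shard_of(key, num_shards):
    # A stable hash, so every member maps a given key to the same shard.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def shard_members(members, shard, shard_size):
    # Shard i owns the i-th consecutive block of the member list.
    start = shard * shard_size
    return members[start:start + shard_size]

members = [f"member{i}" for i in range(100)]
shard = shard_of("patient:marsh", 50)
dests = shard_members(members, shard, 2)
```

The resulting dests list is the kind of destination list a subset multicast would be addressed to.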
Multicasting to a list of members
The basic idea is simple
 g.Send and g.OrderedSend permit you to provide a list of destinations, as do g.Query and g.OrderedQuery
 But ONLY these primitives support lists
You call g.Send(List<Address> dests, REQUEST, ...);

How can the members learn dests?
It often is important to know who else received the same multicast (e.g. when doing a task in parallel)
It would be awkward to do this:
 g.Send(dests, REQUEST, dests, args...)
.... so
 g.Send(REQUEST, dests, args...) is allowed
Now dests is passed to your handler as an argument!

The question of Durability
SafeSend with and without DiskLogger

Isis2 has two notions of durability
Earlier we saw that a group can be checkpointed
You need to enable the logging feature
 g.Persistent(gname)
 gname will be the name of a file used to log the group state for this particular member
The system will periodically put a checkpoint into the file. You can control the frequency. On restart, a checkpointed group will reload the saved state.

Durability with SafeSend
SafeSend (Paxos protocol) has a second concept of durability
 This protocol “remembers” the multicasts it delivers
 In this way we can safely update an external database or file such that after a crash, every update is applied exactly once
But using this feature requires understanding durability concepts

Durability... inside... and external
A SafeSend is durable if the message will not be lost until it is safe to garbage collect it
An update to an external database is durable if every replica of the database has the update in it
Our challenge: use SafeSend to do externally durable database updates

A replicated database with SafeSend
[Figure: execution timeline for an individual first-tier replica in a soft-state first-tier service with members A, B, C, D and db replicas. A request (“Update the monitoring and alarms criteria for Mrs. Marsh as follows…”) is handled with SafeSend multicasts and then Confirmed. The local response delay is marked; the response delay seen by the end-user would also include Internet latencies.]
We will see that SafeSend is the safest Isis2 option
But it is also the most expensive

Failure concern
[Figure: the same SafeSend timeline for the soft-state first-tier service with members A, B, C, D and db replicas.]
What if copy C has failed when updates are applied?

SafeSend is durable...
SafeSend will remember updates, but even so, copy C has missed them because it was not running when they were done!
 What should we do when copy C recovers?
There are two main cases
 Case 1: In-memory data
 Case 2: Data stored externally on a disk

Case 1: In-memory data
For this case we can use g.SafeSend or even g.OrderedSend+g.Flush
 When the copy restarts we’ll do a state transfer to it
 C restarts with no memory of previous state so this is necessary and sufficient
 It works well even for large amounts of state (we may need to use the new Isis2 out-of-band data transfer mechanism if the data is very large)

Case 2: C has external state
Suppose that when C recovers after a crash, it restarts the local replica of a database or file
 This assumes that the file or database is on the disk
 Thus it was not lost in the failure and can still be used
 But it may be missing many updates
SafeSend holds those updates. Our task: replay the missing ones. But how can we learn which are missing?

Identifying missing updates
In fact this is quite hard
Suppose that C crashes just as update X is being executed by the database or file system
 Perhaps X was completed and is still in the database
 Perhaps X was lost and is no longer in the database
 Worst case: perhaps X is partially applied and the file or database is corrupted. Then some form of cleanup will be needed. Isis2 doesn’t do this for you.

Replaying updates
You must enable the “DiskLogger” for SafeSend
 g.SetSafeSendThreshold(3);
 g.SetDurabilityMethod(new Group.DiskLogger(myGroup, gname + "-durabilitylog"));
Two steps: first, we tell SafeSend how many logged copies of each message are needed for safety
 If skipped, every group member must log all messages
 More than 2 is very conservative
Next, we give a file name for logging the updates
 Isis2 will automatically create K copies, one per group member, appending the member rank to the file name

On restart, SafeSend redelivers!
Until you call DiskLogger.Done() Isis2 SafeSend assumes that your database replica might not include the update yet
 So: You apply the update
 Then tell Isis2 the update is Done()
 Note: for asynchronous updates, obtain a “completion tag” from Isis2 for the action, supply it in Done()
The updates are delivered in order, but may repeat after a failure

Your job?
After restart any “perhaps not done” updates are redelivered to your handler
 You must check to see if this update was already applied to the file or database
 It is your task to find a way to check
 Apply, in order, if not yet done, then call Done()
 This way member C in our example can catch up
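One common way to do the "already applied?" check is to keep the sequence number of the last applied update alongside the data. The Python sketch below assumes each update carries such a sequence number and that the number is stored atomically with the data; the class, the handler signature and the `done` callback are all hypothetical stand-ins, not the Isis2 API.

```python
class Replica:
    """Stand-in for database replica C; all names here are hypothetical."""

    def __init__(self):
        self.balances = {}
        self.last_applied = 0   # assumed to be stored atomically with the data

    def handle_update(self, seq, key, delta, done):
        # Updates arrive in order but may repeat after a failure, so an
        # update is applied only if its sequence number is new.
        if seq > self.last_applied:
            self.balances[key] = self.balances.get(key, 0) + delta
            self.last_applied = seq
        done(seq)               # stands in for telling Isis2 the update is Done()

acked = []
c = Replica()
# Update 2 is redelivered, as could happen after a restart.
for seq, key, delta in [(1, "a", 1), (2, "a", 2), (2, "a", 2), (3, "b", 5)]:
    c.handle_update(seq, key, delta, acked.append)
```

The duplicate delivery of update 2 is acknowledged but not applied twice, so the replica ends in the same state as one that never saw the repeat.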
Failure concern
[Figure: the SafeSend timeline for the soft-state first-tier service with members A, B, C, D and db replicas. Copy C restarts and applies missing updates.]
SafeSend+DiskLogger allows C to catch up, but YOU need to detect and ignore duplicated updates

Would it be easier to use Paxos?
Many people who read about SafeSend think that perhaps using Paxos would be easier
 But in fact SafeSend is a version of Paxos!
The issue is that Paxos, like SafeSend, maintains its own set of logs, but then we need to apply the operations in the logs to the external database
 Paxos would call the external replica a “learner”
 After recovery from a crash a learner must learn any operations it missed. This is the exact problem SafeSend solves using the DiskLogger

Putting Objects into Messages
Telling Isis2 about user-defined classes
[Automarshall]

Isis2 moves data within messages
Isis2 has two data-moving subsystems
 The primary one moves data within messages. When you send a multicast, the message is created from the arguments you supply.
 In 2013 a new version is being added. It moves data out of band using memory-mapped files and is intended for big objects. This kind of data is moved as big byte arrays.
What kinds of data can be put into a message?

Messages
Internally, a message consists of a table listing the contents, followed by the data in a compacted form, followed by a checksum
 Isis2 already knows about built-in .NET data types (int, double/float, byte, bool, etc), arrays, lists
 The system also understands Isis2 Address objects, Views, Messages
 You can register additional data types of your own

Registering a new data type
Define a new class of your own
Two options
 You can request that the Isis2 “automarshaller” create messages and recreate the objects.
  Fields must be public, and there must be a null constructor for your class (one that takes no arguments)
 Or you can do the marshalling on your own, by providing a method ToBArray() that returns a byte[] and a constructor that takes a byte[] argument
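The second option amounts to a round trip between an object and a byte array. The sketch below shows that pattern in Python with the standard struct module; it is an analogy only, since real Isis2 types are .NET classes, and the class name, method names and the little-endian 4-byte layout are all assumptions made for illustration.

```python
import struct

class MyData:
    """Python analogue of a class that does its own marshalling."""

    def __init__(self, d=0):
        self.d = d

    def to_barray(self):
        # Plays the role of ToBArray(): encode the field as a
        # little-endian 4-byte signed integer.
        return struct.pack("<i", self.d)

    @classmethod
    def from_barray(cls, ba):
        # Plays the role of the constructor that takes a byte[] argument.
        (d,) = struct.unpack("<i", ba)
        return cls(d)

original = MyData(42)
encoded = original.to_barray()
decoded = MyData.from_barray(encoded)
```

Fixing the byte order in the encoding is what lets the same bytes be decoded on a machine of a different type.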
A trivial example
You call Msg.RegisterType(typeof(myData), TID);
 [AutoMarshalled]
 public class myData
 {
     public int d;
     public myData() { }
 }
TID should be a small integer (1..127) used throughout your collection of programs to identify this kind of object.

Private fields are not marshalled
Many classes have some data that should be moved from machine to machine, but other fields that do not need to be moved
 Simply declare the ones Isis2 should include as public and the others as internal or private
Note that Isis2 can only handle nested objects if they are of registered types that it knows about

Useful helper methods
It can be very helpful to know about:
 byte[] ba = Msg.ToBArray(object, object, .....);
 object[] objs = Msg.BArrayToObjects(ba);
 object[] objs = Msg.BArrayToObjects(ba, Type[] types);
 Msg.InvokeFromBArray(buffer, delegate(args...) { .... } )
There are a number of additional such methods

Isis2 messages...
... are strongly typed
... are encrypted when transmitted if you use g.SetSecure() to put the group into secure mode
... automatically handle “endian” issues
... can be saved into files and read out of later, even on a machine of some other type
... are language independent
... have a fairly “dense” (efficient) data representation

What about large objects?
For very large objects (many megabytes or more) Isis2 messages may not be efficient
 Transmission can be slow and encoding involves copying and perhaps fragmentation
For such cases we are implementing a new “out of band” memory-mapped file transfer utility
 Expected by end of summer 2013
 Big objects can be referenced as “attachments” like in email, and moved using an ultra-efficient direct method

Summary
We have learned about the FLP and CAP theorems
 Isis2 and other systems must live with these, but they do not prevent us from guaranteeing correctness, consistency, fault-tolerance and high performance
We looked at various message safety and ordering properties
 Tradeoffs between SafeSend and optimistic early delivery with Send+Flush
 Durability for SafeSend, replicating an external database
Finally we looked at some special cases and at the Isis2 message subsystem