Virtual Synchrony

Justin W. Hart
CS 614
11/17/2005
Papers


The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.
Background





Chandy-Lamport Logical Clocks
Consistent Cuts
Distributed Snapshots
Publish/Subscribe
Fail-Stop
Fail Stop



Group Membership Service
Processes appear to fail by halting
How does this affect the FLP result?
Motivation





Information Backplane
Customization
Hierarchical Structure
Fault-Tolerance
Reliability
Process Groups
Types of groups

Anonymous groups

Explicit groups
Implementation Requirements

Group communication

Group membership as input

Synchronization
Anonymous Groups




Group addressing
Messages sent exactly once to all or no
recipients
Ordering
Logging
Explicit Groups

Group members cooperate directly


May execute algorithms based on
membership knowledge
Communication is sensitive to membership
changes
Building groups over
conventional technology






Conventional message passing
technologies
Group addressing
Logical time & causal dependency
Message delivery ordering
State transfer
Fault tolerance
Close Synchrony

Close Synchrony

100% lock-step execution model
A synchronous execution
[Figure: time-lines for processes p, q, r, s, t, u]

With true synchrony executions run in
genuine lock-step.
So… what’s wrong with that?

Under close synchrony, execution is
limited by the slowest process in the
group!
Virtual Synchrony



Relax synchronization requirements
where possible
Benefit by allowing for asynchronous
interactions
Do this where the result is identical to
close synchrony
A few protocols…




fbcast
cbcast
abcast
gbcast
Four protocols!?!?

…but Justin. The paper only discussed
2 protocols… you’re getting off-topic!
A few protocols…

fbcast






Simple protocol upon which we’ll build the others.
Delivery is FIFO ordered, with respect to the
original sender
Accomplished easily with a logical timestamp
cbcast
abcast
gbcast
Single updater

If p is the only update source, the need
is a bit like the TCP “fifo” ordering
[Figure: time-lines for p, r, s, t; p's updates are numbered 1, 2, 3, 4]

fbcast is a good choice for this case
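A minimal sketch of the receiver side of FIFO delivery, assuming each sender stamps its multicasts with a per-sender sequence number starting at 0. The class name `FifoReceiver` and its fields are hypothetical and not taken from ISIS; the point is only to show that out-of-order arrivals from one sender are held back until the gap is filled.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers each sender's messages in the order that sender sent them."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        if seq < self.next_seq[sender]:
            return  # duplicate of something already delivered
        self.pending[sender][seq] = payload
        # Deliver as long as the next expected message from this sender is here.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1


# p is the only updater; its four updates arrive out of order at one receiver.
r = FifoReceiver()
for seq, payload in [(1, "update 2"), (0, "update 1"), (3, "update 4"), (2, "update 3")]:
    r.receive("p", seq, payload)
print([payload for _, payload in r.delivered])
```

Nothing here orders messages from different senders against each other, which is why the single-updater case is the natural fit.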
A few protocols…


fbcast
cbcast





Receipt is causally ordered
Protocol in paper uses token passing
Another simple protocol uses vector
timestamps
abcast
gbcast
Causally ordered updates

Simple protocol
based on token
passing
Causally ordered updates

[Figure: time-lines for p, r, s, t with messages a, b, c]
Example: messages from p and s arrive out of order at t
VT(a) = [0,0,0,1]
VT(b) = [1,0,0,1]
VT(c) = [1,0,1,1]
c is early: VT(c) = [1,0,1,1] but VT(t) = [0,0,0,1]: clearly we are missing one message from p
When b arrives, we can deliver both it and message c, in order
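The example can be replayed with a short sketch of the vector-timestamp variant of causal delivery. This is the commonly described delivery rule, not the token-passing protocol from the paper. The entry order [p, r, s, t] is an assumption read off the process list in the figure, and the names `deliverable` and `CausalReceiver` are hypothetical.

```python
PROCS = ["p", "r", "s", "t"]   # assumed vector entry order, as in the figure


def deliverable(vt_msg, sender, vt_local):
    """Causal-delivery test for a vector-timestamped multicast."""
    j = PROCS.index(sender)
    for k in range(len(PROCS)):
        if k == j:
            if vt_msg[k] != vt_local[k] + 1:   # must be the sender's next message
                return False
        elif vt_msg[k] > vt_local[k]:          # depends on a message not yet seen
            return False
    return True


class CausalReceiver:
    def __init__(self, vt):
        self.vt = list(vt)
        self.delayed = []      # (name, sender, vt) waiting on dependencies
        self.delivered = []    # message names in delivery order

    def receive(self, name, sender, vt_msg):
        self.delayed.append((name, sender, vt_msg))
        progress = True
        while progress:        # keep scanning until nothing more can be delivered
            progress = False
            for msg in list(self.delayed):
                m_name, m_sender, m_vt = msg
                if deliverable(m_vt, m_sender, self.vt):
                    self.delayed.remove(msg)
                    j = PROCS.index(m_sender)
                    self.vt[j] = m_vt[j]
                    self.delivered.append(m_name)
                    progress = True


t = CausalReceiver([0, 0, 0, 1])      # t has already delivered its own message a
t.receive("c", "s", [1, 0, 1, 1])     # early: depends on b, which t has not seen
held = list(t.delivered)              # nothing delivered yet
t.receive("b", "p", [1, 0, 0, 1])     # b arrives: b and then c become deliverable
print(held, t.delivered, t.vt)
```

A receiver only waits on the specific messages named by the timestamp, never on the group as a whole.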
Causally ordered updates

Each thread corresponds to a different lock
[Figure: time-lines for p, r, s, t with numbered red and green update events]
In effect: red “events” never conflict with
green ones!
Hey… that sped things up!

Now I get it! Processes only have to wait for the processes that they depend on, not the slowest in the group!
A few protocols…



fbcast
cbcast
abcast

Atomic delivery ordering




With respect to other abcasts
More costly than cbcast, but with a stronger
ordering property
ISIS builds abcast over cbcast
gbcast
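One way to picture a total delivery order is a fixed sequencer that stamps every abcast with a global number. This is a swapped-in technique chosen for brevity, not the ISIS construction of abcast over cbcast, and the names `Sequencer` and `TotalOrderReceiver` are hypothetical.

```python
class Sequencer:
    """Hypothetical fixed sequencer: stamps every abcast with a global number."""

    def __init__(self):
        self.next_no = 0

    def stamp(self, msg_id):
        no = self.next_no
        self.next_no += 1
        return (no, msg_id)


class TotalOrderReceiver:
    def __init__(self):
        self.payloads = {}   # msg_id -> payload, as received from the sender
        self.order = {}      # global number -> msg_id, as received from the sequencer
        self.next_no = 0     # next global number to deliver
        self.delivered = []

    def on_data(self, msg_id, payload):
        self.payloads[msg_id] = payload
        self._drain()

    def on_order(self, no, msg_id):
        self.order[no] = msg_id
        self._drain()

    def _drain(self):
        # Deliver strictly by global number, once both number and payload are here.
        while self.next_no in self.order and self.order[self.next_no] in self.payloads:
            self.delivered.append(self.payloads[self.order[self.next_no]])
            self.next_no += 1


seq = Sequencer()
o1 = seq.stamp("m_p")    # abcast from p
o2 = seq.stamp("m_s")    # concurrent abcast from s

a, b = TotalOrderReceiver(), TotalOrderReceiver()
a.on_data("m_p", "x=1"); a.on_data("m_s", "x=2"); a.on_order(*o1); a.on_order(*o2)
b.on_order(*o2); b.on_data("m_s", "x=2"); b.on_data("m_p", "x=1"); b.on_order(*o1)
print(a.delivered, b.delivered)
```

Both receivers deliver the two concurrent updates in the same order even though the network handed them over differently. The extra round through the sequencer hints at why abcast costs more than cbcast.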
A few protocols…




fbcast
cbcast
abcast
gbcast

Atomic delivery ordering

With respect to everything
Three Round Multicast
As a time-line picture
[Figure: 2PC time-line for p, q, r, s, t. Phase 1: the 2PC initiator asks “Vote?” and all vote “commit”. Phase 2: the initiator sends “Commit!”]
Just one more…
Flush protocol

We say that a message is unstable if
some receiver has it but (perhaps)
others don’t


For example, q’s message is unstable at
process r
If q fails we want to “flush” unstable
messages out of the system
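A rough sketch of the flush idea, assuming each surviving process still buffers the messages it received from the failed sender. The function `flush` and its data layout are hypothetical, and a real protocol would track acknowledgements instead of forwarding everything it holds.

```python
def flush(received, view, failed):
    """received: process -> set of (sender, msg_id) pairs it currently holds."""
    survivors = [p for p in view if p != failed]
    # Collect every message from the failed sender that some survivor holds.
    # A real protocol would skip messages already known to be stable.
    to_flush = set()
    for p in survivors:
        to_flush |= {m for m in received[p] if m[0] == failed}
    # Forward them so that all survivors hold the same set...
    for p in survivors:
        received[p] |= to_flush
    # ...and only then install the new view without the failed process.
    return survivors


received = {
    "p": set(),
    "r": {("q", 1)},   # q's message reached r only: it is unstable
    "s": set(),
}
new_view = flush(received, ["p", "q", "r", "s"], failed="q")
print(new_view, received)
```

After the flush, either every survivor has q's message or none does, which is the all-or-nothing outcome the new view is meant to start from.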
Styles of groups

Peer Groups
Processes cooperate closely

Client-Server Groups
Group acts as a server
Client multicasts repeatedly to the group

Diffusion Groups
Group serves information
Clients connect to receive data from group

Hierarchical Groups
Offer scalability through a hierarchy of connected groups
Historical Aside

Two major classes of real systems

Virtual synchrony




Weaker properties – not quite “FLP consensus”
Much higher performance (orders of magnitude)
Requires that majority of system remain connected.
Partitioning failures force protocols to wait for repair
Quorum-based state machine protocols are



Closer to FLP definition of consensus
Slower (by orders of magnitude)
Sometimes can make progress in partitioning situations
where virtual synchrony can’t
Names of some famous systems

Isis was the first practical virtual synchrony system
Later followed by Transis, Totem, Horus
Today: best options are JGroups, Spread, Ensemble
Technology is now used in IBM WebSphere and Microsoft Windows Clusters products!

Paxos was the first major state machine system
BASE and other Byzantine Quorum systems now getting attention from the security community
(End of Historical aside)
Sounds good… what’s wrong
with it?



Tries to solve state problems at
communication level
This violates the end-to-end argument!
Consistency requirements are typically
stated with respect to application state
Stable vs Durable


Stable – messages are buffered until
received by all group members
Durable – message will be delivered,
even if the sender dies
Ordering semantics



Incidental Ordering
Semantic Ordering
Prescriptive Ordering
The problem with CATOCS




It can’t say “for sure”
It can’t say the “whole story”
It can’t say “together”
It can’t say it efficiently
It can’t say “for sure”

Processes communicating over a “hidden” channel
Common database
Shared memory
Two threads reacting to external event
It can’t say “together”



Standard solution – locking
Transaction models allow for abort and
rollback
Higher level conditions… what happens if a message arrives, but is not successfully processed?
Stock trading example
Can’t say the “whole story”





Not everything can be expressed through the
“happens-before” relationship
Semantic ordering constraints
Causal memory, the weakest of these, cannot
be expressed in causal multicast
Total ordering helps some of these, but is far
too expensive
Inexpensive, state-level protocols with logical
clocks can solve these
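A minimal sketch of what a state-level fix might look like for the stock trading example, assuming every stock price carries a version number and every derived (theoretical) price names the version it was computed from. The class `Monitor`, its methods, and the prices are hypothetical illustrations, not code from either paper.

```python
class Monitor:
    """Shows a derived price only if it matches the latest stock price version."""

    def __init__(self):
        self.stock_version = 0
        self.stock_price = None
        self.derived = {}    # base stock version -> theoretical price

    def on_stock(self, version, price):
        if version > self.stock_version:      # ignore stale or reordered updates
            self.stock_version, self.stock_price = version, price

    def on_theoretical(self, base_version, price):
        self.derived[base_version] = price

    def view(self):
        # None for the derived price means "not yet computed for this version".
        return self.stock_price, self.derived.get(self.stock_version)


m = Monitor()
m.on_stock(1, 25.5)
m.on_stock(2, 26.75)
m.on_theoretical(1, 3.10)      # computed from the old price, arrives late
stale_view = m.view()          # the stale derived price is not paired with v2
m.on_theoretical(2, 3.90)
fresh_view = m.view()
print(stale_view, fresh_view)
```

The check lives in application state, so it works regardless of the order in which the communication layer delivers the two message streams.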
It can’t say it efficiently

False causality



Potential causality != Actual causality
Memory requirements for buffering
“unstable” messages
Ordering information during
transmission and reception
And… what of the end to end
argument?

All of this considers our communication
channels… isn’t the application-level
check far more important?
Classes of distributed
applications

Data dissemination







Netnews
Trading application example
Global predicate evaluation
Transactional applications
Replicated data
Replication in the large
Distributed real-time applications
Implementing only part of the
messaging?

Can you cut down on overhead by
implementing only part of the
messaging using CATOCS?
Semantics

Are the semantics of state-based
approaches superior to those of virtual
synchrony?
Scalability


N Processes
Time T to propagate a message across
the system



Grows roughly proportional with the square
root of the number of processes
Arcs in the active causal graph grow
quadratically
Quadratic causal graph
Buffering grows





Quadratic arcs
Linear communication of causal dependencies
Linear growth in required buffering
Changing topologies doesn’t help
CATOCS would require separate process
groups for read and write to accomplish
optimization of updates vs queries
Group membership protocols

Must enforce atomic delivery semantics


Run our most expensive protocol… gbcast
Failures increase with the size of the
system, increasing load on the GMS
Who uses ISIS?


Brokerage
Database replication and triggers
ISIS-based utilities

NEWS


A pub/sub application that will replay histories
NMGR


Manages batch-style jobs and performs
load sharing
Parallel make
ISIS-based utilities

DECEIT
NFS compatible file system

META/LOMITA
Sensors & actuators
Abstract sensors
Specify control actions in high-level terms

SPOOLER/LONG-HAUL FACILITY
Now… somewhat supported






ISIS/Horus/Ensemble/QuickSilver
JGroups
Spread
Totem
Transis
WebSphere & Windows Cluster
(internally)
…and people actually use it.



NYSE
French ATC System
AEGIS
An ongoing debate


The effort continues here at Cornell
with the QuickSilver effort
You’ve been presented the options…
what are your conclusions?
References





Some slides borrowed from Ken Birman’s CS 614 slide sets on Virtual Synchrony
http://www.cs.cornell.edu/courses/cs514/2005sp/Slide%20Sets.htm
Images have been borrowed from The Process Group Approach to Reliable Distributed Computing. Birman. CACM, Dec 1993, 36(12):37-53.
Images have been borrowed from Understanding the Limitations of Causally and Totally Ordered Communication. Cheriton and Skeen. 14th SOSP, 1993.
Statements and ideas have been borrowed verbatim from both papers, including section headings, and statements in notes. This has been mostly for coherence between the slides and papers.
Also sourced data from http://www.cs.cornell.edu/ken/