Reintroducing Consistency in Cloud Settings

Ken Birman, Cornell University
Sept 24, 2009
Cornell Dept of Computer Science Colloquium
The “realtime web”

Simple ways to create and share collaboration and social-network applications
[Try it! http://liveobjects.cs.cornell.edu]

Examples: Live Objects, Google “Wave”, JavaScript/AJAX, Silverlight, JavaFX, Adobe Flex and AIR, etc.

Cloud computing entails building massive distributed systems
 They use replicated data, sharded relational databases, parallelism
 Brewer’s “CAP theorem”: must sacrifice Consistency for Availability and Partition tolerance

Cloud providers believe this theorem. My view:

Long ago, we knew how to build reliable, consistent distributed systems.
We gave up on consistency too easily.

Partly, superstition… albeit backed by some painful experiences.

Don’t believe me? Just ask the people who really know…
As described by Randy Shoup (eBay) at LADIS 2008
Thou shalt…
1. Partition Everything
2. Use Asynchrony Everywhere
3. Automate Everything
4. Remember: Everything Fails
5. Embrace Inconsistency


Werner Vogels is CTO at Amazon.com…
His first act? He banned reliable multicast*!
 Amazon was troubled by platform instability
 Vogels decreed: all communication via SOAP/TCP

This was slower… but
 Stability matters more than speed

* Amazon was (and remains) a heavy pub-sub user


Key to scalability is decoupling, the loosest possible synchronization
Any synchronized mechanism is a risk
 His approach: create a committee
 Anyone who wants to deploy a highly consistent mechanism needs committee approval
… They don’t meet very often

Applications are structured as stateless tasks
 Azure decides when and how much to replicate them, and can pull the plug as often as it likes
 Any consistent state lives in backend servers running SQL Server… but application design tools encourage developers to run locally if possible

“Consistency technologies just don’t scale!”

This is the common thread:

All three guys (and Microsoft too)
 really build massive data centers that work
 and are opposed to “consistency mechanisms”
A consistent distributed system will often have many components, but users observe behavior indistinguishable from that of a single-component reference system.

[Figure: a multi-component Implementation shown beside the single-component Reference Model it emulates]

They reason this way:
 Systems that make guarantees put those guarantees first and struggle to achieve them
 For example, any reliability property forces a system to retransmit lost messages, use acks, etc.
 But modern computers often become unreliable as a symptom of overload… so these consistency mechanisms will make things worse, by increasing the load just when we want to ease off!

So consistency (of any kind) is a “root cause” for meltdowns, oscillations, thrashing. The suspect mechanisms include:

 Transactions that update replicated data
 Atomic broadcast or other forms of reliable multicast protocols
 Distributed 2-phase locking mechanisms

Our systems become “eventually” consistent but can lag far behind reality.

Thus application developers are urged not to assume consistency, and to avoid anything that will break if inconsistency occurs.
[Figure: three executions of processes p, q, r, s, and t over time 0–70, applying the updates A=3, B=7, A=A+1, B=B-A: a non-replicated reference execution, a synchronous execution, and a virtually synchronous execution]
Synchronous runs are indistinguishable from a non-replicated object that saw the same updates (as in Paxos).
Virtually synchronous runs are indistinguishable from synchronous runs.

During the 1990s, Isis was a big success
 The French air traffic control system, the New York Stock Exchange, and the US Navy AEGIS are some blue-chip examples that used (or still use!) Isis
 But there were hundreds of less high-profile users

However, it was not a huge commercial success
 The focus was on server replication, and in those days few companies had big server pools

What remained was a collection of weaker products that, nonetheless, were sometimes highly toxic.
For example, publish-subscribe message bus systems that use IPMC are notorious for massive disruption of data centers!

Among systems with strong consistency models, only Paxos is widely used in cloud systems (and its role is strictly for locking).
[Plot: message throughput (messages/s, 0–12,000) versus time (250–850 s), oscillating as the message bus destabilizes]
[Cartoon: tenant Tommy Tenant learns that his $1150.00 Sept 2009 rent check to Jason Fane Properties bounced: “My rent check bounced? That can’t be right!”]

Inconsistency causes bugs
 Clients would never be able to trust servers… a free-for-all
Weak or “best effort” consistency?
 Strong security guarantees demand consistency
 Would you trust a medical electronic-health-records system, or a bank, that used “weak consistency” for better scalability?

To reintroduce consistency we need
 A scalable model
▪ Should this be the Paxos model? The old Isis one?
 A high-performance implementation
▪ Can handle massive replication for individual objects
▪ And massive numbers of objects
▪ Won’t melt down under stress
▪ Not prone to oscillatory instabilities or resource-exhaustion problems

I’m reincarnating group communication!
 Basic idea: imagine the distributed system as a world of “live objects”, somewhat like files
 They float in the network and hold data when idle
 Programs “import” them as needed at runtime
▪ The data is replicated, but every local copy is accurate
▪ Updates and locking go via distributed multicast; reads are purely local; failure detection is automatic & trustworthy

A library… highly asynchronous…

// Join (or create) the group and register an update handler
Group g = new Group("/amazon/something");
g.register(UPDATE, myUpdtHandler);

// Multicast an update; every member's handler is invoked
g.cast(UPDATE, "John Smith", new_salary);

public void myUpdtHandler(string empName, double salary)
{ …. }

Just ask all the members to do “their share” of the work:

Replies = g.query(LOOKUP, "Name=*Smith");
g.callback(myReplyHndlr, Replies, typeof(double));

public void lookup(string who) {
    // divide the work into viewSize() chunks;
    // this replica searches chunk # getMyRank()
    reply(myAnswer);
}

public void myReplyHndlr(double[] whatTheyFound) { … }
// The same pattern, end to end:
Group g = new Group("/amazon/something");
g.register(LOOKUP, myLookup);
Replies = g.query(LOOKUP, "Name=*Smith");
g.callback(myReplyHndlr, Replies, typeof(double));

public void myLookup(string who) {
    // divide the work into viewSize() chunks;
    // this replica searches chunk # getMyRank()
    …
    reply(myAnswer);
}

public void myReplyHndlr(double[] fnd) {
    foreach (double d in fnd)
        avg += d;     // e.g., average the per-member answers
    …
}


The group is just an object.
The user doesn’t experience sockets… marshalling… preprocessors… protocols…
 As much as possible, they just provide arguments as if this were a kind of RPC, but there is no preprocessor
 Sometimes they provide a list of types and Isis does a callback

Groups have replicas… handlers… a “current view” in which each member has a “rank”
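
To make “current view” and “rank” concrete, here is a minimal sketch in the style of the examples above. The VIEW event and the View class are assumptions for illustration, not necessarily the real Isis2 API:

// Hypothetical sketch: reacting to membership changes
Group g = new Group("/amazon/something");
g.register(VIEW, myViewHandler);     // VIEW event name is assumed

public void myViewHandler(View v) {
    // Every member delivers the same sequence of views, so all
    // replicas agree on membership and on their own rank within it
    Console.WriteLine("view has " + v.viewSize()
        + " members; my rank is " + v.getMyRank());
}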

Can’t we just use Paxos?
 In recent work (a collaboration with MSR Silicon Valley) we’ve merged the models; our model “subsumes” both

This new model is more flexible:
 Paxos is really used only for locking
 Isis can be used for locking, but can also replicate data at very high speeds, with dynamic membership, and support other functionality
 Isis2 will be much faster than Paxos for most group replication purposes (1000x or more)

[Building a Dynamic Reliable Service. Ken Birman, Dahlia Malkhi, and Robbert van Renesse. 2009 technical report; in submission to SOCC 10 and ACM Computing Surveys.]

Unbreakable TCP connections that terminate in groups
 [Burgess ‘10] describes Robert Burgess’ new r-TCP solution
 Groups use some form of state-machine replication scheme

State transfer and persistence

Locking and other coordination paradigms (see the sketch below)

2PC and transactional 1-copy serializability

Publish-subscribe with topic or content filtering (or both)
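
As one example of layering coordination over a group, here is a minimal locking sketch. It assumes LOCK/UNLOCK requests are delivered to every member in the same total order; the handler names and grant() are illustrative, not the actual Isis2 API:

// Hypothetical sketch: a lock built over totally ordered multicast.
// Every replica delivers requests in the same order, so all replicas
// build the same queue and agree on the lock holder.
Queue<string> waiters = new Queue<string>();

public void onLock(string requester) {
    waiters.Enqueue(requester);
    if (waiters.Peek() == requester)
        grant(requester);             // grant() is application-defined
}

public void onUnlock(string requester) {
    waiters.Dequeue();                // the releaser was at the head
    if (waiters.Count > 0)
        grant(waiters.Peek());        // lock passes to the next waiter
}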

Isis2 has a lot in common with an operating system and is internally very complex
 A distributed communication layer manages multicast, flow control, reliability, failure sensing
 Agreement protocols track group membership, maintain group views, implement virtual synchrony
 Infrastructure services build messages, handle callbacks, keep groups healthy

To scale really well we need to take full advantage of the hardware: IPMC

But IPMC was the root cause of the oscillation shown on the prior slide
 Traditional IPMC systems can overload the router and melt down
 The issue is that routers have a small “space” for active IPMC addresses
 In [Vigfusson, et al ‘09] we show how to use optimization to manage the IPMC space
 In effect, it merges similar groups while respecting limits on the routers and switches

[Figure: the system melts down at ~100 groups]


The algorithm is by Vigfusson and Tock [HotNets 09, LADIS 2008, submission to Eurosys 10]
It uses a k-means clustering algorithm
 The generalized problem is NP-complete
 But the heuristic works well in practice
Assign IPMC and unicast addresses so as to:
 (1) minimize network traffic
 keep receiver filtering below a set percentage (hard constraint)
 use at most M IPMC addresses (hard constraint)
• The policy prefers sender load over receiver load
• Intuitive control knobs are part of the policy (a toy merge test follows below)
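
As a toy illustration of the merge decision (not the actual Vigfusson-Tock heuristic), two topics whose member-interest vectors overlap strongly can share one physical IPMC address, at the price of some receiver filtering; the vectors and threshold below are illustrative:

// Hypothetical sketch: similarity test on binary "user-interest" vectors
static double Jaccard(int[] a, int[] b) {
    int both = 0, either = 0;
    for (int i = 0; i < a.Length; i++) {
        if (a[i] == 1 && b[i] == 1) both++;
        if (a[i] == 1 || b[i] == 1) either++;
    }
    return either == 0 ? 1.0 : (double)both / either;
}

int[] topicA = { 0,1,1,1,1,1,1,0,0,1,1,1 };   // vectors as in the figure
int[] topicB = { 1,1,1,1,1,0,1,0,1,0,1,1 };
// Merge the topics into one IPMC group only if overlap is high enough
bool merge = Jaccard(topicA, topicB) > 0.8;   // threshold is illustrative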
[Figure sequence: topics such as “FGIF BEER GROUP” and “FREE FOOD” are points in a “user-interest” space, represented as binary vectors, e.g. (0,1,1,1,1,1,1,0,0,1,1,1) and (1,1,1,1,1,0,1,0,1,0,1,1). Nearby topics are clustered and each cluster receives a physical IPMC address (224.1.2.3, 224.1.2.4, 224.1.2.5); the assignment trades sending cost against receiver filtering cost, and outlying topics fall back to unicast. A final diagram shows the heuristic interposed between processes and the network, mapping each process’s logical IPMC (L-IPMC) addresses]
• Processes use “logical” IPMC addresses
• Dr. Multicast transparently maps these to true IPMC addresses or 1:1 UDP sends, as sketched below
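
A minimal sketch of that translation step, assuming a mapping computed by the optimizer; the class, field names, and port are illustrative, and the real Dr. Multicast library interposes transparently:

// Hypothetical sketch: one logical group, mapped to IPMC or unicast
// (uses System.Net, System.Net.Sockets, System.Collections.Generic)
class LogicalGroup {
    public IPAddress ipmc;               // non-null if given a real IPMC address
    public List<IPEndPoint> members;     // fallback: point-to-point UDP
    static UdpClient udp = new UdpClient();

    public void Send(byte[] msg) {
        if (ipmc != null)
            udp.Send(msg, msg.Length, new IPEndPoint(ipmc, 9753));  // one IPMC send
        else
            foreach (IPEndPoint m in members)                       // 1:1 UDP sends
                udp.Send(msg, msg.Length, m);
    }
}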

We looked at various group scenarios
 Most of the traffic is carried by <20% of the groups
 For IBM WebSphere, Dr. Multicast achieves an 18x reduction in physical IPMC addresses

[Dr. Multicast: Rx for Data Center Communication Scalability. Ymir Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman, and Yoav Tock. LADIS 2008, November 2008. Full paper submitted to Eurosys 10.]


For small groups, reliable multicast protocols directly ack/nack the sender
For large ones, use the QSM technique: tokens circulate within a tree of rings
 Acks travel around the rings and aggregate over the members they visit (an efficient token encodes the data); see the sketch after the citation below
 This scales well even with many groups
 Isis2 uses this mode for groups of more than 25 members, with each ring containing ~25 nodes

[Quicksilver Scalable Multicast (QSM). Krzys Ostrowski, Ken Birman, and Danny Dolev. Network Computing and Applications (NCA ’08), July 2008, Boston.]
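
A minimal sketch of the token idea, assuming each member knows its ring successor; the Token fields and the my*() helpers are illustrative, not QSM’s actual encoding:

// Hypothetical sketch: each member folds its state into a circulating
// token, so the sender sees one aggregated report per circuit of the
// ring rather than one ack per receiver
class Token {
    public int stableThrough = int.MaxValue;          // all saw msgs up to this seqno
    public HashSet<int> missing = new HashSet<int>(); // aggregated nacks
}

public Token onToken(Token t) {
    t.stableThrough = Math.Min(t.stableThrough, myHighestContiguousSeqno());
    foreach (int s in myMissingSeqnos())
        t.missing.Add(s);      // missing seqnos trigger retransmission
    return t;                  // forward to the ring successor
}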

Flow control is needed to prevent bursts of multicast from overrunning receivers

The AJIL protocol imposes limits on the IPMC rate
 AJIL monitors the aggregated multicast rate
 It uses optimization to apportion bandwidth
 If the limit is exceeded, the user perceives a “slower” multicast channel (sketched after the citation below)

[AJIL: Distributed Rate-Limiting for Multicast Networks. Hussam Abu-Libdeh, Ymir Vigfusson, Ken Birman, and Mahesh Balakrishnan (Microsoft Research, Silicon Valley). Cornell University TR, Dec 08.]
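
The user-visible effect can be pictured as a token bucket over the group’s apportioned budget. This sketch shows only that effect, not AJIL’s distributed optimization; all names are illustrative:

// Hypothetical sketch: a sender-side token bucket; when the budget is
// exhausted the sender briefly blocks, so the channel feels "slower"
class MulticastRateLimiter {
    double tokens, budget;               // budget: msgs/sec apportioned to this group
    DateTime last = DateTime.UtcNow;

    public MulticastRateLimiter(double msgsPerSec) {
        budget = msgsPerSec;
        tokens = msgsPerSec;
    }

    public void BeforeCast() {           // call before each g.cast(...)
        while (true) {
            DateTime now = DateTime.UtcNow;
            tokens = Math.Min(budget, tokens + (now - last).TotalSeconds * budget);
            last = now;
            if (tokens >= 1.0) { tokens -= 1.0; return; }
            System.Threading.Thread.Sleep(10);
        }
    }
}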


AJIL reacts rapidly to load surges and stays close to its targets (and we’re improving it steadily)
This makes it possible to eliminate almost all IPMC message loss within the data center!
Challenges → Solutions

Challenge: Distributed computing is hard, and our target developers have limited skills.
Solution: Make group communication look as natural to the developer as building a .NET GUI.

Challenge: Raw performance is critical to success.
Solution: Consistency at the “speed of light”, using lossless IPMC to send updates.

Challenge: IPMC can trigger resource exhaustion and loss by entering “promiscuous” mode and overrunning receivers.
Solution: Optimization-based management of IPMC addresses reduces the number of IPMC groups 100:1, and the AJIL flow-control scheme prevents overload.

Challenge: Users will generate massive numbers of groups, not just high rates of events.
Solution: Aggregation, aggregation, aggregation… all automated and transparent to users.

Challenge: Reliable protocols in massive groups result in ack implosions.
Solution: For big groups, deploy hierarchical ack/nack rings (an idea from Quicksilver).

Challenge: Many existing group communication systems are insecure.
Solution: Use replicated group keys to secure membership and sensitive data (a sealing sketch follows this table).

Challenge: What about C++ and Python on Linux?
Solution: Port the platform to Linux with Mono, then offer C++/Python support using remoting.
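
For the security row, a minimal sketch of sealing a multicast payload under a replicated group key; how the key is distributed (e.g., alongside view installation) is omitted, and the helper is illustrative:

// Hypothetical sketch (uses System.Security.Cryptography): encrypt a
// payload under a symmetric key that every group member holds
static byte[] Seal(byte[] groupKey, byte[] plaintext, out byte[] iv) {
    using (Aes aes = Aes.Create()) {
        aes.Key = groupKey;              // 16/24/32-byte replicated group key
        aes.GenerateIV();
        iv = aes.IV;                     // sent in the clear with the ciphertext
        using (ICryptoTransform enc = aes.CreateEncryptor())
            return enc.TransformFinalBlock(plaintext, 0, plaintext.Length);
    }
}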

Isis2 is coming soon… initially on .NET

Developers will think of distributed groups very much as they think of objects in C#.
 A friendly, easy-to-understand model
 And, under the surface, theoretically rigorous
 Yet fast and secure too

All the complexities of distributed computing are swept into this library… users have a very insulated and easy experience



.NET supports ~40 languages, all of which can call Isis2 directly

On Linux, we’ll do a Mono port and then build an outboard server that offers a remoted library interface
 C++ and other Linux languages/applications will simply run off this server, unless they are comfortable running under Mono of course

The code extensively leverages
 The reflection capabilities of C#, even when called from one of the other .NET languages (a dispatch sketch follows below)
 The component architecture of .NET, which means users will already have the right “mind set”
 Powerful prebuilt data types such as HashSets
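
A minimal sketch of the kind of reflection-based dispatch this enables; illustrative only, not the Isis2 internals:

// Hypothetical sketch (uses System.Reflection): invoke a registered
// handler with typed, unmarshalled arguments and no preprocessor
static void Dispatch(Delegate handler, object[] args) {
    if (handler.Method.GetParameters().Length != args.Length)
        throw new ArgumentException("wrong number of arguments");
    // DynamicInvoke type-checks at runtime, so a handler declared as
    // myUpdtHandler(string empName, double salary) receives typed values
    handler.DynamicInvoke(args);
}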

All of this makes Isis2 simpler and more robust: roughly a 3x improvement compared to the older C/C++ version of Isis!

I’m building this system (myself) as a sabbatical project… the code is mostly written

The goal is to run this system on 500 to 500,000-node deployments, with millions of object groups

The initial byte-code-only version will be released under a FreeBSD license