Introduction to High Assurance Cloud Computing

1
A BRIEF INTRODUCTION TO
HIGH ASSURANCE CLOUD
COMPUTING WITH ISIS2
Cornell University
Ken Birman
2
About the Lecturer
An introduction to the lecturer
Ken Birman
3




Researcher in high assurance computing
since joining Cornell in 1982 (PhD U.C. Berkeley).
Currently Cornell’s N. Rama Rao Professor of Computer
Science.
ACM Fellow, Winner of IEEE Tsutomu Kanai Award
Built the distributed software infrastructure used for a
decade by the New York Stock Exchange, and still used
in the French Air Traffic Control System, the US Navy
AEGIS and several other mission-critical systems.
Contact information at http://www.cs.cornell.edu/ken
4
Segment I: The Cloud Landscape
Introducing terminology
Informal description of goals
High Assurance and the Cloud
5

Cloud Computing: The new universal standard
- A technology for federating network services
- Easy to share data, deeply integrated with web pages
- Supports a wide range of media types

But the cloud can’t offer high assurance today!
- A wave of sensitive applications is approaching (areas like mHealth, smart power grid, eBanking, smart cars...)
- They need strong guarantees... what can we do to help?
How does today’s cloud work?
6



Client platform: browsers and “apps”, which are
programs that exploit a stripped-down browser API
Internet transports the data
Data centers run “web services” that produce the
pages we see, stream videos, etc
Each step embodies weaknesses
7


The client system is vulnerable to loss of connectivity,
compromise by downloaded code and infection by
viruses and worms.
The Internet layer is potentially unreliable
- The mapping of domain names to IP addresses is very complex (a consequence of the cloud’s need to “steer” traffic)
- Network reliability is much lower than it needs to be
- It is much too easy to snoop on traffic or attack connections


The Web Services infrastructure can fail or reconfigure
abruptly, forcing the client to reconnect
Recipe for high assurance
8

Design a system to fail only in “safe ways”
- Nobody gets hurt, but perhaps the system reports that it has gone offline


Then do everything practical to enhance reliability,
consistency, security, other needed properties
Today: Focus on the web services running on the
cloud data center
Tradeoffs in “cloud space”
9
Often weak or lacking
Required

The properties we need are in tension!
- Snappy response: Every 100ms matters
- Elasticity: Load varies suddenly and dramatically, so service replication levels need to vary accordingly
- Consistency: If distinct service replicas “talk” to multiple clients about something, they don’t say contradictory things
- Fault-tolerance: If a replica crashes, the cloud self-heals
- Attack-tolerance: The service is very hard to attack
- Security: Authenticated clients are limited to performing authorized actions in accordance with a policy
- Privacy: I can control who uses my data and how

Today’s cloud: As fast as possible
10

In the race to offer the fastest possible services to the largest possible number of clients, today’s cloud often gives up on other assurance properties
- In some sense the cloud is insecure and inconsistent by design!

... but does it have to be that way?
Tomorrow: A high assurance cloud!
11


A single system needs to tell multiple kinds of
assurance stories and not all in the same way
An mHealth application:
- Needs to reassure the user that it is trustworthy
- Needs to help the developer make the right choices
- Must implement complex protocols correctly
- Must be a good citizen on the cloud data center
12
Segment II: Examples
A few slides each on some challenging problems
Each needs the cloud... but each needs some
form of strong assurance guarantee too
Example 1: Power grid
13

Today’s power grid has serious issues
- Wasteful: As much as 15% of power is lost just moving it around, and a great deal of “renewable” energy (solar, wind, tides) is lost because of poor integration with the standard grid
- Rigid: Ideally, the grid should “adapt” and move parcels of power much as the Internet moves packets
- Dumb: Even when it is obvious that we could optimize behavior, the grid uses old, inefficient techniques

Goal: A “smart” power grid!
How a small power grid operates
14

Power flows “like water”
- Path of least resistance
Governed by Kirchhoff’s laws
Power enters at every generator, exits at every load
Hierarchical structure:
- Primary “power busses”
- Secondary smaller local feeds
[Figure: 10-generator, 39-bus New England system]
Technology to enable a smart grid
15




We’ll need to monitor power loads, frequency,
current in real-time, reliably and securely
Use this data to estimate the state of the grid and
to predict its evolution over time
Use those predictions to plan control actions:
increase/decrease generation, borrow “reactive”
power from neighboring regions, adapt pricing, etc
Ultimately the grid will become a new kind of
network. But must also be safe, efficient, and
secure against both mishaps and even attack!
Even mundane problems can hurt
16


California: Repeated episodes of market manipulation aimed at increasing profits for companies such as Enron that speculate on pricing
Multi-state and multi-national rolling outages
- Cause turmoil for air traffic and ground traffic, plus telephone outages
- Will “smartness” also make the grid more fragile?
- Risk of cyberattacks?

Control of the smart power grid
18


Suppose that a cloud control system speaks with
“two voices”
In physical infrastructure settings, consequences can
be very costly
“Canadian 50KV bus going offline”
Bang!
“Switch on the 50KV Canadian bus”
Power grid summary
19


To make it smart we need to monitor at a massive scale and use that to initiate control actions
But for this to be safe, we need more than fast response and elasticity
- We also need security (so that attackers can’t take the grid down)
- ... and consistency (as we just saw)
- ... and fault-tolerance (since power systems often experience failures of various kinds)
Example 2: mHealth
20


A term for everything outside the doctor’s office
(but might be linked to electronic health records)
Goal is to make your life better and healthier
- Encourage activity
- Discourage poor nutrition choices
- Help patients with chronic conditions manage their complex medical devices and medications
- Offer caregivers a window into health so that the patient can maintain independence
What properties are needed in remote
medical care systems?
21
Motion sensor,
fall-detector
Healthcare provider monitors
large numbers of remote
patients
Medication station
tracks, dispenses pills
Integrated glucose monitor and Insulin pump
receives instructions wirelessly
Cloud Infrastructure
Home healthcare application
Durability... scalability... fast response
22
Mrs. Marsh has been dizzy.
Her stomach is upset and she
hasn’t been eating well, yet
her blood sugars are high.
Let’s stop the oral diabetes medication
and increase her insulin, but we’ll need
to monitor closely for a week
Cloud
Infrastructure
Patient Records DB

Need: Strong consistency and durability for data
What do these terms mean?
23

Consistency: Even if accessed by multiple users concurrently, the data looks like a single database
- This sounds like it should obviously be true, but when the data is spread over multiple computers that don’t coordinate their actions, consistency can easily be violated
- For example, perhaps machine 1 shows updates machine 2 never saw. Perhaps machine 3 sees all the updates but has the order confused. Each of these cases can cause serious inconsistencies.
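
The two failure cases just described (a missed update and a reordered update) can be made concrete with a small sketch. This is illustrative only: Update, Replica and OrderingDemo are hypothetical names invented here, not Isis2 types, and the replicas are plain in-memory objects rather than separate machines.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; none of these types come from Isis2.
public class Update
{
    public readonly string Key;
    public readonly double Value;
    public Update(string key, double value) { Key = key; Value = value; }
}

public class Replica
{
    private readonly Dictionary<string, double> values = new Dictionary<string, double>();

    // Apply updates in the order given; the last write to a key wins.
    public void Apply(IEnumerable<Update> updates)
    {
        foreach (Update u in updates) values[u.Key] = u.Value;
    }

    public double Lookup(string key) { return values[key]; }
}

public static class OrderingDemo
{
    public static void Main()
    {
        var a = new Update("dose", 10.0);   // sent first
        var b = new Update("dose", 12.5);   // sent second

        var r1 = new Replica();
        var r2 = new Replica();
        var r3 = new Replica();

        r1.Apply(new[] { a, b });   // sees both updates in the order sent
        r2.Apply(new[] { a });      // never saw update b
        r3.Apply(new[] { b, a });   // sees both, but in the wrong order

        Console.WriteLine("r1 = " + r1.Lookup("dose"));
        Console.WriteLine("r2 = " + r2.Lookup("dose"));
        Console.WriteLine("r3 = " + r3.Lookup("dose"));
    }
}
```

Only r1 ends up with the latest value (12.5); r2 and r3 both still hold 10, each for a different reason. A client that happens to query r2 or r3 gets an answer that a single database would never give.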
What do these terms mean?
24

Durability: Even if system components crash and then recover later, data will not be lost.
- Updates confuse things: before the update occurs, clearly it isn’t durable
- After the update is finished, it must have durable effect
- Question to pose: exactly when did it need to be durable?
- Usual answer: If the effect of an update survives a crash, then the update itself should also survive the crash
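
One common way to meet the "usual answer" is to log an update to stable storage before making its effect visible. The sketch below shows that idea under stated assumptions: DurableStore and its tab-separated log format are invented for illustration (keys must not contain tabs or newlines), and this is not how Isis2 itself implements durability.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical illustration; DurableStore is not part of Isis2.
public class DurableStore
{
    private readonly string logPath;
    private readonly Dictionary<string, double> values = new Dictionary<string, double>();

    public DurableStore(string logPath)
    {
        this.logPath = logPath;
        if (!File.Exists(logPath)) return;
        // Recovery: replay the log so that every logged update is restored.
        foreach (string line in File.ReadAllLines(logPath))
        {
            string[] parts = line.Split('\t');
            if (parts.Length != 2) continue;   // ignore lines that are not key<TAB>value
            values[parts[0]] = double.Parse(parts[1], CultureInfo.InvariantCulture);
        }
    }

    public void Update(string key, double value)
    {
        // 1. Append the update to the log and push it towards the disk first...
        using (var file = new FileStream(logPath, FileMode.Append, FileAccess.Write))
        using (var writer = new StreamWriter(file))
        {
            writer.WriteLine(key + "\t" + value.ToString("R", CultureInfo.InvariantCulture));
            writer.Flush();
            file.Flush(true);   // ask the OS to write through its cache
        }
        // 2. ...and only then make the effect visible in memory.
        values[key] = value;
    }

    public bool TryLookup(string key, out double value)
    {
        return values.TryGetValue(key, out value);
    }
}

public static class DurabilityDemo
{
    public static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + ".log");
        new DurableStore(path).Update("insulin-units", 12.5);

        // Simulate a crash and restart: a fresh instance recovers from the log.
        var recovered = new DurableStore(path);
        double v;
        Console.WriteLine(recovered.TryLookup("insulin-units", out v) ? "recovered" : "lost");
        File.Delete(path);
    }
}
```

Because the log write comes first, any reader that has seen the new value can rely on the update being replayable after a crash. The cost is a disk write on the critical path of every update.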
Scalability
25



As we make the system larger, performance remains
good
It needs to be able to support large numbers of
clients and run on large numbers of cloud computing
systems
Fast response: Queries shouldn’t delay for long.
Updates should have rapid effect on the data.
Guarantees versus “best effort”
26


Today’s cloud systems work well in all of these ways
but without providing strong guarantees except in
certain very specialized cases, like Google’s new
“Spanner” database
Our challenge: can normal people who aren’t on the
Google Spanner development team also create
trustworthy cloud computing solutions?
mHealth summary
27

The needs of the system vary depending on what part of the system we focus on
- In our example, some aspects need durability in the sense of a logged database update, while others might accept durability through in-memory replication
- This illustrates one of many such tradeoffs

If we had more time we could identify a number of
additional issues of this kind
How The Cloud Was Built
28

It is very hard to create software to run in cloud computing systems
- Everything must be automated
- You must follow many rules and use many packages
- So open source “tools” have become popular


Examples: Hadoop (a version of MapReduce),
Zookeeper, Graphlab, Pregel, Vowpal Wabbit,
global file systems like GFS, etc.
In this short class we will focus on process group tools
and will use Isis2 as our main example.
An obsession with speed...
29


At very large scale, either a thing is extremely fast, or unacceptably slow
So everything we do must be shaped by speed!
High assurance is not an option if the solution would be dramatically slower
- For example, the cloud computing community avoids databases. They founded the NoSQL movement (storage, but with weaker guarantees than a SQL database) for this reason.
Similarly we must have speed in mind at all times!
30
Concept: Critical paths
To understand speed, understand the limiting
factors
This forces us to think about critical paths
What limits responsiveness?
31


Top priority: delay until a client receives a reply
Critical path traces actions that contribute to this
delay
Update the monitoring and alarms
criteria for Mrs. Marsh as follows…
Service instance
Response delay seen by
end-user would include
Internet latencies
Service response
delay
Confirmed
Critical path with complex services?
32

When we replicate information but want to be sure
the data won’t be lost, critical path extends into the
replicas
Update the monitoring and alarms
criteria for Mrs. Marsh as follows…
Service instance
Critical path
Response delay seen by
end-user would include
Internet latencies
Service response
delay
Critical path
Confirmed
Critical path
Why do critical paths matter?
33



When we build complex systems it is hard to
imagine how they will behave when we run them
By thinking about the critical performance-limiting
paths, we can focus our attention on specific
elements and not think about the whole system
By avoiding delays on the critical path, we bring
benefits to the whole system!
There are many critical applications
34

Cloud-hosted system to control transportation (think of Google’s smart cars)
- The cars have autonomy but they depend on data from the cloud and would have a much harder challenge if that data couldn’t be trusted

Banking systems
- Today’s online banking systems are growing, but as that happens, more and more security issues arise

Process control
- Chemical refineries, manufacturing plants, ...
And they come with similar stories
35

In each case we can identify properties that are
- Absolutely needed for a cloud deployment
- Absolutely needed for safety


And beyond that we might have other assurance
properties that a particular use case doesn’t need
The challenge will be to analyze each application,
and then to translate its needs into cloud solutions
36
Segment III: Consistency
We’ll drill down on the tradeoffs between durability
and consistency
Many cloud systems believe that consistency isn’t
possible: CAP theorem
Yet consistency underlies so many other guarantees
Virtual synchrony model
We’re going to drill down…
37

… on data and service replication

Replication is at the center of cloud computing:
- With many replicas a service can handle many clients
- And those replicas need as much of the critical data to be local as possible
- So replication is a key technology. It even underlies security: we need to replicate the policy database and certificates that identify principals (clients, servers, etc)
Consistency for replication
38


There are many ways to replicate information
But it becomes tricky if the data or even the service evolves over time.
- Replication of changing data can leave a confusing mess if a request encounters stale versions.
- In some situations these errors can harm the client.
- In others, they could cause security violations.
What do we mean by consistency?
39
A consistent distributed system will often have many
components, but users observe behavior
indistinguishable from that of a single-component
reference system. Our power system example
illustrated a form of inconsistency
“Canadian 50KV bus going offline”
Bang!
“Switch on the 50KV Canadian bus”
Theory of Consistency
40

There are some famous impossibility results
- Fischer, Lynch and Paterson: The FLP theorem proves that any correct fault-tolerant protocol strong enough to solve “consensus” (a form of agreement) can also wedge in the event of certain sequences of failures. But those sequences turn out to be very rare.
- Brewer’s CAP theorem posits that you can only have two from {Consistency, Availability and Partition Tolerance}. But the proof holds only for a service running in a WAN, not for one in a single data center.
Relate consistency to speed?
41

How costly is strong consistency?
The cloud computing community debates this topic!
- It is a very contemporary question

We usually pose the question in connection to replicating data.
- Strongly consistent data means “guaranteed to be correct and current”. Can cloud systems afford strong consistency?
- Weakly consistent data means “best effort but can have mistakes.” Facebook, eBay, Google all use weak consistency

We will learn more about these topics
42


In today’s lecture we won’t “drill down”
But in lecture 4 we will look more closely at these theoretical questions
- Mathematics is a valuable tool for cloud computing
- By mapping computing ideas onto mathematics we can reason more rigorously
- Yet we will also find that some of the existing theory has limitations of its own
43
Segment IV: Isis2
How does consistency look to the end user?
What is it like to program with a powerful high
assurance library like Isis2?
Isis2 System
44



A prebuilt technology that automates many of the
hard tasks involved in replicating services and the
data on which they depend
Targets cloud computing settings
Available as open source from isis2.codeplex.com
- Intended to be easy to use…
- … but still at an early stage of development
Isis2 System
45







C# library (but callable from any .NET language)
offering replication techniques for cloud computing
developers
Based on a model that fuses virtual synchrony and
state machine replication models
Research challenges center on creating protocols
that function well despite cloud “events”
Elasticity (sudden scale changes)
Potentially heavy loads
High node failure rates
Concurrent (multithreaded) apps




Long scheduling delays, resource contention
Bursts of message loss
Need for very rapid response times
Community skeptical of “assurance properties”
Isis2 makes developer’s life easier
46
Benefits of Using Formal model



Formal model permits us to
achieve correctness
Isis2 is too complex to use
formal methods as a
development tool, but the model does
facilitate debugging (model
checking)
Think of Isis2 as a collection
of modules, each with
rigorously stated properties
Importance of Sound Engineering




Isis2 implementation needs
to be fast, lean, easy to use
Developer must see it as
easier to use Isis2 than to
build from scratch
Seek great performance
under “cloudy conditions”
Forced to anticipate many
styles of use
Isis2 makes developer’s life easier
47
// UPDATE and LOOKUP are application-chosen request codes. The slide does not
// show their definitions, so the values below are assumed.
const int UPDATE = 0, LOOKUP = 1;

Group g = new Group("myGroup");
Dictionary<string,double> Values = new Dictionary<string,double>();
g.ViewHandlers += delegate(View v) {
    Console.Title = "myGroup members: " + v.members;
};
g.Handlers[UPDATE] += delegate(string s, double v) {
    Values[s] = v;
};
g.Handlers[LOOKUP] += delegate(string s) {
    g.Reply(Values[s]);
};
g.Join();

g.Send(UPDATE, "Harry", 20.75);

List<double> resultlist = new List<double>();
var nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

First sets up group
Join makes this entity a member. State transfer isn’t shown
Then can multicast, query. Runtime callbacks to the “delegates” as events arrive
Easy to request security (g.SetSecure), persistence
“Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make
Concept: A “multi-query”
52

Our lookup is
- Multicast to the group
- All members respond

A chance for parallelism
- Each can do part of the job: e.g. search 1/nth of a database
- Reduces response delays
- With n replicas... we get an n times speedup!

[Figure: a front end multicasts “Lookup ‘Harry’ in the Ithaca phone directory” to the group; members reply “Names with Harry in them: ....”]
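
The division of labour in a multi-query can be sketched without any group-communication library. In the illustration below each "member" is just a task in one process, and MultiQueryDemo, SearchShare and MultiQuery are hypothetical names invented here; a real Isis2 service would use the group's Query call instead. The n-times speedup is the ideal case, assuming evenly split work and negligible coordination cost.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical illustration; tasks stand in for group members.
public static class MultiQueryDemo
{
    // Member `rank` (0..n-1) searches every n-th entry, starting at its own rank,
    // so the n members together cover the directory exactly once.
    public static List<string> SearchShare(IReadOnlyList<string> directory, string term, int rank, int n)
    {
        var hits = new List<string>();
        for (int i = rank; i < directory.Count; i += n)
            if (directory[i].Contains(term)) hits.Add(directory[i]);
        return hits;
    }

    // The "front end": fan the query out to all members, wait, merge the replies.
    public static List<string> MultiQuery(IReadOnlyList<string> directory, string term, int n)
    {
        Task<List<string>>[] replies = Enumerable.Range(0, n)
            .Select(rank => Task.Run(() => SearchShare(directory, term, rank, n)))
            .ToArray();
        Task.WaitAll(replies);
        return replies.SelectMany(t => t.Result)
                      .OrderBy(s => s, StringComparer.Ordinal)
                      .ToList();
    }

    public static void Main()
    {
        var directory = new List<string>
        {
            "Harry Potter", "Sally Brown", "Harrison Ford", "Harry Smith", "Ken Birman"
        };
        foreach (string name in MultiQuery(directory, "Harry", 3))
            Console.WriteLine(name);
    }
}
```

Splitting by rank means no coordination is needed at query time: each member can compute its share from its own position in the group and the group size alone.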
Our example was overly simple
53

It didn’t show the “state transfer” code
- Corresponds to the “white arrows” in time-line figure
- In Isis2 we have a way to make checkpoints
- State transfer: Some active member makes a checkpoint, and the joiner loads the state from it.
- The code looks like other operations in our example

[Figure: time-line of group members p, q, r, s, t; time axis from 0 to 70]

Checkpoints can also be used to save group state during periods when all members are inactive
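
The checkpoint idea can be sketched independently of Isis2: an active member serializes its state, and the joiner rebuilds an identical copy from those bytes. The names below (CheckpointDemo, MakeCheckpoint, LoadCheckpoint) and the byte layout are assumptions made for illustration; the Isis2 checkpoint API itself is not shown on the slides and is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration of state transfer via a checkpoint.
public static class CheckpointDemo
{
    // An active member writes its state as a count followed by (key, value) pairs.
    public static byte[] MakeCheckpoint(Dictionary<string, double> state)
    {
        using (var buffer = new MemoryStream())
        using (var writer = new BinaryWriter(buffer))
        {
            writer.Write(state.Count);
            foreach (KeyValuePair<string, double> kv in state)
            {
                writer.Write(kv.Key);
                writer.Write(kv.Value);
            }
            writer.Flush();
            return buffer.ToArray();
        }
    }

    // The joiner rebuilds the same state from the checkpoint bytes.
    public static Dictionary<string, double> LoadCheckpoint(byte[] checkpoint)
    {
        var state = new Dictionary<string, double>();
        using (var reader = new BinaryReader(new MemoryStream(checkpoint)))
        {
            int count = reader.ReadInt32();
            for (int i = 0; i < count; i++)
            {
                string key = reader.ReadString();
                state[key] = reader.ReadDouble();
            }
        }
        return state;
    }

    public static void Main()
    {
        var member = new Dictionary<string, double> { { "Harry", 20.75 }, { "Sally", 3.5 } };
        Dictionary<string, double> joiner = LoadCheckpoint(MakeCheckpoint(member));
        Console.WriteLine(joiner.Count == member.Count ? "state transferred" : "mismatch");
    }
}
```

The same bytes could equally be written to a file, which is how a checkpoint can preserve group state across a period when no member is running.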
Adding security: Just one line!
54
Group g = new Group("myGroup");
Dictionary<string,double> Values = new Dictionary<string,double>();
g.ViewHandlers += delegate(View v) {
    Console.Title = "myGroup members: " + v.members;
};
g.Handlers[UPDATE] += delegate(string s, double v) {
    Values[s] = v;
};
g.Handlers[LOOKUP] += delegate(string s) {
    g.Reply(Values[s]);
};
g.SetSecure(myKey);    // the one added line
g.Join();

g.Send(UPDATE, "Harry", 20.75);

List<double> resultlist = new List<double>();
var nr = g.Query(ALL, LOOKUP, "Harry", EOL, resultlist);

First sets up group
Join makes this entity a member. State transfer isn’t shown
Then can multicast, query. Runtime callbacks to the “delegates” as events arrive
Easy to request security, persistence, tunnelling on TCP...
“Consistency” model dictates the ordering seen for event upcalls and the assumptions user can make
Some uses for process groups
55



To replicate data
maintained by the
members in memory
To replicate actions
taken on an external
service such as a
replicated database
To ensure that all
replicas are configured
the same way



To coordinate the
processing of requests
and load-balance
To offer a way to
parallelize processing
by having each group
member do part of the
work
Fault-tolerance via a
backup scheme
Isis2 Summary
56



A library that you can invoke from a normal program
written in a normal way
It does the work of creating groups and sending
multicasts and ensuring that the consistency model will
be enforced
The developer just tells it what to do.
She thinks about a parallel distributed application.
 Virtual synchrony eliminates many hard problems

Why not build it yourself from scratch?
SafeSend and Send are two of the protocol components hosted
over what we call the large-scale properties sandbox. The sandbox
addresses issues like flow control, security, etc. All protocols share
and benefit from those properties
57
The sandbox itself is mostly composed of “convergent” protocols that use probabilistic methods

[Figure: Isis2 architecture diagram. Labels in the figure: Isis2 user objects; other group members; Membership Oracle; Isis2 library (Send, CausalSend, OrderedSend, SafeSend, Query....); group instances and multicast protocols; Flow Control; Group Membership; Reliable Sending; Large Group Layer; Fragmentation; Dr. Multicast; Sense Runtime Environment; Message Library; Group Security; Platform Security; Socket Mgt/Send/Rcv; “Wrapped” locks; TCP tunnels (overlay); Report suspected failures; Views; Oracle Membership; Self-stabilizing Bootstrap Protocol; Bounded Buffers]

These systems are complex, especially if you want to run on platforms like EC2

By using Isis2 you “inherit” 30 years of research on how to make it work
Why focus on Isis2?
58


This is a good question to ask
In fact we could focus on any of a number of other technologies, including other multicast products
- Such as Spread, JGroups, C-Ensemble...
But Isis2 is open source and specifically designed for cloud settings. (Also, Ken built it!)
- So since our class is short, we will look at Isis2 examples
59
Segment V: Performance
Can Isis2 applications achieve the kinds of
scalable performance and elasticity required in
large cloud deployments?
Revisit our notion of consistency
60


Let’s look again at our mHealth example
We want the best possible performance but we also want to be sure that the application is “safe” for this kind of use
- We need consistency, yet also need snappy response and elasticity, especially in the monitoring component
- After all, it continuously monitors huge numbers of patients.
- What limits scalability?
Speed of updates
61

Isis2 offers many ways to do updates
- RawSend, Send, CausalSend, OrderedSend, SafeSend
- Each has different consistency / durability guarantees

As a developer, you’ll want to use the fastest option that is still safe in your setting
- ... Hence you will need to understand how each works
- ... and how fast each solution will be

Today we’ll just look at this superficially
Example: Speed of updates
62

Isis2 offers several ways to do updates (we will visit
them more carefully later)

They have big performance implications

But speed can have more than one definition!
Isis2: Send vs. in-memory SafeSend
63
Send scales best, but SafeSend with
in-memory (rather than disk) logging and small
numbers of acceptors isn’t terrible.
Latency versus ops/second
64

Latency: Delay before external user sees action

Ops/second: total throughput
- For most purposes systems “like” Isis2 offer basic performance of about 1000 ops/second
- But by grouping requests into batches of ~50 requests each, services that can support ~50,000 ops/second are feasible
- Building them is challenging, but we won’t focus on that engineering topic in these lectures
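
The batching arithmetic above (about 1000 multicasts per second, each carrying about 50 requests) can be sketched as follows. BatchingDemo, Batch and RequestsPerSecond are hypothetical helpers written for this illustration, and the throughput figure is the idealised product: it ignores the extra latency each request spends waiting for its batch to fill.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of request batching.
public static class BatchingDemo
{
    // Split a stream of requests into batches of at most batchSize items.
    public static List<List<T>> Batch<T>(IEnumerable<T> requests, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        var batches = new List<List<T>>();
        var current = new List<T>(batchSize);
        foreach (T r in requests)
        {
            current.Add(r);
            if (current.Count == batchSize)
            {
                batches.Add(current);
                current = new List<T>(batchSize);
            }
        }
        if (current.Count > 0) batches.Add(current);   // final, partly filled batch
        return batches;
    }

    // Idealised throughput when every multicast carries one full batch.
    public static int RequestsPerSecond(int multicastsPerSecond, int batchSize)
    {
        return multicastsPerSecond * batchSize;
    }

    public static void Main()
    {
        var requests = new List<int>();
        for (int i = 0; i < 120; i++) requests.Add(i);

        Console.WriteLine(Batch(requests, 50).Count);     // 3 batches: 50 + 50 + 20
        Console.WriteLine(RequestsPerSecond(1000, 50));   // 50000
    }
}
```

This is also why latency and throughput are separate measures: batching raises ops/second, but an individual request may wait longer before its batch is sent.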
Jitter: how “steady” are latencies?
65
The “spread” of latencies is much
better (tighter) with Send: the 2-phase
SafeSend protocol is sensitive to
scheduling delays
Cornell (Birman): No distribution restrictions.
Flush delay as function of shard size
66
Flush is fairly fast if we only wait for
acks from 3-5 members, but slow
if we wait for all members.
Isis2 lets developer set the threshold.
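
Why the threshold matters can be seen from a simple model: if each member acknowledges after its own delay, then waiting for k acknowledgements takes as long as the k-th fastest member. The sketch below uses made-up delays, and FlushDemo.WaitForAcks is a hypothetical helper, not the Isis2 Flush call.

```csharp
using System;
using System.Linq;

// Hypothetical model of waiting for a threshold number of acknowledgements.
public static class FlushDemo
{
    // Time until `threshold` acks have arrived, given each member's ack delay (ms).
    public static double WaitForAcks(double[] ackDelaysMs, int threshold)
    {
        if (threshold < 1 || threshold > ackDelaysMs.Length)
            throw new ArgumentOutOfRangeException(nameof(threshold));
        return ackDelaysMs.OrderBy(d => d).ElementAt(threshold - 1);
    }

    public static void Main()
    {
        // Made-up delays: most members answer quickly, one is stalled.
        double[] delays = { 1.2, 0.9, 1.5, 1.1, 40.0, 1.3 };

        Console.WriteLine(WaitForAcks(delays, 3));               // third-fastest ack
        Console.WriteLine(WaitForAcks(delays, delays.Length));   // slowest ack
    }
}
```

With these sample numbers a threshold of 3 completes after 1.2 ms, while waiting for all six members takes 40 ms. A single slow member dominates the all-members case, and the chance of having at least one slow member grows with shard size.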
So I want Send+Flush, right?
67

The problem is that the different solutions offer different guarantees
- The fastest solutions have weaker guarantees
- Using them safely involves understanding these properties in order to decide whether they are good enough for the desired purpose

But there are subtle issues we don’t have time to discuss in today’s lecture. We will revisit them tomorrow.
Raw speed isn’t the whole story!
68



When building a system such as this we need to
look at performance but also at steady behavior
Here’s an example of a problem we ran into when
doing the experiments I just showed you
As we’ll see, Isis2 had an instability. We think we’ve
fixed it but it illustrates an important point
The experiment we did
69





We made a timeline picture from left to right
One node (the bottom one) sends multicasts
The others log the time of receipt
We graphed the delay, sorted from slowest (top) to
fastest (bottom) delays
Here’s what we saw
Debugging: Stabilization bug
70
Birman: DARPA MRC Kickoff, Washington, Nov 3-4 2011
As the application ran, it slowed down!
71

At first the system was fast: even the slowest nodes
at the top had short delays

But within a few multicasts they slowed down

Then something “resets” them and they speed up
- We tracked it down to a problem with garbage collection in our system
- Modifying that protocol helped smooth things out
Debugging : Stabilization bug fixed
72
Debugging : 358-node run slowdown
358-node run slowdown: Zoom in
358-node run slowdown: Filter
Summary of insights from example?
76



Tools like Isis2 enable us to build cloud-scale
replication based services with strong guarantees
But today, at least, they demand a lot from the
developer, who needs to really understand the
choices and their implications
As Isis2 evolves, this problem will be reduced: the
system will eventually automate many decisions,
including picking the right update primitives for you
77
Segment VI: Conclusions
We’ve scratched the surface but there is much
more to be explored
Cornell’s high assurance researchers are creating
solutions for tomorrow’s demanding applications
Key take-away points
78



Cloud computing, today, isn’t very friendly to high
assurance applications
This is a problem because those applications are
increasingly forced to migrate to the cloud for reasons
of cost, scalability or just because the cloud is the
dominant paradigm today
But we can already use tools like Isis2 to solve these
problems and as they become easier to work with, the
community able to build these solutions will grow
Key take-away points
79

With Isis2 we can easily create programs that run on cloud platforms like EC2 or even Android mobile devices
- They form into groups and coordinate or replicate data or actions via group primitives
- The concept is powerful and easily visualized


But tuning and doing sophisticated fault-tolerance
remains challenging.
In the remaining lectures we will explore these issues
The last word...
80


The word on the street is that cloud
computing will rule but that the cloud
can’t do high assurance
But the word in the hallways at Cornell differs!
- We see Isis2 as our proof-by-demonstration that it can be done
- Even so, the engineering challenge remains enormous
Learning more
81

Stay in the class. We’ll show you how!

Download the Isis2 system from isis2.codeplex.com
- You can access the user’s manual
- The code itself (currently v2.xxx, a very stable release)
- And we maintain a discussion and issues board there
Learning more
82

My textbook covers this topic in depth
“Guide to Reliable Distributed Systems: Building High-Assurance Applications and Cloud-Hosted Services”
Ken Birman. Springer Verlag, February 2012

A paper focused entirely on today’s topic is:
Overcoming CAP with Consistent Soft-State Replication. Kenneth P. Birman, D. Freedman,
Q. Huang and Patrick Dowell. IEEE Computer Magazine (special issue on “The Growing
Impact of the CAP Theorem”). Volume 45. pp. 50-58. February 2012.
You can download a copy from:
http://www.cs.cornell.edu/projects/quicksilver/pubs.html