Ensemble and Beyond
Presentation to David Tennenhouse, DARPA ITO
Ken Birman
Dept. of Computer Science
Cornell University
Quick Timeline
• Cornell has developed 3 generations of
reliable group communication technology
– Isis Toolkit: 1987-1990
– Horus System: 1990-1994
– Ensemble System: 1994-1999
• Today starting a major shift in emphasis
– Spinglass Project: 1999-
Questions to consider
• Have these projects been successful?
• What is the future of Ensemble if we move
to a new and different focus?
• What is the nature of the new opportunity we now perceive?
Timeline
[Diagram: Isis → Horus → Ensemble timeline, with Isis highlighted]
• Introduced reliability into group computing
• Virtual synchrony execution model
• Fairly elaborate, monolithic, but adequate speed
• Many transition successes
•New York, Swiss Stock Exchanges
•French Air Traffic Control console system
•Southwestern Bell Telephone network mgt.
•Hiper-D (next generation AEGIS)
Virtual Synchrony Model
[Diagram: group membership timeline for processes p, q, r, s, t.
View G0={p,q}; r and s request to join and are added with state transfer, giving G1={p,q,r,s}; p crashes, giving G2={q,r,s}; t requests to join and is added with state transfer, giving G3={q,r,s,t}.]
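As a minimal illustration (not Isis/Horus/Ensemble code), the view sequence in the diagram can be written down as data; virtual synchrony's central promise is that members surviving from one view into the next agree on the membership change and on the multicasts delivered in the old view:

```python
# Toy sketch of the view sequence above; the data structures are invented for
# illustration and are not the Isis/Horus/Ensemble APIs.
views = [
    {"p", "q"},                # G0
    {"p", "q", "r", "s"},      # G1: r, s added, state transfer
    {"q", "r", "s"},           # G2: p crashed
    {"q", "r", "s", "t"},      # G3: t added, state transfer
]

for old, new in zip(views, views[1:]):
    joined, left = new - old, old - new
    survivors = old & new
    # Virtual synchrony: every survivor installs the same next view and has
    # delivered the same multicasts while the old view was installed.
    print(f"{sorted(survivors)} move to {sorted(new)} "
          f"(joined: {sorted(joined)}, left: {sorted(left)})")
```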
Why a “model”?
• Models can be reduced to theory: we can prove the properties of the model and decide whether a protocol achieves them
• Enables rigorous application-level reasoning
• Otherwise, the application must guess at
possible misbehaviors and somehow
overcome them
French ATC system (simplified)
[Diagram: controller workstations connected to onboard radar, an X.500 directory, and an air traffic database (flight plans, etc.)]
A center contains...
• Perhaps 50 “teams” of 3-5 controllers each
• Each team supported by workstation cluster
• Cluster-style database server has flight plan
information
• Radar server distributes real-time updates
• Connections to other control centers (40 or
so in all of Europe, for example)
Process groups arise here (sketched below):
• Cluster of servers running critical database
server programs
• Clusters of controller workstations, each supporting a team of controllers
• Radar must send updates to the relevant
group of control consoles
• Flight plan updates must be distributed to
the “downstream” control centers
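A hedged sketch of how these subsystems might map onto named process groups; the group names and the join_group() helper below are hypothetical stand-ins, not the API used in the deployed system:

```python
# Hypothetical illustration only: group names and join_group() are invented.
def join_group(name):
    """Stand-in for the real group-join operation of a group communication toolkit."""
    print("joining", name)
    return name

center, team_id = "center-A", 7

groups = [
    join_group(f"{center}/flight-plan-db"),        # replicated database servers
    join_group(f"{center}/team-{team_id}"),        # one team's controller consoles
    join_group(f"{center}/radar/team-{team_id}"),  # radar updates to that team's consoles
    join_group("europe/flight-plan-updates"),      # distribution to downstream centers
]
```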
Use of our model?
• French government knows requirements for
safety in ATC application
• With our model, we can reduce those requirements to a formal set of statements
• This lets us establish that our solution will
really be safe in their setting
• Contrast with usual ad-hoc methodologies...
Timeline
[Diagram: Isis → Horus → Ensemble timeline, with Horus highlighted]
• Simpler, faster group communication system
• Uses a modular layered architecture. Layers are
“compiled,” headers compressed for speed
• Supports dynamic adaptation and real-time apps
• Partitionable version of virtual synchrony
• Transitioned primarily through Stratus Computer
•Phoenix system
•Basis of Stratus fault-tolerance proposal to OMG
Layered Microprotocols in Horus
Interface to Horus is extremely flexible
Horus manages the group abstraction; group semantics (membership, actions, events) are defined by a stack of modules
Ensemble stacks plug-and-play modules to give design flexibility to the developer
[Diagram: example protocol stack, top to bottom: ftol, vsync, filter, encrypt, sign]
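As a rough rendering of the layering idea (the interfaces below are invented for illustration and are not the Horus or Ensemble module interfaces), a stack can be treated as an ordered list of layers, each transforming messages on the way down and undoing that transformation on the way up:

```python
# Toy rendering of the "stack of micro-protocol layers" idea. Layer names match
# the diagram; the interfaces are invented and are not the Horus/Ensemble APIs.
class Layer:
    def down(self, msg): return msg          # transform on the send path
    def up(self, msg): return msg            # inverse transform on delivery

class Encrypt(Layer):
    def down(self, msg): return "enc(" + msg + ")"
    def up(self, msg): return msg[4:-1]

class Vsync(Layer):
    pass    # a real layer would order/buffer messages around view changes

class Ftol(Layer):
    pass    # a real layer would detect and retransmit lost messages

def compose(layers):
    """Sends run top-to-bottom through the stack; deliveries run bottom-to-top."""
    def send(msg):
        for layer in layers:
            msg = layer.down(msg)
        return msg
    def deliver(msg):
        for layer in reversed(layers):
            msg = layer.up(msg)
        return msg
    return send, deliver

# Plug-and-play: choose only the properties the application needs.
send, deliver = compose([Ftol(), Vsync(), Encrypt()])
assert deliver(send("hello")) == "hello"
```

Swapping layers in or out changes the group's guarantees without changing the application code above the stack.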
Processes Communicate Through
Identical Multicast Protocol Stacks
[Diagram: three processes, each running the identical stack ftol / vsync / encrypt, exchanging multicasts]
Superimposed Groups in Application
With Multiple Subsystems
[Diagram: processes participating in two superimposed groups, each with its own ftol / vsync / encrypt stack: a magenta group for video communication and an orange group for control and coordination]
Timeline
[Diagram: Isis → Horus → Ensemble timeline, with Ensemble highlighted]
• Horus-like stacking architecture, equally fast
• Includes an innovative group-key mechanism
for secure group multicast and key management
• Uses a high-level language and can be formally proved correct, an unexpected and major success
• Many early transition successes
•SC-21, Quorum via collaboration with BBN
•Nortel, STC: potential commercial users
•Discussions with MS (COM+), Sun
(RMI.next): could be basis of standards.
Proving Ensemble Correct
• Unlike Isis and Horus, Ensemble is coded in
a language with strong semantics (ML)
• So we took a spec. of virtual synchrony
from MIT’s IOA group (Nancy Lynch)
• And are actually able to prove that our code
implements the spec. and that the spec
captures the virtual synchrony property!
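For context, the notion of "implements" used in the I/O automaton framework is trace inclusion, usually established via a simulation relation from the code to the specification; in outline:

```latex
% Background: the standard I/O-automaton notion of implementation (trace inclusion),
% with A the protocol automaton and S the virtual synchrony specification.
A \text{ implements } S \;\iff\; \mathrm{traces}(A) \subseteq \mathrm{traces}(S)
```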
What Next?
• Continue some work with Ensemble
– Keep it alive, support and extend it
– Play an active role in transition
– Assist standards efforts
• But shift in focus to a completely new effort
– Emphasize adaptive behavior, extreme scalability,
robustness against local disruption
– Fits “Intrinsically Survivable Systems” initiative
Throughput Stability: Achilles
Heel of Group Multicast
• When scaled to even modest environments,
overheads of virtual synchrony become a
problem
– One serious challenge involves management of
group membership information
– But multicast throughput also becomes unstable at high data rates and large system sizes
• Stability of protocols like SRM unknown
Stock Exchange Problem:
Vsync. multicast is too “fragile”
Most members are healthy… but one is slow
[Figure 1: Multicast throughput in an 8-member group perturbed by transient failures. “Effect of Perturbation”: throughput (msgs/sec, 0-250) versus amount perturbed (0.1-0.9); curves labeled Virtual Synchrony Protocol, Pbcast Protocol, Ideal, and Actual.]
Bimodal Multicast in Spinglass
• A new family of protocols with stable
throughput, extremely scalable, fixed and
low overhead per process and per message
• Gives tunable probabilistic guarantees
• Includes a membership protocol and a
multicast protocol
• Requires some very weak QoS assumptions
Start by using unreliable multicast to rapidly
distribute the message. But some messages
may not get through, and some processes may
be faulty. So initial state involves partial
distribution of multicast(s)
Periodically (e.g. every 100ms) each process
sends a digest describing its state to some
randomly selected group member. The digest
identifies messages. It doesn’t include them.
Recipient checks the gossip digest against its
own history and solicits a copy of any missing
message from the process that sent the gossip
Processes respond to solicitations received
during a round of gossip by retransmitting the
requested message. The round lasts much longer
than a typical RPC time.
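The four steps above can be pictured with a toy, single-threaded simulation. Everything below (loss rate, round count, the Process class) is an assumption chosen for illustration; it is not the Spinglass implementation:

```python
# Toy sketch of the steps described above; not the Spinglass implementation.
import random

N = 50              # group size (assumed)
DROP = 0.3          # assumed loss rate of the initial unreliable multicast
ROUNDS = 10         # gossip rounds to run

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.delivered = set()          # ids of messages this process holds

def initial_multicast(procs, msg_id):
    """Step 1: unreliable multicast; some processes miss the message."""
    for p in procs:
        if random.random() > DROP:
            p.delivered.add(msg_id)

def gossip_round(procs):
    """Steps 2-4: each process sends a digest (message ids only) to one random
    peer; the peer checks the digest against its own history, solicits what it
    is missing, and the sender retransmits it."""
    for p in procs:
        peer = random.choice([q for q in procs if q is not p])
        missing = p.delivered - peer.delivered   # peer's solicitation
        peer.delivered |= missing                # retransmission from p

if __name__ == "__main__":
    random.seed(1)
    procs = [Process(i) for i in range(N)]
    initial_multicast(procs, msg_id=0)
    print("after multicast:", sum(0 in p.delivered for p in procs), "of", N)
    for _ in range(ROUNDS):
        gossip_round(procs)
    print("after gossip:   ", sum(0 in p.delivered for p in procs), "of", N)
```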
Scalability of Pbcast reliability
[Figure 5: Graphs of analytical results. Panels: “Pbcast bimodal delivery distribution” (p{#processes=k} versus number of processes to deliver pbcast); “Effects of fanout on reliability” (P{failure} versus fanout); P{failure} versus #processes in system for Predicate I and Predicate II; “Fanout required for a specified reliability” (fanout versus #processes in system, for Predicate I at 1E-8 reliability and Predicate II at 1E-12 reliability).]
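To visualize the "bimodal" shape without reproducing the analytical model, a Monte Carlo run of a similar toy protocol suffices; the parameters below (an exaggerated 95% initial loss rate, fanout 2) are assumptions picked only so that both modes are visible:

```python
# Illustrative Monte Carlo sketch, not the analytical model behind Figure 5.
import random
from collections import Counter

N, DROP, ROUNDS, FANOUT, TRIALS = 50, 0.95, 6, 2, 2000

def run_once():
    # Stage 1: unreliable multicast with (exaggerated) heavy loss.
    delivered = [random.random() > DROP for _ in range(N)]
    # Stages 2-4, collapsed into a push step: each process holding the message
    # gossips it to FANOUT random peers per round.
    for _ in range(ROUNDS):
        for p in range(N):
            if delivered[p]:
                for q in random.sample([i for i in range(N) if i != p], FANOUT):
                    delivered[q] = True
    return sum(delivered)

random.seed(1)
counts = Counter(run_once() for _ in range(TRIALS))
# Expect almost all mass near 0 (the multicast never took hold) or near N
# (gossip pushed it to everyone), with very little in between: the bimodal shape.
for k in sorted(counts):
    print(f"{k:3d} processes delivered: {counts[k] / TRIALS:.3f}")
```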
Throughput measured at an unperturbed process
[Figure: “High Bandwidth measurements with varying numbers of sleepers”. Throughput (0-200) versus probability of sleep event (0.05-0.95); curves: Traditional and Pbcast, each with 1, 3, and 5 sleepers.]
Spinglass:
Summary of objectives
• Radically different approach yields stable,
scalable protocols with steady throughput
• Small footprint, tunable to match conditions
• Completely asynchronous, hence demands
new style of application development
• But opens the door to a new lightweight
reliability technology supporting large
autonomous environments that adapt