CS514: Intermediate Course in Operating Systems
Professor Ken Birman
Ben Atkin: TA
Lecture 21, Nov. 7
Bimodal Multicast
• Work is joint:
– Ken Birman, Mark Hayden, Oznur
Ozkasap, Zhen Xiao, Mihai Budiu,
Yaron Minsky
• Paper appeared in ACM
Transactions on Computer
Systems
– May, 1999
Reliable Multicast
• Family of protocols for 1-n
communication
• Increasingly important in modern
networks
• “Reliability” tends to be at one of
two extremes on the spectrum:
– Very strong guarantees, for example to
replicate a controller position in an ATC
system
– Best effort guarantees, for applications
like Internet conferencing or radio
Reliable Multicast
• Common to try to reduce problems
such as these to theory
• Issue is to specify the goal of the
multicast protocol and then
demonstrate that a particular
implementation satisfies its spec.
• Each form of reliability has its own
specification and its own pros and
cons
Classical “multicast”
problems
• Related to “coordinated attack,”
“consensus”
• The two “schools” of reliable multicast
– Virtual synchrony model (Isis, Horus,
Ensemble)
• All or nothing message delivery with ordering
• Membership managed on behalf of group
• State transfer to joining member
– Scalable reliable multicast
• Local repair of problems but no end-to-end
guarantees
Virtual Synchrony
Model
[Figure: timeline of processes p, q, r, s, and t moving through group
views G0={p,q}, G1={p,q,r,s}, G2={q,r,s}, and G3={q,r,s,t}, as r and s
request to join (added, with state transfer), p crashes, and t requests
to join (added, with state transfer).]
... to date, the only widely adopted model for consistency and
fault-tolerance in highly available networked applications
Virtual Synchrony Tools
• Various forms of replication:
– Replicated data, replicate an object,
state transfer for starting new replicas...
– 1-many event streams (“network news”)
• Load-balanced and fault-tolerant
request execution
• Management of groups of nodes or
machines in a network setting
ATC system (simplified)
[Figure: controllers connected to the onboard radar, an X.500 directory,
and the air traffic database (flight plans, etc).]
Roles for reliable
multicast?
• Replicate the state of controller
consoles so that a group of
consoles can support a team of
controllers fault-tolerantly
Roles for reliable
multicast?
• Distribute routine updates from the
radar system, every 3-5 seconds
(two kinds: images and digital
tracking info)
Roles for reliable
multicast?
• Distribute flight plan updates
and decisions rarely (very
critical information)
Other settings that use
reliable multicast
• Basis of several stock
exchanges
• Could be used to implement
enterprise database systems
and wide-area network services
• Could be the basis of web
caching tools
But vsync protocols
don’t scale well
• For small systems, performance is
great.
• For large systems we can
demonstrate very good performance…
but only if applications “cooperate”
• In large systems, under heavy load,
with applications that may act a
little flaky, performance is very
hard to maintain
Stock Exchange Problem:
Vsync. multicast is too “fragile”
Most members are
healthy….
… but one is slow
[Figure 1: average throughput on nonperturbed members for virtually
synchronous Ensemble multicast protocols, at group sizes 32, 64, and 96,
as the perturb rate varies from 0 to 0.9.]
Throughput as one member of a multicast group is "perturbed" by forcing it to
sleep for varying amounts of time.
Why does this happen?
• Superficially, because data for the
slow process piles up in the sender’s
buffer, causing flow control to kick in
(prematurely)
• But why does the problem grow worse
as a function of group size, with just
one “red” process?
• Our theory is that small perturbations
happen all the time (but more study is
needed)
Dilemma confronting
developers
• Application is extremely critical: stock
market, air traffic control, medical system
• Hence need a strong model, guarantees
– SRM, RMTP, etc: model is too weak
– Anyhow, we’ve seen that they don’t scale either!
• But these applications often have a soft-realtime subsystem
– Steady data generation
– May need to deliver over a large scale
Today we introduce a new
design point
• Bimodal multicast is reliable in a
sense that can be formalized, at least
for some networks
– Generalization for larger class of networks
should be possible but maybe not easy
• Protocol is also very stable under
steady load even if 25% of processes
are perturbed
• Scalable in much the same way as
SRM
Bimodal Attack
Consensus is
impossible!
Attack!
• General commands
soldiers
• If almost all attack,
victory is certain
• If very few attack,
they will perish but the
Empire survives
• If many attack, the
Empire is lost
Worse and worse...
• General sends orders via couriers,
who might perish or get lost in the
forest (no spies or traitors)
• Some of the armies are sick with flu.
They won’t fight and may not even
have healthy couriers to participate
in protocol
(Message omission, crash and timing
faults.)
Properties Desired
• Bimodal distribution of multicast
delivery
• Ordering: FIFO on sender basis
• Stability: lets us garbage collect
without violating bimodal
distribution
• Detection of lost messages
More desired properties
• Formal model allows us to
reason about how the primitive
will behave in real systems
• Scalable protocol: costs
(background messages,
throughput) do not degrade as a
function of system size
• Extremely stable throughput
Naming our new
protocol
• Bimodal doesn’t lend itself to
traditional contractions (bcast is
taken, bicast and bimcast sound
pretty awful)
• Focus on probabilistic nature of
guarantees
• Call it pbcast
• Historical footnote: these are
multicast protocols but everyone uses
“bcast” in names
Environment
• Will assume that most links have
known throughput and loss
properties
• Also assume that most processes
are responsive to messages in
bounded time
• But can tolerate some flaky links
and some crashed or slow
processes.
Start by using unreliable multicast to rapidly
distribute the message. But some messages
may not get through, and some processes may
be faulty. So initial state involves partial
distribution of multicast(s)
Periodically (e.g. every 100ms) each process
sends a digest describing its state to some
randomly selected group member. The digest
identifies messages. It doesn’t include them.
Recipient checks the gossip digest against its
own history and solicits a copy of any missing
message from the process that sent the gossip
Processes respond to solicitations received
during a round of gossip by retransmitting the
requested message. The round lasts much longer
than a typical RPC time.
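The digest/solicit/retransmit exchange described above can be sketched in a few lines of Python. This is a toy in-memory model with illustrative names, not the actual Spinglass code:

```python
import random

class PbcastEndpoint:
    """One pbcast process in a toy in-memory model.
    `history` maps message ids to message bodies received so far."""

    def __init__(self, pid):
        self.pid = pid
        self.history = {}

    def digest(self):
        # The gossip digest identifies messages; it never carries bodies.
        return set(self.history)

    def missing_from(self, peer_digest):
        # Compare a gossiped digest against our own history.
        return peer_digest - set(self.history)

    def retransmit(self, wanted):
        # Answer a solicitation with copies of the requested messages.
        return {m: self.history[m] for m in wanted if m in self.history}

def gossip_round(procs):
    """Each process sends its digest to one randomly selected member; the
    recipient solicits any missing messages and the gossiper retransmits."""
    for p in procs:
        peer = random.choice([q for q in procs if q is not p])
        wanted = peer.missing_from(p.digest())   # recipient checks the digest
        peer.history.update(p.retransmit(wanted))
```

After an unreliable multicast leaves some members without the message, repeated rounds of this exchange spread it to the rest with probability that rises exponentially in the number of rounds.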
Delivery? Garbage
Collection?
• Deliver a message when it is in FIFO
order
• Garbage collect a message when
you believe that no “healthy”
process could still need a copy (we
used to wait 10 rounds, but now are
using gossip to detect this condition)
• Match parameters to intended
environment
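A minimal sketch of the per-sender FIFO delivery rule (garbage collection omitted; names are illustrative):

```python
class FifoDeliverer:
    """Buffers received pbcast messages and delivers them in FIFO order
    on a per-sender basis."""

    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number to deliver
        self.pending = {}     # (sender, seq) -> buffered message
        self.delivered = []   # messages handed to the application, in order

    def receive(self, sender, seq, msg):
        self.pending[(sender, seq)] = msg
        # Deliver for as long as the next in-order message is on hand.
        n = self.next_seq.setdefault(sender, 0)
        while (sender, n) in self.pending:
            self.delivered.append(self.pending.pop((sender, n)))
            n += 1
        self.next_seq[sender] = n
```

A message retrieved late through gossip simply waits in `pending` until the gap before it is filled.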
Need to bound costs
• Worries:
– Someone could fall behind and never
catch up, endlessly loading everyone
else
– What if some process has lots of stuff
others want and they bombard him with
requests?
– What about scalability in buffering and
in list of members of the system, or
costs of updating that list?
Optimizations
• Request retransmissions most
recent multicast first
• Idea is to “catch up quickly”
leaving at most one gap in the
retrieved sequence
Optimizations
• Participants bound the amount
of data they will retransmit
during any given round of
gossip. If too much is solicited
they ignore the excess requests
Optimizations
• Label each gossip message
with the sender's gossip round
number
• Ignore solicitations with an
expired round number,
reasoning that they arrived very
late and are probably no
longer correct
Optimizations
• Don’t retransmit same message
twice in a row to any given
destination (the copy may still
be in transit hence request may
be redundant)
Optimizations
• Use IP multicast when
retransmitting a message if several
processes lack a copy
– For example, if solicited twice
– Also, if a retransmission is received
from “far away”
– Tradeoff: excess messages versus low
latency
• Use regional TTL to restrict
multicast scope
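Several of these optimizations can be combined in one hedged sketch (the function and the request fields are invented for illustration): serve solicitations most-recent-first so requesters catch up quickly, drop requests stamped with an expired round number, and stop once a per-round retransmission budget is spent.

```python
def answer_solicitations(history, requests, current_round, byte_budget):
    """Answer gossip solicitations under the optimizations above (sketch).
    `history` maps msg_id -> body; each request carries the gossip round
    number it was issued in."""
    served, spent = [], 0
    # Ignore solicitations whose round number has expired: they arrived
    # very late and are probably no longer correct.
    live = [r for r in requests if r["round"] == current_round]
    # Most recent multicast first, so the requester "catches up quickly".
    for req in sorted(live, key=lambda r: r["msg_id"], reverse=True):
        body = history.get(req["msg_id"])
        if body is None:
            continue
        if spent + len(body) > byte_budget:
            break  # too much was solicited: ignore the excess requests
        served.append(req["msg_id"])
        spent += len(body)
    return served
```

Bounding the work done per round is what keeps one lagging process from endlessly loading everyone else.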
Scalability
• Protocol is scalable except for
its use of the membership of
the full process group
• Updates could be costly
• Size of list could be costly
• In large groups, would also
prefer not to gossip over long
high-latency links
Can extend pbcast to
solve both
• Could use IP multicast to send initial
message. (Right now, we have a
tree-structured alternative, but to
use it, need to know the
membership)
• Tell each process only about some
subset k of the processes, k << N
• Keeps costs constant.
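Keeping costs constant with partial membership can be sketched as a bounded random view (illustrative only; the lecture does not specify the exact mechanism):

```python
import random

def refresh_partial_view(view, learned, k):
    """Maintain a membership view of at most k members (k << N), folding
    in members learned through gossip and keeping a random subset."""
    merged = list(set(view) | set(learned))
    random.shuffle(merged)  # unbiased choice of which members to retain
    return merged[:k]
```

Each process then gossips only within its small view, so per-process state and update traffic stay constant as N grows.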
Router overload problem
• Random gossip can overload a
central router
• Yet information flowing through
this router is of diminishing
quality as rate of gossip rises
• Insight: constant rate of gossip
is achievable and adequate
[Figure: a two-cluster network in which each cluster's senders and
receivers share 10Mbit links, joined by a router with limited bandwidth
(1500Kbits) and a high noise rate. Graphs compare uniform and
hierarchical gossip: throughput (nmsgs/sec) over 100 seconds, and
throughput as group size grows to 30.]
Hierarchical Gossip
• Weight gossip so that
probability of gossip to a
remote cluster is smaller
• Can adjust weight to have
constant load on router
• Now propagation delays rise…
but just increase rate of gossip
to compensate
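The weighting idea can be sketched as biased peer selection. Here `remote_weight` is an assumed tuning knob; the real protocol derives its weights from the topology so that router load stays constant:

```python
import random

def pick_gossip_target(local_peers, remote_peers, remote_weight=0.1):
    """Choose a gossip target so that only a small, fixed fraction of
    gossip crosses to a remote cluster through the central router."""
    if remote_peers and random.random() < remote_weight:
        return random.choice(remote_peers)
    return random.choice(local_peers)
```

Raising the overall gossip rate then compensates for the slower cross-cluster propagation without raising the absolute load on the router proportionally.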
Hierarchical Gossip
[Figure: uniform, hierarchical, and fast hierarchical gossip compared as
group size grows from 0 to 30: propagation time (ms, up to 1500) and
throughput (nmsgs/sec, up to 150).]
Remainder of talk
• Show results of formal analysis
– We developed a model (won’t do the
math here -- nothing very fancy)
– Used model to solve for expected
reliability
• Then show more experimental data
• Real question: what would pbcast
“do” in the Internet? Our
experience: it works!
Idea behind analysis
• Can use the mathematics of
epidemic theory to predict
reliability of the protocol
• Assume an initial state
• Now look at result of running B
rounds of gossip: converges
exponentially quickly towards
atomic delivery
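The exponential convergence can be illustrated with a simple mean-field recurrence (an illustrative approximation, not the paper's exact analysis): if a fraction p of the n processes holds the message and each holder gossips to `fanout` random targets, a susceptible process is missed by all of them with probability about (1 - 1/n)^(fanout * p * n).

```python
def expected_infected(n, rounds, fanout=1, p0=None):
    """Mean-field estimate of the fraction of processes holding the
    message after the given number of gossip rounds."""
    p = p0 if p0 is not None else 1.0 / n   # initially one process has it
    for _ in range(rounds):
        # Probability a given susceptible process hears no gossip this round.
        miss = (1.0 - 1.0 / n) ** (fanout * p * n)
        p = p + (1.0 - p) * (1.0 - miss)
    return p
```

For n = 50 this estimate climbs past 0.99 within about a dozen rounds, mirroring the exponentially quick convergence toward atomic delivery.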
Pbcast bimodal delivery distribution
[Figure: p{#processes=k}, the probability that exactly k of 50 processes
deliver the pbcast, on a log scale from 1 down to 1e-30. The mass is
concentrated near k=0 and k=50, i.e. the distribution is bimodal.]
Failure analysis
• Suppose someone tells me what
they hope to “avoid”
• Model as a predicate on final
system state
• Can compute the probability
that pbcast would terminate in
that state, again from the model
Two predicates
• Predicate I: A faulty outcome is
one where more than 10% but
less than 90% of the processes
get the multicast
…call this the “General’s
predicate” because he views it
as a disaster if many but not
most of his troops attack
Two predicates
• Predicate II: A faulty outcome is one
where roughly half get the multicast
and failures might “conceal” true
outcome
… this would make sense if using
pbcast to distribute quorum-style
updates to replicated data. The
costly hence undesired outcome is
the one where we need to rollback
because outcome is “uncertain”
Two predicates
• Predicate I: More than 10% but less
than 90% of the processes get the
multicast
• Predicate II: Roughly half get the
multicast but crash failures might
“conceal” outcome
• Easy to add your own predicate. Our
methodology supports any predicate
over final system state
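Both predicates are just boolean functions over the final system state, which is what makes the methodology extensible. A sketch (Predicate II's formalization here is one plausible reading of "failures might conceal the outcome"):

```python
def predicate_I(delivered, n):
    """General's predicate: faulty if more than 10% but less than 90%
    of the n processes got the multicast."""
    return 0.1 * n < delivered < 0.9 * n

def predicate_II(delivered, crashed, n):
    """Quorum-style predicate: faulty if the processes known to have
    delivered fall short of a majority, yet the crashed processes could
    have tipped the outcome either way."""
    majority = n // 2 + 1
    return delivered < majority <= delivered + crashed
```

The analysis then sums, over final states, the probability that the model assigns to states satisfying the chosen predicate.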
Scalability of Pbcast reliability
[Figure: P{failure} for Predicates I and II, on a log scale from 1e-5
down to 1e-35, as the number of processes grows from 10 to 60.]
Effects of fanout on reliability
[Figure: P{failure} for Predicates I and II, on a log scale from 1 down
to 1e-16, as the gossip fanout varies from 1 to 10.]
Fanout required for a specified reliability
[Figure: the fanout (between 4 and 9) needed to reach 1e-8 reliability
for Predicate I and 1e-12 reliability for Predicate II, as the number of
processes grows from 20 to 50.]
Figure 5: Graphs of analytical results (a composite of the four graphs
above: the bimodal delivery distribution, scalability of pbcast
reliability, effects of fanout on reliability, and fanout required for a
specified reliability).
Discussion
• We see that pbcast is indeed
bimodal even in worst case, when
initial multicast fails
• Can easily tune parameters to obtain
desired guarantees of reliability
• Protocol is suitable for use in
applications where bounded risk of
undesired outcome is sufficient
Model makes
assumptions...
• These are rather simplistic
• Yet the model seems to predict
behavior in real networks, anyhow
• In effect, the protocol is not merely
robust to process perturbation and
message loss, but also to
perturbation of the model itself
• Speculate that this is due to the
incredible power of exponential
convergence...
Experimental work
• SP2 is a large network
– Nodes are basically UNIX workstations
– Interconnect is basically an ATM
network
– Software is standard Internet stack
(TCP, UDP)
• We obtained access to as many as
128 nodes on Cornell SP2 in Theory
Center
• Ran pbcast on this, and also ran a
second implementation on a real
network
Example of a question
• Create a group of 8 members
• Perturb one member in style of
Figure 1
• Now look at “stability” of throughput
– Measure rate of received messages
during periods of 100ms each
– Plot histogram over life of experiment
[Figure: histograms of inter-arrival spacing (ms), as probability of
occurrence, for Ensemble's FIFO virtual synchrony protocol and for
pbcast, each with sleep probabilities of .05 and .45.]
Now revisit Figure 1 in
detail
• Take 8 machines
• Perturb 1
• Pump data in at varying rates,
look at rate of received
messages
Throughput with
perturbed processes
[Figure: high-bandwidth and low-bandwidth comparisons of pbcast and the
traditional protocol with one perturbed member: average throughput
(0-200 msgs/sec) at perturbed and unperturbed hosts, as the perturb rate
varies from 0.1 to 0.9.]
Throughput variation as
a function of scale
[Figure: mean and standard deviation of pbcast throughput (roughly
180-220 msgs/sec) for 16-, 96-, and 128-member groups as the perturb
rate varies from 0 to 0.5, plus the standard deviation of throughput as
a function of process group size up to 150.]
Impact of packet loss on
reliability and
retransmission rate
[Figure: left, pbcast with system-wide message loss at high and low
bandwidth for groups of 8, 32, 64, and 96 members: average receiver
throughput as the system-wide drop rate rises to 0.2. Right, pbcast
background overhead with 25% of processes perturbed: percentage of
retransmitted messages for 8, 16, 64, and 128 nodes as the perturb rate
rises to 0.5.]
Notice that when the network becomes
overloaded, healthy processes
experience packet loss!
What about growth of
overhead?
• Look at messages other than original
data distribution multicast
• Measure worst case scenario: costs
at main generator of multicasts
• Side remark: all of these graphs look
identical with multiple senders or if
overhead is measured elsewhere….
[Figure: average overhead (0-100 msgs/sec) versus perturb rate (0.1-0.5)
for four configurations: 8 nodes with 2 perturbed, 16 nodes with 4
perturbed, 64 nodes with 16 perturbed, and 128 nodes with 32 perturbed.]
Growth of Overhead?
• Clearly, overhead does grow
• We know it will be bounded
except for probabilistic
phenomena
• At peak, load is still fairly low
Pbcast versus SRM, 0.1%
packet loss rate on all links
[Figure: requests/sec and repairs/sec received as group size grows to
100, for Pbcast, Pbcast-IPMC, SRM, and adaptive SRM, with system-wide
constant noise on tree and star topologies.]
Pbcast versus SRM: link utilization
[Figure: link utilization on an incoming link to the sender and on an
outgoing link from the sender, as group size grows to 100, for Pbcast,
Pbcast-IPMC, SRM, and adaptive SRM, with system-wide constant noise on a
tree topology.]

Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate
[Figure: requests/sec and repairs/sec received as group size grows to
120, for Pbcast, Pbcast-IPMC, SRM, and adaptive SRM, with 0.1%
system-wide constant noise on a 1000-node tree topology.]
Pbcast versus SRM: interarrival spacing
(500 nodes, 300 members, 1.0% packet loss)
[Figure: histograms of the percentage of occurrences of each latency
(0-0.5 seconds), at node level and after FIFO ordering for pbcast-grb,
and at node level for SRM, with 1% system-wide noise.]
Real Data: Spinglass on a 10Mbit
ethernet (35 UltraSPARCs)
[Figure: message rate (#msgs) over 100 seconds with injected noise,
first with the retransmission limit disabled, then re-enabled.]
Networks structured as clusters
[Figure: two clusters, each containing a sender and receivers on 10Mbit
links, joined by a limited-bandwidth (1500Kbits) link with a high noise
rate.]

Delivery latency in a 2-cluster LAN, 50%
noise between clusters, 1% elsewhere
[Figure: histograms of the percentage of occurrences of each delivery
latency (0-6 seconds) for 80 members: pbcast-grb at node level and after
FIFO ordering, versus SRM and adaptive SRM.]
Requests/repairs and latencies
with bounded router bandwidth
[Figure: with limited bandwidth on the router and 1% system-wide
constant noise, requests/sec and repairs/sec received for pbcast-grb and
SRM as group size grows to 120, plus histograms of latency (0-20
seconds) at node level and after FIFO ordering for 100 members.]
Discussion
• Saw that stability of protocol is
exceptional even under heavy
perturbation
• Overhead is low and stays low with
system size, bounded even for heavy
perturbation
• Throughput is extremely steady
• In contrast, virtual synchrony and
SRM both are fragile under this sort
of attack
Programming with
pbcast?
• Most often would want to split
application into multiple subsystems
– Use pbcast for subsystems that
generate regular flow of data and can
tolerate infrequent loss if risk is
bounded
– Use stronger properties for subsystems
with less load and that need high
availability and consistency at all times
Programming with
pbcast?
• In stock exchange, use pbcast for
pricing but abcast for “control”
operations
• In hospital use pbcast for telemetry
data but use abcast when changing
medication
• In air traffic system use pbcast for
routine radar track updates but
abcast when pilot registers a flight
plan change
Our vision: One protocol
side-by-side with the other
• Use virtual synchrony for replicated
data and control actions, where
strong guarantees are needed for
safety
• Use pbcast for high data rates,
steady flows of information, where
longer term properties are critical
but individual multicast is of less
critical importance
Next steps
• Currently exploring memory use
patterns and scheduling issues for
the protocol
• Will study:
– How much memory is really needed?
Can we configure some processes with
small buffers?
– When to use IP multicast and what TTL
to use
– Exploiting network topology information
– Application feedback options
Summary
• New data point in spectrum
– Virtual synchrony protocols (atomic
multicast as normally implemented)
– Bimodal probabilistic multicast
– Scalable reliable multicast
• Demonstrated that pbcast is suitable
for analytic work
• Saw that it has exceptional stability
Epilogue
• Baron: “Although successful, your
attack was reckless! It is
impossible to achieve consensus
on attack under such conditions!”
• General: “I merely took a
calculated risk….”