CS514: Intermediate Course in Operating Systems Professor Ken Birman Vivek Vishnumurthy: TA

Autonomic computing


Hot new area intended as a response to
some of our concerns
Basic idea is to emulate behaviors
common in biological systems


For example: if you rush to class, your
heart beats faster and you might sweat a
little… but this isn’t something you “intend”
The response is an “autonomic” one
Goals for autonomic systems

The so-called “self-*” properties






Self-installing
Self-configuring
Self-monitoring
Self-diagnosing and self-repairing
Adaptive when loads or resources change
Can we create autonomic systems?
What makes it hard?

From the inside of a system, options are
often very limited


Last time we saw that even detecting a
failure can be very hard… and that if we
can’t be sure, making “decisions” can be
even harder!
Also, most systems are designed for
simplicity and speed. Self-* mechanisms
add complexity and overhead
Modern approach


Perhaps better to think of external
platform tools that help a system out
The platform can automate tasks that a
user might find hard to do on their own


Such as restart after a failure
You tell the platform “please keep 5 copies
of my service running.” If a copy crashes,
it finds an unloaded node and restarts it
Slight shift in perspective


Instead of each application needing to
address these hard problems… we can
shift the role to standardized software
It may have ways to solve hard
problems that end-users can’t access


Like ways to ask hardware for “help”
Or lots of ways to sense status
Werner Vogels (Amazon CTO)

Discussed “world wide failure detectors”



Issue: How to sense failure
We saw that this is hard to get right
A neighbor’s mailbox is overflowing…
should you call 911?


Leaving the mail out isn’t “proof of death”
Many other ways to sense “health”
How can a platform check health?





Application exits but O/S is still running
O/S reboots itself
NIC card loses carrier, then regains it
after communication links have broken
O/S may have multiple communication
paths… even if the application gets
“locked” onto just one path
… the list goes on and on
Vogels proposal?

He urges that we build large-scale
“system membership services”



The service would track membership in the
data center or network
Anyone talking to a component would also
register themselves, and their “interests”
with the service
It reports problems, consistently and
quickly
Such a service can overcome
Jack and Jill’s problem!


They couldn’t agree on status
But a service can make a rule


Even if an application is running, if it loses
connection to a majority of the servers
running the “health service”, we consider it
to have crashed.
With this rule, the health abstraction
can be implemented by the platform!
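
A minimal sketch of how that rule might be applied, assuming a hypothetical list of health-service servers and a ping-style reachability check (both are illustrative, not part of the original design):

def is_considered_crashed(process, health_servers, ping):
    """Apply the platform rule (sketch): a process that cannot reach a
    majority of the health-service servers is treated as crashed,
    even if it is in fact still running."""
    reachable = sum(1 for server in health_servers if ping(process, server))
    majority = len(health_servers) // 2 + 1
    return reachable < majority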
Jack and Jill with a Failure Detector



Jack and Jill agree to check their mail at
least once every ten minutes
The failure detector, running as a
system service, monitors their actions
A failure to check mail triggers a
system-wide notification

Terrible news. Sad tidings. Jack is dead!

If it makes a mistake… tough luck!
How to make TCP use this

Take TCP



Disable the “SO_KEEPALIVE” feature
Now TCP won’t sense timeouts and hence
will never break a connection
Now write a wrapper


User makes a TCP connection… wrapper
registers with the health service
Health problem? Break the connection…
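
A rough Python sketch of that wrapper idea. SO_KEEPALIVE is a real socket option; the health_service object and its register_interest call are illustrative assumptions, not an actual API:

import socket

def open_monitored_connection(host, port, health_service):
    """Open a TCP connection whose liveness is decided by the health
    service rather than by TCP's own keepalive timeouts (sketch)."""
    sock = socket.create_connection((host, port))
    # Disable SO_KEEPALIVE so TCP itself never times the peer out.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 0)
    # Register interest in the peer with the (hypothetical) health service;
    # if it reports a problem, the wrapper breaks the connection itself.
    health_service.register_interest(peer=(host, port),
                                     on_failure=lambda: sock.close())
    return sock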
A health service is an
autonomic construct

How else could we build autonomic
platform tools?

For example, could we build a tool to
robustly notify all the applications when
something important happens?


E.g. “System overload! Please scale back all
non-vital functionality”
Could we build a tool to “make a map”
showing the status of a large system?
Gossip: A valuable tool…


So-called gossip protocols can be robust
even when everything else is
malfunctioning
Idea is to build a distributed protocol a
bit like gossip among humans


“Did you hear that Sally and John are
going out?”
Gossip spreads like lightning…
Gossip: basic idea

Node A encounters “randomly selected”
node B (might not be totally random)

Gossip push (“rumor mongering”): A tells B something B doesn’t know

Gossip pull (“anti-entropy”): A asks B for something it is trying to “find”

Push-pull gossip: combines both mechanisms
Definition: A gossip protocol…




Uses random pairwise state merge
Runs at a steady rate (and this rate is
much slower than the network RTT)
Uses bounded-size messages
Does not depend on messages getting
through reliably
Gossip benefits… and limitations




Benefits:
Information flows around disruptions
Scales very well
Typically reacts to new events in log(N) time
Can be made self-repairing

Limitations:
Rather slow
Very redundant
Guarantees are at best probabilistic
Depends heavily on the randomness of the peer selection
For example



We could use gossip to track the health
of system components
We can use gossip to report when
something important happens
In the remainder of today’s talk we’ll
focus on event notifications. Next week
we’ll look at some of these other uses
Typical push-pull protocol

Nodes have some form of database of
participating machines


Could have a hacked bootstrap, then use gossip to
keep this up to date!
Set a timer and when it goes off, select a
peer within the database



Send it some form of “state digest”
It responds with data you need and its own state
digest
You respond with data it needs
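
A sketch of one such push-pull round, assuming each node keeps its gossip state in a dict keyed by item id with a version and value; the send/recv helpers stand in for whatever transport is used:

import random

def gossip_round(my_state, membership, send, recv):
    """One push-pull exchange (sketch): send a digest (ids and versions
    only), receive the items we are missing plus the peer's digest, then
    push back whatever the peer is missing. send/recv are assumed helpers."""
    peer = random.choice(membership)
    my_digest = {key: version for key, (version, _value) in my_state.items()}
    send(peer, ("digest", my_digest))
    missing_items, peer_digest = recv(peer)       # peer replies with both
    my_state.update(missing_items)                # pull: fill our own gaps
    for key, (version, value) in my_state.items():
        if peer_digest.get(key, -1) < version:
            send(peer, ("update", key, (version, value)))   # push: fill the peer's gaps

A timer firing at the steady gossip rate from the definition above would invoke gossip_round repeatedly.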
Where did the “state” come
from?



The data eligible for gossip is usually
kept in some sort of table accessible to
the gossip protocol
This way a separate thread can run the
gossip protocol
It does upcalls to the application when
incoming information is received
Gossip often runs over UDP

Recall that UDP is an “unreliable” datagram
protocol supported in the Internet




Unlike for TCP, data can be lost
Also packets have a maximum size, usually 4k or
8k bytes (you can control this)
Larger packets are more likely to get lost!
What if a packet would get too large?

Gossip layer needs to pick the most valuable stuff
to include, and leave out the rest!
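
One way that selection might look, as a sketch: score each candidate item (for example by freshness or by how rarely it has been gossiped) and greedily pack the most valuable items into a fixed-size UDP payload. The size budget and scoring hooks here are illustrative:

MAX_PAYLOAD = 8 * 1024   # bytes; an illustrative UDP size budget

def pack_gossip_message(candidates, encoded_size, value):
    """Greedily fill one UDP datagram with the most valuable items,
    leaving the rest for later gossip rounds (sketch)."""
    chosen, used = [], 0
    for item in sorted(candidates, key=value, reverse=True):
        size = encoded_size(item)
        if used + size <= MAX_PAYLOAD:
            chosen.append(item)
            used += size
    return chosen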
Algorithms that use gossip


Gossip is a hot topic!
Can be used to…





Notify applications about some event
Track the status of applications in a system
Organize the nodes in some way (like into
a tree, or even sorted by some index)
Find “things” (like files)
Let’s look closely at an example
Bimodal multicast


This is Cornell work from about 10
years ago
Goal is to combine gossip with UDP
(also called IP) multicast to make a very
robust multicast protocol
Stock Exchange Problem:
Sometimes, someone is slow…
Most members are
healthy….
… but one is slow
i.e. something is contending with the red process,
delaying its handling of incoming messages…
With classical reliable multicast, throughput
collapses as the system scales up!
Even if we have just
one slow receiver…
as the group gets
larger (hence more
healthy receivers),
impact of a
performance
perturbation is more
and more evident!
[Figure: Virtually synchronous Ensemble multicast protocols. Average throughput on non-perturbed members (msgs/sec, 0-250) versus perturb rate (0-0.9) for group sizes 32, 64, and 96.]
Why does this happen?


Most reliable multicast protocols are based on
an ACK/NAK scheme (like TCP but with multiple
receivers). Sender retransmits lost packets.
As number of receivers gets large ACKS/NAKS
pile up (sender has more and more work to do)



Hence it needs longer to discover problems
And this causes it to buffer messages for longer and
longer… hence flow control kicks in!
So the whole group slows down
Start by using unreliable UDP multicast to
rapidly distribute the message. But some
messages may not get through, and some
processes may be faulty. So initial state
involves partial distribution of multicast(s)
Periodically (e.g. every 100ms) each process
sends a digest describing its state to some
randomly selected group member. The digest
identifies messages. It doesn’t include them.
Recipient checks the gossip digest against its
own history and solicits a copy of any missing
message from the process that sent the gossip
Processes respond to solicitations received
during a round of gossip by retransmitting the
requested message. The round lasts much longer
than a typical RPC time.
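
Putting those steps together, a very compressed sketch of one pbcast process; the ip_multicast and send_to helpers and the 100 ms round timer are assumptions for illustration:

import random

class PbcastProcess:
    """Sketch of the bimodal multicast loop: unreliable IP multicast for
    the initial delivery, then periodic digest gossip to repair gaps."""

    def __init__(self, members, ip_multicast, send_to):
        self.history = {}              # seqno -> message body we have received
        self.members = members         # other group members
        self.ip_multicast = ip_multicast
        self.send_to = send_to

    def multicast(self, seqno, msg):
        self.history[seqno] = msg
        self.ip_multicast((seqno, msg))            # best effort; may be lost

    def gossip_round(self):                        # run e.g. every 100 ms
        peer = random.choice(self.members)
        digest = set(self.history)                 # message ids only, not bodies
        self.send_to(peer, ("digest", digest))

    def on_digest(self, sender, digest):
        missing = digest - set(self.history)
        if missing:
            self.send_to(sender, ("solicit", missing))

    def on_solicit(self, sender, wanted):
        for seqno in wanted & set(self.history):
            self.send_to(sender, ("retransmit", seqno, self.history[seqno]))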
Delivery? Garbage Collection?

Deliver a message when it is in FIFO order



Report an unrecoverable loss if a gap persists for
so long that recovery is deemed “impractical”
Garbage collect a message when you believe
that no “healthy” process could still need a
copy (we used to wait 10 rounds, but now
are using gossip to detect this condition)
Match parameters to intended environment
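
A small sketch of the per-sender delivery rule; the caller increments gap_rounds each gossip round in which next_expected is still missing, and the round limit here is illustrative:

def try_deliver(received, next_expected, gap_rounds, deliver, report_loss,
                max_gap_rounds=20):
    """Deliver messages in FIFO order; if a gap persists for too many
    gossip rounds, report an unrecoverable loss and move on (sketch).
    Garbage collection of old messages (e.g. some rounds after everyone
    is believed to have a copy) is not shown."""
    while True:
        if next_expected in received:
            deliver(received[next_expected])
            next_expected += 1
            gap_rounds = 0
        elif gap_rounds > max_gap_rounds:
            report_loss(next_expected)        # gap deemed unrecoverable
            next_expected += 1
            gap_rounds = 0
        else:
            return next_expected, gap_rounds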
Need to bound costs

Worries:



Someone could fall behind and never catch
up, endlessly loading everyone else
What if some process has lots of stuff
others want and they bombard him with
requests?
What about scalability in buffering and in
list of members of the system, or costs of
updating that list?
Optimizations


Request retransmission of the most recent
multicast first
Idea is to “catch up quickly” leaving at
most one gap in the retrieved sequence
Optimizations

Participants bound the amount of data
they will retransmit during any given
round of gossip. If too much is solicited
they ignore the excess requests
Optimizations


Label each gossip message with the
sender’s gossip round number
Ignore solicitations that carry an expired
round number, reasoning that they
arrived very late and hence are probably
no longer relevant
Optimizations

Don’t retransmit same message twice in
a row to any given destination (the
copy may still be in transit hence
request may be redundant)
Optimizations

Use UDP multicast when retransmitting a
message if several processes lack a copy




For example, if solicited twice
Also, if a retransmission is received from “far
away”
Tradeoff: excess messages versus low latency
Use regional TTL to restrict multicast scope
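
Several of these optimizations amount to simple bookkeeping in the solicitation handler. A sketch, where the request field names, the per-round budget, and the duplicate-suppression window are all illustrative:

def handle_solicitations(solicitations, history, current_round, last_sent,
                         send_to, budget=50):
    """Answer retransmission requests subject to a per-round budget,
    dropping requests whose round number has expired and suppressing
    back-to-back duplicates (sketch)."""
    sent = 0
    for req in solicitations:
        if req.round < current_round:
            continue                     # arrived very late; ignore it
        if last_sent.get((req.sender, req.seqno)) == current_round - 1:
            continue                     # just sent it; that copy may still be in transit
        if sent >= budget:
            break                        # ignore the excess this round
        send_to(req.sender, ("retransmit", req.seqno, history[req.seqno]))
        last_sent[(req.sender, req.seqno)] = current_round
        sent += 1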
Scalability




Protocol is scalable except for its use of
the membership of the full process
group
Updates could be costly
Size of list could be costly
In large groups, would also prefer not
to gossip over long high-latency links
Router overload problem



Random gossip can overload a central
router
Yet information flowing through this
router is of diminishing quality as rate
of gossip rises
Insight: constant rate of gossip is
achievable and adequate
[Figure: Two clusters, each with a sender and receivers, joined by a router link with limited bandwidth (1500 Kbits) and a high noise rate; other links are 10 Mbits. Graphs compare uniform and hierarchical gossip: load on the WAN link (msgs/sec) over time (sec), and delivery latency (ms) versus group size.]
Hierarchical Gossip



Weight gossip so that probability of
gossip to a remote cluster is smaller
Can adjust weight to have constant load
on router
Now propagation delays rise… but just
increase rate of gossip to compensate
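
A sketch of the weighted peer selection, where the probability of gossiping outside the local cluster is kept small; the remote_probability value is illustrative and would be tuned so the expected cross-router message rate stays constant:

import random

def pick_peer(local_cluster, remote_clusters, self_id, remote_probability=0.1):
    """Hierarchical gossip peer selection (sketch): gossip mostly inside
    the local cluster, only occasionally across the router link."""
    if remote_clusters and random.random() < remote_probability:
        cluster = random.choice(remote_clusters)
    else:
        cluster = [p for p in local_cluster if p != self_id]
    return random.choice(cluster)

To keep router load constant, remote_probability is scaled down as the cluster grows, and the overall gossip rate is increased to keep propagation delay low.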
[Figure: Messages/sec and propagation time (ms) versus group size for uniform, hierarchical, and fast hierarchical gossip.]
Idea behind analysis



Can use the mathematics of epidemic
theory to predict reliability of the
protocol
Assume an initial state
Now look at result of running B rounds
of gossip: converges exponentially
quickly towards atomic delivery
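
As a rough illustration of the epidemic-theory flavor of this analysis (a mean-field push-gossip recurrence, not the actual pbcast model), the fraction of processes holding the message grows toward 1 exponentially quickly in the number of rounds:

def expected_delivery_fraction(n, rounds, fanout=1):
    """Mean-field push-gossip recurrence (illustrative only): each process
    that already has the message contacts `fanout` random peers per round."""
    have = 1.0 / n                           # only the sender at the start
    for _ in range(rounds):
        # Probability that a process still missing the message is contacted
        # by at least one of the roughly fanout*have*n pushes this round.
        p_contacted = 1.0 - (1.0 - 1.0 / n) ** (fanout * have * n)
        have += (1.0 - have) * p_contacted
    return have      # converges toward 1.0 exponentially fast as rounds grows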
[Figure: Pbcast bimodal delivery distribution. P{#processes = k} (log scale, 1E+00 down to 1E-30) versus the number of processes to deliver pbcast (0 to 50): either the sender fails, or the data gets through with high probability.]
Failure analysis



Suppose someone tells me what they
hope to “avoid”
Model as a predicate on final system
state
Can compute the probability that pbcast
would terminate in that state, again
from the model
Two predicates

Predicate I: A faulty outcome is one
where more than 10% but less than
90% of the processes get the multicast
… Think of a probabilistic Byzantine
General’s problem: a disaster if many
but not most troops attack
Two predicates

Predicate II: A faulty outcome is one where
roughly half get the multicast and failures
might “conceal” true outcome
… this would make sense if using pbcast to
distribute quorum-style updates to replicated
data. The costly hence undesired outcome is
the one where we need to rollback because
outcome is “uncertain”
Two predicates



Predicate I: More than 10% but less than
90% of the processes get the multicast
Predicate II: Roughly half get the multicast
but crash failures might “conceal” outcome
Easy to add your own predicate. Our
methodology supports any predicate over
final system state
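
For concreteness, the two predicates can be written as simple functions over the final system state; in this sketch the state is reduced to the number of processes that delivered the multicast out of n, and Predicate II is simplified to "roughly half", ignoring the crash-concealment subtlety from the slide:

def predicate_I(delivered, n):
    """Faulty outcome: more than 10% but fewer than 90% of processes delivered."""
    return 0.10 * n < delivered < 0.90 * n

def predicate_II(delivered, n, slack=0.10):
    """Faulty outcome (simplified): roughly half delivered, the case where a
    quorum-style update would have to be rolled back as 'uncertain'."""
    return abs(delivered - n / 2.0) <= slack * n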
Scalability of Pbcast reliability
[Figure: P{failure} (log scale, 1E-05 down to 1E-35) versus number of processes in the system (10 to 60), for Predicate I and Predicate II.]
Effects of fanout on reliability
[Figure: P{failure} (log scale, 1E+00 down to 1E-16) versus fanout (1 to 10), for Predicate I and Predicate II.]
Fanout required for a specified reliability
[Figure: Required fanout (4 to 9) versus number of processes in the system (20 to 50): Predicate I for 1E-8 reliability and Predicate II for 1E-12 reliability.]
Figure 5: Graphs of analytical results
[Composite figure repeating the four plots above: the Pbcast bimodal delivery distribution, P{failure} versus number of processes in the system, the effect of fanout on reliability, and the fanout required for a specified reliability (Predicate I at 1E-8, Predicate II at 1E-12).]
Experimental work

SP2 is a large network





Nodes are basically UNIX workstations
Interconnect is basically an ATM network
Software is standard Internet stack (TCP, UDP)
We obtained access to as many as 128 nodes
on the Cornell SP2 in the Theory Center
Ran pbcast on this, and also ran a second
implementation on a real network
Example of a question



Create a group of 8 members
Perturb one member in style of Figure 1
Now look at “stability” of throughput


Measure rate of received messages during
periods of 100ms each
Plot histogram over life of experiment
Source to dest latency distributions
[Figure: Histograms of throughput, plotted as probability of occurrence versus inter-arrival spacing (ms), for Ensemble’s FIFO virtual synchrony protocol (traditional, with .05 and .45 sleep probability) and for pbcast (with .05 and .45 sleep probability). Notice that in practice, bimodal multicast is fast!]
Now revisit Figure 1 in detail



Take 8 machines
Perturb 1
Pump data in at varying rates, look at
rate of received messages
Revisit our original scenario with
perturbations (32 processes)
[Figure: High-bandwidth and low-bandwidth comparisons of pbcast and traditional multicast performance at faulty and correct hosts: average throughput (msgs/sec) versus perturb rate (0.1 to 0.9), measured at perturbed and unperturbed hosts.]
Throughput variation as a
function of scale
[Figure: Mean and standard deviation of pbcast throughput (msgs/sec) versus perturb rate (0 to 0.5) for 16-, 96-, and 128-member groups, plus the standard deviation of pbcast throughput versus process group size.]
Impact of packet loss on
reliability and retransmission rate
[Figure: Pbcast with system-wide message loss (high and low bandwidth): average throughput of receivers versus system-wide drop rate (0 to 0.2) for 8 to 96 nodes; and pbcast background overhead with 25% of processes perturbed: retransmitted messages (%) versus perturb rate (0 to 0.5) for 8 to 128 nodes. Notice that when the network becomes overloaded, healthy processes experience packet loss!]
What about growth of
overhead?



Look at messages other than original
data distribution multicast
Measure worst case scenario: costs at
main generator of multicasts
Side remark: all of these graphs look
identical with multiple senders or if
overhead is measured elsewhere….
[Figure: Average overhead (msgs/sec, 0 to 100) versus perturb rate (0.1 to 0.5) for 8 nodes with 2 perturbed processes, 16 nodes with 4 perturbed, 64 nodes with 16 perturbed, and 128 nodes with 32 perturbed.]
Growth of Overhead?



Clearly, overhead does grow
We know it will be bounded except for
probabilistic phenomena
At peak, load is still fairly low
Pbcast versus SRM, 0.1%
packet loss rate on all links
[Figure: PBCAST and SRM with system-wide constant noise, on tree and star topologies: repair requests/sec and repairs/sec received versus group size (10 to 100), for SRM, adaptive SRM, Pbcast, and Pbcast-IPMC.]
Pbcast versus SRM: link
utilization
[Figure: PBCAST and SRM with system-wide constant noise, tree topology: link utilization on an outgoing link from the sender and on an incoming link to the sender, versus group size (10 to 100), for Pbcast, Pbcast-IPMC, SRM, and Adaptive SRM.]
Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate
[Figure: Requests/sec and repairs/sec received versus group size (0 to 120) for SRM, adaptive SRM, Pbcast, and Pbcast-IPMC, with 0.1% system-wide constant noise on a 1000-node tree topology.]
Pbcast Versus SRM:
Interarrival Spacing
Pbcast versus SRM: Interarrival spacing (500
nodes, 300 members, 1.0% packet loss)
[Figure: Percentage of occurrences versus latency (seconds, 0 to 0.5): pbcast-grb at node level, pbcast-grb after FIFO ordering, and SRM at node level, with 1% system-wide noise, 500 nodes, 300 members.]
Real Data: Bimodal Multicast on a
10Mbit Ethernet (35 UltraSPARCs)
Injected noise, retransmission
limit disabled
Injected noise, retransmission
limit re-enabled
[Figure: Received messages (#msgs) over time (sec, 0 to 100) with injected noise, first with the retransmission limit disabled and then re-enabled.]
Networks structured as
clusters
[Figure: Two clusters, each with a sender and receivers, connected by a limited-bandwidth (1500 Kbits) link with a high noise rate; other links are 10 Mbits.]
Delivery latency in a 2-cluster LAN,
50% noise between clusters, 1% elsewhere
[Figure: Percentage of occurrences versus delivery latency (seconds, 0 to 6) in a partitioned network with 50% noise between clusters and 1% system-wide noise, n=80: pbcast-grb at node level and after FIFO ordering, SRM, and adaptive SRM.]
Requests/repairs and latencies
with bounded router bandwidth
[Figure: With limited bandwidth on the router and 1% system-wide constant noise: requests/sec and repairs/sec received versus group size (0 to 120) for pbcast-grb and SRM, plus histograms of latency (seconds, 0 to 20) for pbcast-grb (at node level and after FIFO ordering) and SRM, n=100.]
Summary



Gossip is a valuable tool for addressing
some of the needs of modern
autonomic computing
Often paired with other mechanisms, e.g.
anti-entropy paired with UDP multicast
Solutions scale well (if well designed!)