PLATO: Predictive Latency-Aware Total Ordering
Mahesh Balakrishnan
Ken Birman
Amar Phanishayee
Total Ordering

- a.k.a. Atomic Broadcast: delivering messages to a set of nodes in the same order (interface sketched below)
- Messages arrive at nodes in different orders...
- Nodes agree on a single delivery order
- Messages are delivered at each node in the agreed order
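
To make the abstraction concrete, here is a minimal sketch of the interface a total ordering layer presents to the application; the class and method names are illustrative assumptions, not from the talk.

```python
class AtomicBroadcast:
    """Minimal sketch of a total ordering (atomic broadcast) layer.
    All names here are illustrative assumptions, not from the talk."""

    def broadcast(self, msg: bytes) -> None:
        """Send msg to every node in the group."""
        raise NotImplementedError

    def deliver(self, msg: bytes) -> None:
        """Upcall into the application. Every node observes the same
        sequence of deliver() calls, whatever order messages arrived in."""
        raise NotImplementedError
```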
Modern Datacenters

Applications
- E-tailers, finance, aerospace
- Service-Oriented Architectures, Publish-Subscribe, Distributed Objects, Event Notification...
- ... Totally Ordered Multicast!

Hardware
- Fast, high-capacity networks
- Failure-prone commodity nodes
Total Ordering in a Datacenter

[Figure: two clients send Update 1 and Update 2 to a replicated Inventory Service (Replica 1 and Replica 2); queries go to either replica. Updates are totally ordered across the replicas.]

- Totally Ordered Multicast is used to consistently update Replicated Services
- Latency of Multicast → System Consistency
- Requirement: order multicasts consistently, rapidly, robustly
Multicast Wishlist

- Low Latency!
- High (stable) throughput
  - Minimal, proactive overheads
- Leverage hardware properties
  - HW Multicast/Broadcast is fast but unreliable
- Handle varying data rates
  - Datacenter workloads have sharp spikes... and extended troughs!
State-of-the-Art

- Traditional protocols
  - Conservative
  - Latency-overhead tradeoff
- Example: Fixed Sequencer (sketched below)
  - Simple, works well
- Optimistic Total Ordering
  - Deliver optimistically, roll back if incorrect
  - Why this works: no out-of-order arrival in LANs
- Optimistic total ordering for datacenters?
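
As a reference point for the latency argument, here is a minimal sketch of a fixed-sequencer receiver; message formats and names are assumptions. Senders multicast data directly, one node acts as sequencer and multicasts slot assignments, and receivers deliver strictly in slot order.

```python
class FixedSequencerReceiver:
    """Sketch of the receiver side of a fixed-sequencer protocol.
    Names and message formats are illustrative assumptions."""

    def __init__(self):
        self.payloads = {}    # msg_id -> payload, awaiting a slot assignment
        self.slots = []       # slots[i] = msg_id assigned to slot i (or None)
        self.next_slot = 0    # next slot to deliver

    def on_data(self, msg_id, payload):
        """Data message from any sender (e.g., via hardware multicast)."""
        self.payloads[msg_id] = payload
        self._try_deliver()

    def on_sequence(self, slot, msg_id):
        """Ordering message from the sequencer: msg_id occupies `slot`."""
        while len(self.slots) <= slot:
            self.slots.append(None)
        self.slots[slot] = msg_id
        self._try_deliver()

    def _try_deliver(self):
        # Deliver only once both the payload and its slot are known, so
        # latency is bounded below by a round-trip through the sequencer.
        while (self.next_slot < len(self.slots)
               and self.slots[self.next_slot] in self.payloads):
            self.deliver(self.payloads.pop(self.slots[self.next_slot]))
            self.next_slot += 1

    def deliver(self, payload):
        print("deliver:", payload)
```

The conservatism is visible in `_try_deliver`: even when packets arrive in a consistent order at every node, delivery still waits for the sequencer's slot assignment. That round-trip is the latency PLATO tries to avoid.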
PLATO: Predictive Ordering

- In a datacenter, broadcast/multicast occurs almost instantaneously
  - Most of the time, messages arrive in the same order at all nodes
  - Some of the time, messages arrive in different orders at different nodes
- Can we predict out-of-order arrival?
Reasons for Disorder: Swaps

[Figure: Sender 1 and Sender 2 multicast through a pair of switches. Receiver 1 receives Sender 2's message after Sender 1's; Receiver 2 receives Sender 1's message after Sender 2's.]

Typical datacenter diameter: 50-500 microseconds
Out-of-order arrival can occur when the inter-send interval between two messages is smaller than the diameter of the network (simulated in the sketch below).
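
To make the timing condition concrete, a small simulation sketch; all latency numbers are made up, chosen within the 50-500 µs diameter quoted above.

```python
# Two senders multicast 30 us apart; their path latencies to the two
# receivers differ by more than 30 us, so the receivers see opposite orders.
inter_send_us = 30
send_time = {"sender1": 0, "sender2": inter_send_us}

latency_us = {  # one-way latency from sender to receiver (illustrative)
    ("sender1", "receiver1"): 60,  ("sender1", "receiver2"): 180,
    ("sender2", "receiver1"): 180, ("sender2", "receiver2"): 60,
}

for rcv in ("receiver1", "receiver2"):
    arrivals = sorted((send_time[s] + latency_us[(s, rcv)], s)
                      for s in ("sender1", "sender2"))
    print(rcv, "sees:", [s for _, s in arrivals])
# receiver1 sees: ['sender1', 'sender2']
# receiver2 sees: ['sender2', 'sender1']   <- a swap
```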
Reasons for Disorder: Loss

- Datacenter networks are over-provisioned
  - Loss never occurs in the network
- Datacenter nodes are cheap
  - Loss occurs due to end-host buffer overflows caused by CPU contention

[Figure: packets A through H arrive at a node over time t; after an end-host buffer overflow, the order of arrivals into user-space is both reordered and missing packets.]
Emulab Testbed (Utah)

[Figure: the Utah Emulab testbed. Nodes (600 MHz-3 GHz) attach over 100 Mb links to Cisco 6509 and 6513 switches interconnected by 1-8 Gb links.]

- Emulab2 test scenario: 2 switches of separation; one-way ping latency ~100 microseconds
- Emulab3 test scenario: 3 switches of separation; one-way ping latency ~110 microseconds
Cornell Testbed

[Figure: the Cornell testbed. 1.3 GHz nodes attach over 100 Mb links to HP Procurve 4000M switches, interconnected through HP Procurve 6108 switches over 1 Gb links.]

- Cornell3 test scenario: 3 switches of separation; one-way ping latency ~70 microseconds
- Cornell5 test scenario: 5 switches of separation; one-way ping latency ~110 microseconds
Disorder: Emulab3

[Graph: disorder vs. data rate on Emulab3.]

- The percentage of swaps and losses goes up with the data rate
- At 2800 packets per second, 2% of all packet pairs are swapped and 0.5% of packets are lost
Predicting Disorder

- Predictor: inter-arrival time of consecutive packets into user-space (sketched below)
- Why?
  - Swaps: simultaneous multicasts → low inter-arrival time
  - Loss: kernel buffer overflow → sequence of low inter-arrival times
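
A minimal sketch of the predictor (structure and names are assumptions, not the paper's code): a packet is flagged as potentially disordered when it arrives within Δ µs of its predecessor.

```python
import time

class DisorderPredictor:
    """Flags packets that arrive within delta_us of their predecessor."""

    def __init__(self, delta_us=128):   # 128 us: see the Cornell data below
        self.delta_us = delta_us
        self.prev_us = None

    def suspicious(self, arrival_us=None):
        """True if this arrival is 'too close' to the previous one."""
        if arrival_us is None:
            arrival_us = time.monotonic_ns() // 1000
        prev, self.prev_us = self.prev_us, arrival_us
        return prev is not None and arrival_us - prev < self.delta_us
```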
Predicting Disorder

[Graph: CDFs of inter-arrival times for swapped pairs vs. all pairs; Cornell datacenter, 400 multicasts/sec.]

95% of swaps and 14% of all pairs are within 128 µsecs
Predicting Disorder
PLATO Design


Heuristic: If two packets arrive within Δ
µsecs, possibility of disorder
PLATO



Heuristic + Lazy Fixed Sequencer
Heuristic works  ~ zero (Δ) latency
Heuristic fails  fixed sequencer latency
PLATO Design

API: optdeliver, confirm, revoke

Ordering Layer (sketched below):
- Pending Queue: packets suspected to be out-of-order, or queued behind suspected packets
- Suspicious Queue: packets optdelivered to the application but not yet confirmed
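
A simplified sketch of how these pieces might fit together. Only the upcall names (optdeliver, confirm, revoke) and the two queues come from the slides; the control flow and sequencer interface below are illustrative reconstructions.

```python
from collections import deque

class PlatoOrderingLayer:
    def __init__(self, app, delta_us):
        self.app = app                 # application exposing the upcalls
        self.delta_us = delta_us
        self.pending = deque()         # suspected packets + packets behind them
        self.suspicious = deque()      # optdelivered but not yet confirmed
        self.prev_us = None
        self.last_optdelivered = None  # most recent optdelivered packet, if any

    def on_packet(self, pkt, arrival_us):
        prev, self.prev_us = self.prev_us, arrival_us
        close = prev is not None and arrival_us - prev < self.delta_us
        if close and self.last_optdelivered is not None:
            # Predecessor was optdelivered but may be out of order: take it back.
            self.app.revoke(self.suspicious.pop())
            self.pending.append(self.last_optdelivered)
        if close or self.pending:
            self.pending.append(pkt)   # suspected, or queued behind a suspect
            self.last_optdelivered = None
        else:
            self.app.optdeliver(pkt)   # looks safe: deliver optimistically
            self.suspicious.append(pkt)
            self.last_optdelivered = pkt

    def on_sequencer_order(self, ordered_pkts):
        """Authoritative order from the lazy fixed sequencer. Assumes every
        packet in ordered_pkts has already arrived (loss handling elided)."""
        for pkt in ordered_pkts:
            # Any optdelivered packet ahead of pkt was delivered too early.
            while self.suspicious and self.suspicious[0] != pkt:
                wrong = self.suspicious.popleft()
                self.app.revoke(wrong)
                self.pending.append(wrong)
            if self.suspicious:
                self.suspicious.popleft()   # optimistic guess was right
            else:
                self.pending.remove(pkt)    # deliver held packet in true order
            self.app.confirm(pkt)
        if self.last_optdelivered not in self.suspicious:
            self.last_optdelivered = None   # it was confirmed or revoked
```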
PLATO Design

[Animation: a worked example over time t. A is optdelivered, then E with T_E − T_A > Δ, then B and D; the suspicious queue holds these optdelivered, unconfirmed packets. C arrives with T_C − T_D < Δ, triggering revoke(D), setsuspect(D), and setsuspect(C); C and D move to the pending queue (underlined packets in pending are suspected). The sequencer's message then announces the order A, B, C, D: E was optdelivered out of order, so the node issues revoke(E) and setsuspect(E), followed by confirm(A, B, C, D). E stays in pending until the sequencer orders it.]
Performance

[Graph: delivery latency vs. Δ for Fixed Sequencer and PLATO.]

At small values of Δ, PLATO delivers with very low latency but incurs more rollbacks
Performance

[Graph: delivery latency vs. throughput for Fixed Sequencer and PLATO.]

The latency of both Fixed Sequencer and PLATO decreases as throughput increases
Performance

[Graph: delivery latency over a traffic spike.]

Traffic spike: PLATO's latency is insensitive to the data rate, while the Fixed Sequencer's latency depends on it
Performance

[Graph: delivery latency with adaptive Δ.]

- Δ is varied adaptively in reaction to rollbacks (a sketch follows below)
- Latency is as good as with static Δ parameterization
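
The slides do not spell out the adaptation rule; here is one plausible sketch, an assumption in the spirit of multiplicative-increase controllers rather than the paper's actual algorithm: grow Δ on every rollback, and shrink it slowly while optimistic deliveries keep getting confirmed.

```python
class AdaptiveDelta:
    """Illustrative controller for the suspicion threshold Delta.
    The increase/decrease rule is an assumption; the slides say only
    that Delta adapts in reaction to rollbacks."""

    def __init__(self, delta_us=128.0, lo_us=16.0, hi_us=2048.0):
        self.delta_us = delta_us
        self.lo_us, self.hi_us = lo_us, hi_us

    def on_rollback(self):
        # A rollback means the heuristic guessed wrong: suspect more pairs.
        self.delta_us = min(self.delta_us * 2.0, self.hi_us)

    def on_confirmed(self):
        # A confirmed optimistic delivery: shrink Delta gradually so fewer
        # packets are held back waiting for the sequencer.
        self.delta_us = max(self.delta_us * 0.999, self.lo_us)
```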
Conclusion

- First optimistic total order protocol that predicts out-of-order delivery
- Slashes ordering latency in datacenter settings
- Stable at varying loads
- The ordering layer of a time-critical protocol stack for datacenters