PLATO: Predictive Latency-Aware Total Ordering
  Mahesh Balakrishnan, Ken Birman, Amar Phanishayee

Total Ordering
  a.k.a. Atomic Broadcast: delivering messages to a set of nodes in the same order
  Messages arrive at nodes in different orders…
  Nodes agree on a single delivery order
  Messages are delivered at nodes in the agreed order

Modern Datacenters
  Applications
    E-tailers, Finance, Aerospace
    Service-Oriented Architectures, Publish-Subscribe, Distributed Objects, Event Notification…
    … Totally Ordered Multicast!
  Hardware
    Fast high-capacity networks
    Failure-prone commodity nodes

Total Ordering in a Datacenter
  [Figure: a replicated Inventory Service (Replica 1, Replica 2) receiving queries and totally ordered updates (Update 1, Update 2)]
  Updates are Totally Ordered
  Totally Ordered Multicast is used to consistently update Replicated Services
  Latency of Multicast System Consistency
  Requirement: order multicasts consistently, rapidly, robustly

Multicast Wishlist
  Low Latency!
  High (stable) throughput
  Leverage hardware properties
  Minimal, proactive overheads
  HW Multicast/Broadcast is fast, unreliable
  Handle varying data rates
    Datacenter workloads have sharp spikes… and extended troughs!

State-of-the-Art
  Traditional Protocols
    Conservative
    Latency-Overhead tradeoff
    Example: Fixed Sequencer
    Simple, works well
  Optimistic Total Ordering
    Deliver optimistically, rollback if incorrect
    Why this works: no out-of-order arrival in LANs
  Optimistic total ordering for datacenters?

PLATO: Predictive Ordering
  In a datacenter, broadcast / multicast occurs almost instantaneously
  Most of the time, messages arrive in the same order at all nodes.
  Some of the time, messages arrive in different orders at different nodes.
  Can we predict out-of-order arrival?

Reasons for Disorder: Swaps
  [Figure: Sender 1 and Sender 2 multicast through separate switches to Receiver 1 and Receiver 2; one receiver gets Sender 2's message after Sender 1's, the other gets Sender 1's message after Sender 2's]
  Typical Datacenter Diameter: 50-500 microseconds
  Out-of-order arrival can occur when the inter-send interval between two messages is smaller than the diameter of the network

Reasons for Disorder: Loss
  Datacenter networks are overprovisioned
    Loss never occurs in the network
  Datacenter nodes are cheap
    Loss occurs due to end-host buffer overflows caused by CPU contention
  [Figure: timeline of packets A through H showing the order of arrivals into user-space]

The Utah Emulab Testbed
  [Figure: topology with Cisco 6509 and Cisco 6513 switches, 100 Mb / 1 Gb / 4 Gb / 8 Gb links, and nodes from 600 MHz to 3 GHz]
  Emulab2 test scenario: 2 switches of separation
    One-way ping latency: ~100 microseconds
  Emulab3 test scenario: 3 switches of separation
    One-way ping latency: ~110 microseconds

The Cornell Testbed
  [Figure: topology with HP Procurve 6108 and HP Procurve 4000M switches, 100 Mb and 1 Gb links, and 1.3 GHz nodes]
  Cornell3 test scenario: 3 switches of separation
    One-way ping latency: ~70 microseconds
  Cornell5 test scenario: 5 switches of separation
    One-way ping latency: ~110 microseconds

Disorder: Emulab3
  Percentage of swaps and losses goes up with data rate
  At 2800 packets per sec, 2% of all packet pairs are swapped and 0.5% of packets are lost.

Predicting Disorder
  Predictor: Inter-arrival time of consecutive packets into user-space
  Why?
    Swaps: simultaneous multicasts → low inter-arrival time
    Loss: kernel buffer overflow → sequence of low inter-arrival times

Predicting Disorder
  [Figure: inter-arrival time of swaps compared with inter-arrival time of all pairs; Cornell Datacenter, 400 multicasts/sec]
  95% of swaps and 14% of all pairs are within 128 µsecs

PLATO Design
  Heuristic: if two packets arrive within Δ µsecs, possibility of disorder
  PLATO: Heuristic + Lazy Fixed Sequencer
    Heuristic works → ~zero (Δ) latency
    Heuristic fails → fixed sequencer latency

PLATO Design
  API: optdeliver, confirm, revoke
  Ordering Layer:
    Pending Queue: packets suspected to be out-of-order, or queued behind suspected packets
    Suspicious Queue: packets optdelivered to the application, not yet confirmed

PLATO Design
  [Figure: timeline of the pending and suspicious queues. Labels shown: optdeliver(A), optdeliver(E), optdeliver(B), optdeliver(D); TE-TA > DELTA; TC-TD < DELTA with revoke(D), setsuspect(D), setsuspect(C); sequencer message "Order: ABCD"; revoke(E), setsuspect(E); confirm(A, B, C, D). Underlined packets in pending are suspected.]

Performance
  [Figure: Fixed Sequencer vs. PLATO]
  At small values of Δ, very low latency of delivery but more rollbacks

Performance
  Latency of both Fixed Sequencer and PLATO decreases as throughput increases

Performance
  Traffic Spike: PLATO is insensitive to data rate, while Fixed Sequencer depends on data rate

Performance
  Latency is as good as static Δ parameterization
  Δ is varied adaptively in reaction to rollbacks

Conclusion
  First optimistic total order protocol that predicts out-of-order delivery
  Slashes ordering latency in datacenter settings
  Stable at varying loads
  Ordering layer of a time-critical protocol stack for Datacenters
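
Sketch: the Δ heuristic and the two queues

The PLATO Design slides describe a heuristic (two packets arriving within Δ µsecs may be disordered) feeding a pending queue and a suspicious queue. The Python below is a minimal sketch of that receive path only, not the authors' implementation. The class name `PredictiveOrderer`, the method names, and the `events` log are hypothetical. The step that revokes an already optdelivered packet when its successor arrives within Δ is one reading of the revoke(D) / setsuspect(D) / setsuspect(C) labels in the design figure. Resolving the pending queue from the sequencer's order message is deliberately left out, since the slides do not spell it out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PredictiveOrderer:
    """Hypothetical receive-side sketch of the delta heuristic."""
    delta_us: float
    # Packets suspected to be out of order, or queued behind suspected ones.
    pending: List[str] = field(default_factory=list)
    # Packets optdelivered to the application but not yet confirmed.
    suspicious: List[str] = field(default_factory=list)
    # Log of upcalls made to the application: (kind, message).
    events: List[Tuple[str, str]] = field(default_factory=list)
    last_arrival_us: Optional[float] = None
    last_msg: Optional[str] = None

    def on_packet(self, msg: str, arrival_us: float) -> None:
        # Heuristic: a small gap to the previous packet predicts disorder.
        close = (self.last_arrival_us is not None
                 and arrival_us - self.last_arrival_us < self.delta_us)
        if close and not self.pending:
            # Assumption: the previous packet is also suspect, so take it
            # back if it was optdelivered and is still unconfirmed.
            if self.suspicious and self.suspicious[-1] == self.last_msg:
                prev = self.suspicious.pop()
                self.events.append(("revoke", prev))
                self.pending.append(prev)
        if close or self.pending:
            # Suspected, or queued behind a suspected packet.
            self.pending.append(msg)
        else:
            self.suspicious.append(msg)
            self.events.append(("optdeliver", msg))
        self.last_arrival_us = arrival_us
        self.last_msg = msg

    def confirm(self, msgs: List[str]) -> None:
        # Called once the agreed order is known to match for these packets.
        for m in msgs:
            if m in self.suspicious:
                self.suspicious.remove(m)
                self.events.append(("confirm", m))


orderer = PredictiveOrderer(delta_us=128)
for msg, t in [("A", 0), ("B", 1000), ("D", 2000), ("C", 2050), ("E", 5000)]:
    orderer.on_packet(msg, t)
print(orderer.events)
print("pending:", orderer.pending, "suspicious:", orderer.suspicious)
```

In this run A, B, and D are optdelivered immediately because each arrives more than Δ after its predecessor. C arrives 50 µsecs after D, so D is revoked and both wait in pending; E arrives well apart but is queued behind them. A and B stay in the suspicious queue until `confirm` is called.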