Mahesh Balakrishnan
Ken Birman
Cornell University
COTS Datacenters
Online e-tailers, search engines, corporate applications
Web-services
Mission-Critical Apps
Need: Scalability, Availability, Fault-Tolerance
… Timeliness!
Migrating time-critical applications to commodity datacenters…
… conversely, providing datacenter webservices with time-critical performance.
Not ‘real time’, but ‘real fast’!
Financial calculators, military command and control… air traffic control (ATC)
… foobooks.com!
Technology Gap: Real-Time focuses on determinism, scale-up architectures
Mid-to-late 1990s
Teams of 3-5 air traffic controllers on a cluster of desktop consoles
50-200 of these console clusters in an air traffic control center
Why study the French ATC?
Radar Image
Weather Alert
Track Updates
Updates to Flight Plans
Console to Console State Updates
System Management and Monitoring
ATC center to center Updates
Multicast ubiquitous…
Virtually Synchronous Multicast: very reliable, not particularly fast
Unreliable Multicast: very fast, not particularly reliable
Nothing in between!
Category 1: Complete reliability (virtual synchrony), e.g., routing decisions
Category 2: Careful application design + natural hardware properties + management policies, e.g., radar
Engineering Lessons:
Structure application to tolerate partial failures
Exploit natural hardware properties
Can we generalize to modern systems?
Research Direction: Time-Critical Reliability
Can we design communication primitives that encapsulate these lessons?
Updates multicast to whole group RACS
Queries unicast to single nodes
An Amazon web-page is constructed by
100s of co-operating services*
Multicast is used for:
Updating Cloned Services
Publish-Subscribe / Eventing
Datacenter Management/Monitoring
* Werner Vogels, CTO of amazon.com, at SOSP 2005
A node is in many multicast groups:
One for each service it hosts
One for each topic it subscribes to
One or more administration groups
Large Numbers of Overlapping Groups!
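A minimal sketch of how a node's group memberships add up and overlap with those of other nodes. The function name, the group-naming scheme, and the example services and topics are all hypothetical, chosen only for illustration.

```python
def groups_for(services, topics, admin_groups):
    """Return the set of multicast groups a node joins (hypothetical naming)."""
    return (
        {f"svc:{s}" for s in services}          # one group per hosted service
        | {f"topic:{t}" for t in topics}        # one group per subscribed topic
        | {f"admin:{a}" for a in admin_groups}  # one or more administration groups
    )

# Two example nodes with made-up services and topics.
node_a = groups_for(["inventory", "shipping"], ["price-updates"], ["rack-12"])
node_b = groups_for(["inventory"], ["price-updates", "alerts"], ["rack-12"])

# Groups that both nodes belong to: this is the overlap the slide refers to.
shared = node_a & node_b
```

Even with two nodes and a handful of services, most groups are shared; with hundreds of services per page the overlap grows quickly.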
User History Service
Product Popularity Service
Store Inventory
Shipping
Scheduler
User Profile
Data
Data Store Services: stale data can result in overselling or underselling, a loss of real-world dollars
Cache Services: updated periodically by back-end data stores
Product Recommendations
Datacenter Blades are failure-prone:
Crash failures
Byzantine behavior
Bursty Packet Loss:
End-host kernels drop packets when subjected to traffic spikes.
Rapid delivery is more important than perfect reliability
Probabilistic Timeliness
Graceful Degradation
Wanted: a multicast primitive that
1. Scales to large numbers of arbitrarily overlapping multicast groups
2. Delivers multicasts quickly
3. Tolerates datacenter failure modes: bursty packet loss, node failures
4. Offers probabilistic properties
5. ‘Gives up’ on lost data after a threshold period
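The ‘gives up’ requirement can be sketched as a small loss tracker that stops waiting for a packet once a deadline has passed. This is an illustration under assumptions, not the Ricochet implementation: the class name, its methods, and the 0.2-second threshold are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class LossTracker:
    """Tracks missing sequence numbers and abandons them after a deadline."""

    def __init__(self, give_up_after):
        self.give_up_after = give_up_after  # seconds (assumed unit)
        self.pending = {}                   # seq -> time the loss was first noticed

    def note_missing(self, seq, now):
        # Keep the earliest time at which the gap was observed.
        self.pending.setdefault(seq, now)

    def note_recovered(self, seq):
        # Recovery arrived in time; stop tracking this packet.
        self.pending.pop(seq, None)

    def expire(self, now):
        """Return the sequence numbers abandoned at `now` and forget them."""
        dead = sorted(s for s, t in self.pending.items()
                      if now - t >= self.give_up_after)
        for s in dead:
            del self.pending[s]
        return dead

tracker = LossTracker(give_up_after=0.2)   # threshold value is an assumption
tracker.note_missing(7, now=0.0)
tracker.note_missing(8, now=0.15)
early = tracker.expire(now=0.1)    # nothing has waited long enough yet
later = tracker.expire(now=0.25)   # packet 7 has waited 0.25 s and is dropped
```

Bounding the wait this way keeps a lost packet from stalling delivery of everything behind it, which is the trade the slide describes: rapid delivery over perfect reliability.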
Ricochet: Lateral Error Correction
Receivers exchange error correction XORs of multicast traffic
Works very well with multiple groups: scales up to a thousand groups per node
Probabilistic Timeliness: probability distribution of delivery latencies
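The XOR idea behind lateral error correction can be sketched in a few lines: one receiver XORs several packets it has received into a repair packet, and a peer missing exactly one of them recovers it by XORing the repair with the packets it does hold. This is a simplified illustration, not Ricochet's actual packet format or peer-selection logic; the function names and sample payloads are hypothetical, and payloads are assumed to be of equal length.

```python
from functools import reduce
from typing import Dict, FrozenSet, Optional, Tuple

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings, zero-padding the shorter one."""
    n = max(len(a), len(b))
    a, b = a.ljust(n, b"\0"), b.ljust(n, b"\0")
    return bytes(x ^ y for x, y in zip(a, b))

def make_repair(packets: Dict[int, bytes]) -> Tuple[FrozenSet[int], bytes]:
    """Build a repair packet: the ids it covers and the XOR of their payloads."""
    return frozenset(packets), reduce(xor_bytes, packets.values(), b"")

def recover(received: Dict[int, bytes],
            repair: Tuple[FrozenSet[int], bytes]) -> Optional[Tuple[int, bytes]]:
    """Recover the single missing packet covered by `repair`, if there is one."""
    ids, payload = repair
    missing = [i for i in ids if i not in received]
    if len(missing) != 1:
        return None  # nothing to repair, or too many losses for one XOR
    for i in ids:
        if i in received:
            payload = xor_bytes(payload, received[i])
    return missing[0], payload

# Receiver A holds packets 1-3 and sends a repair to receiver B, which lost packet 2.
receiver_a = {1: b"tick-AAPL", 2: b"tick-MSFT", 3: b"tick-GOOG"}
repair = make_repair(receiver_a)
receiver_b = {1: b"tick-AAPL", 3: b"tick-GOOG"}
result = recover(receiver_b, repair)
```

A single XOR repairs only one loss among the packets it covers, so a burst that hits several of them at one receiver is not recoverable from that repair alone.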
Delivers messages to applications with no ordering delay in most cases
Orders messages only if there is a high probability of out-of-order delivery across different nodes
Probabilistic Timeliness: probability distribution of ordered delivery latency
SRM takes seconds to recover lost packets
Ricochet recovers almost all packets within ~70 milliseconds
Moving from real-time (R/T) to time-critical (T/C) yields huge benefits!
Ricochet is faster… slashes latency… scalable…
A clean delivery-delay curve is a powerful design tool, replacing traditional hard (but conservative) limits
We’re open for business:
Software and detailed paper available for download
Give it a try… tell us what you think!
www.cs.cornell.edu/projects/quicksilver/ricochet.html