Fast Crash Recovery in RAMCloud

Motivation
• The role of DRAM has been increasing
  – Facebook used 150 TB of DRAM for 200 TB of disk storage
• However, there are limitations
  – DRAM is typically used as a cache
    » Need to worry about consistency and cache misses

RAMCloud
• Keeps all data in DRAM at all times
  – Designed to scale to thousands of servers
  – Hosting terabytes of data
  – Provides low latency (5-10 µs) for small reads
• Design goals
  – High durability and availability
    » Without compromising performance

Alternatives
• Keep 3 replicas in DRAM
  – 3x the cost and energy
  – Still vulnerable to power failures
• RAMCloud keeps one copy in DRAM
  – Plus two copies on disk
• To achieve good availability
  – Fast crash recovery (64 GB in 1-2 seconds)

RAMCloud Basics
• Thousands of off-the-shelf servers
  – Each with 64 GB of RAM
  – With InfiniBand NICs
    » Remote access below 10 µs

Data Model
• Key-value store
• Tables of objects
  – Object = 64-bit ID + byte array (up to 1 MB) + 64-bit version number
• No atomic updates to multiple objects

System Structure
• A large number of storage servers; each server hosts
  – A master, which manages the objects in its local DRAM and services requests
  – A backup, which stores copies of objects from other masters on disk
• A coordinator
  – Manages configuration info and object locations
  – Not involved in most requests

RAMCloud Cluster Architecture
[figure: clients, a coordinator, and storage servers that each run a master and a backup]

More on the Coordinator
• Maps objects to servers in units of tablets
  – A tablet holds a consecutive key range within a single table
• For locality reasons
  – Small tables are stored on a single server
  – Large tables are split across servers
• Clients can cache tablet locations to access servers directly

Log-structured Storage
• Logging approach (see the write-path sketch after the Cleanup slide)
  – Each master logs data in memory
  – Log entries are forwarded to backup servers
    » Backup servers buffer log entries in battery-backed RAM
  – Writes complete once all backup servers acknowledge
  – A backup server flushes its buffer when full
    » 8 MB segments are the unit of logging, buffering, and I/O
  – Each server can handle 300K 100-byte writes/sec

Recovery
• When a server crashes, its DRAM contents must be reconstructed
• A 1-2 second recovery time is good enough

Using Scale
• Simple 3-replica approach
  – Recovery is limited by the speed of three disks
  – 3.5 minutes to read 64 GB of data
• Scattering the data over 1,000 disks
  – Takes 0.6 seconds to read 64 GB
  – But a centralized recovery master becomes the bottleneck
  – A 10 Gbps network means about 1 minute to transfer 64 GB to the centralized master
• (These numbers are checked in the back-of-envelope model at the end of these notes)

RAMCloud
• Uses 100 recovery masters
  – Cuts the recovery time down to about 1 second

Scattering Log Segments
• Ideally uniform, but several practical concerns
  – Need to avoid correlated failures
  – Need to account for heterogeneous hardware
  – Need coordination so that buffers on individual machines do not overflow
  – Need to account for server membership changing as servers fail and join

Failure Detection
• Periodic pings to random servers
  – 99% chance of detecting a failed server within 5 rounds (see the probability check after the Cleanup slide)
• Recovery proceeds in three phases
  – Setup
  – Replay
  – Cleanup

Setup
• Coordinator finds log segment replicas
  – By querying all backup servers
• Detecting incomplete logs
  – Logs are self-describing
• Starting partition recoveries
  – Each master periodically uploads a "will" to the coordinator, to be executed in the event of its demise
  – The coordinator carries out the will accordingly

Replay
• Parallel recovery
• Six stages of pipelining
  – At segment granularity
  – The same ordering of operations is used on all segments to avoid pipeline stalls
• Only the primary replicas are involved in recovery

Cleanup
• Bring the recovered master online
• Free up segments from the previous crash
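To make the write path concrete, here is a minimal Python sketch of the logging approach described above. The Master/Backup classes and their methods are illustrative, not RAMCloud's actual C++ API, and for simplicity every segment goes to the same two backups, whereas RAMCloud scatters each new segment across different backups.

```python
SEGMENT_BYTES = 8 * 1024 * 1024   # 8 MB: the unit of logging, buffering, and I/O
DISK_REPLICAS = 2                 # one copy in DRAM plus two copies on backup disks

class Backup:
    """Buffers log data in battery-backed RAM; flushes full segments to disk."""
    def __init__(self):
        self.buffer = bytearray()

    def append(self, data: bytes) -> bool:
        self.buffer.extend(data)          # safe once in battery-backed RAM
        if len(self.buffer) >= SEGMENT_BYTES:
            self.flush_segment_to_disk()  # one large sequential 8 MB write
        return True                       # acknowledge to the master

    def flush_segment_to_disk(self):
        self.buffer = bytearray()         # placeholder for the real disk I/O

class Master:
    """Keeps all objects in DRAM; every write is appended to a replicated log."""
    def __init__(self, backups):
        self.objects = {}                 # the in-DRAM key-value store
        self.log = []                     # in-memory copy of the log
        self.backups = backups

    def write(self, key: bytes, value: bytes):
        entry = key + b"=" + value
        self.log.append(entry)            # 1. append to the master's in-memory log
        acks = [b.append(entry) for b in self.backups]
        assert all(acks)                  # 2. complete only after ALL backups ack
        self.objects[key] = value         # 3. apply to the DRAM store

m = Master([Backup() for _ in range(DISK_REPLICAS)])
m.write(b"k1", b"hello")
```

Because backups acknowledge from battery-backed buffers rather than after a disk write, the write path never waits on a disk, which is what makes the 300K small writes/sec per server plausible.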
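The "99% within 5 rounds" figure follows from a simple model: if each surviving server pings one uniformly random peer per round, a crashed server goes unpinged in a round with probability about (1 - 1/(N-1))^(N-1) ≈ 1/e. A quick check of this model (an assumption of these notes, not code from the paper):

```python
import math

def detection_probability(n_servers: int, rounds: int) -> float:
    """P(a crashed server is pinged at least once within `rounds` rounds),
    assuming each surviving server pings one uniformly random peer per round."""
    miss_per_round = (1 - 1 / (n_servers - 1)) ** (n_servers - 1)  # -> 1/e as n grows
    return 1 - miss_per_round ** rounds

for r in range(1, 6):
    print(r, round(detection_probability(1000, r), 4))
# Round 5 prints ~0.9933, matching the slide's "99% within 5 rounds".
print("large-n approximation:", 1 - math.exp(-5))   # ≈ 0.9933
```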
Consistency
• Exactly-once semantics
  – Implementation not yet complete
• ZooKeeper handles coordinator failures
  – A distributed configuration service
  – With its own replication

Additional Failure Modes
• Current focus
  – Recovering the DRAM contents of a single failed master
• Failed backup server
  – Need to determine which segment replicas were lost with that server
  – Re-replicate those lost segments across the remaining disks

Multiple Failures
• Multiple servers fail simultaneously
• Recover each failure independently
  – Some recoveries will involve secondary replicas
• Based on projection
  – With 5,000 servers, recovering 40 masters within a rack takes about 2 seconds
• Can't do much when many racks are blacked out

Cold Start
• Complete power outage
• Backups contact the coordinator as they reboot
• Need a quorum of backups before starting to reconstruct masters
• The current implementation does not perform cold starts

Evaluation
• 60-node cluster
• Each node
  – 16 GB RAM, 1 disk
  – InfiniBand (25 Gbps)
    » User-level apps can talk to the NICs directly, bypassing the kernel

Results
• Can recover lost data at 22 GB/s
• A crashed server with 35 GB of data
  – Can be recovered in 1.6 seconds
• Recovery time stays nearly flat from 1 to 20 recovery masters, each talking to 6 disks
• 60 recovery masters add only 10 ms to the recovery time

Results
• Fast recovery significantly reduces the risk of data loss
  – Assume a recovery time of 1 second
  – The risk of data loss for a 100,000-node cluster is 10⁻⁵ in one year
  – A 10x improvement in recovery time improves reliability by 1,000x
• Assumes independent failures

Theoretical Recovery Speed Limit
• Hard to be faster than a few hundred milliseconds
  – 150 ms to detect the failure
  – 100 ms to contact every backup
  – 100 ms to read a single segment from disk

Risks
• The scalability study is based on a small cluster
• Performance glitches can be treated as failures
  – Triggering unnecessary recoveries
• Access patterns can change dynamically
  – May lead to unbalanced load
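Finally, a back-of-envelope model that reproduces the "Using Scale" numbers from earlier in these notes. The 100 MB/s per-disk bandwidth and the max(disk, network) bottleneck structure are assumptions of this sketch, not figures from the paper:

```python
def recovery_seconds(data_gb: float, disks: int, masters: int,
                     disk_mb_per_s: float = 100, nic_gbps: float = 10) -> float:
    """Recovery time as the slower of (a) reading all segment replicas from
    disk in parallel and (b) streaming them through the recovery masters' NICs."""
    disk_time = data_gb * 1e9 / (disks * disk_mb_per_s * 1e6)
    net_time = data_gb * 8 / (masters * nic_gbps)   # GB -> Gbit, then / Gbps
    return max(disk_time, net_time)

print(recovery_seconds(64, disks=3, masters=1))      # ~213 s: "3.5 minutes to read 64 GB"
print(recovery_seconds(64, disks=1000, masters=1))   # ~51 s: one NIC, "~1 min to transfer 64 GB"
print(recovery_seconds(64, disks=1000, masters=100)) # ~0.64 s: disks again the limit, "~1 second"
```

The model ignores replay CPU cost and coordination overhead, which is presumably why the measured end-to-end times (about 1 second for 64 GB, and 1.6 s for 35 GB on the 60-node cluster) sit slightly above these raw bandwidth floors.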