Fast Crash Recovery in RAMCloud
Motivation
• The role of DRAM has been increasing
– Facebook used 150TB of DRAM
• To cache 200TB of disk storage
• However, there are limitations
– DRAM is typically used only as a cache
• Must worry about consistency and cache misses
RAMCloud
• Keeps all data in RAM at all times
– Designed to scale to thousands of servers
– To host terabytes of data
– Provides low latency (5-10 µs) for small reads
• Design goals
– High durability and availability
• Without compromising performance
Alternatives
• Keeping 3 replicas in DRAM
– 3x cost and energy
– Still lost on power failure (DRAM is volatile)
• RAMCloud keeps one copy in RAM
– Plus two copies on disk
• To achieve good availability
– Fast crash recovery (64GB in 1-2 seconds)
RAMCloud Basics
• Thousands of off-the-shelf servers
– Each with 64GB of RAM
– With Infiniband NICs
• Remote access below 10 µs
Data Model
• Key-value store
• Tables of objects
– Object
• 64-bit ID + byte array (up to 1MB) + 64-bit version number
• No atomic updates to multiple objects
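A minimal sketch of this data model in Python; the class and method names are my own, not RAMCloud's actual API:

```python
from dataclasses import dataclass

@dataclass
class Object:
    """One object: 64-bit ID, up-to-1MB byte array, 64-bit version."""
    object_id: int
    value: bytes
    version: int

class Table:
    """A table of objects; no atomic updates across multiple objects."""
    def __init__(self):
        self._objects = {}

    def write(self, object_id, value):
        assert len(value) <= 1 << 20, "values are limited to 1MB"
        old = self._objects.get(object_id)
        version = old.version + 1 if old else 1   # bump version on every write
        self._objects[object_id] = Object(object_id, value, version)
        return version

    def read(self, object_id):
        return self._objects[object_id]

t = Table()
t.write(42, b"hello")
print(t.read(42))   # Object(object_id=42, value=b'hello', version=1)
```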
System Structure
• A large number of storage servers
– Each server hosts
• A master, which manages objects in local DRAM and services requests
• A backup, which stores copies of objects from other masters on disk
• A coordinator
– Manages config info and object locations
– Not involved in most requests
RAMCloud Cluster Architecture
[Figure: clients send requests to masters; the coordinator tracks object locations; each storage server runs a master and a backup]
More on the Coordinator
• Maps objects to servers in units of tablets
– Tablets hold consecutive key ranges within a single table
• For locality reasons
– Small tables are stored on a single server
– Large tables are split across servers
• Clients cache tablet mappings so they can access servers directly
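A sketch of how a client-side tablet cache could resolve a key to a server; the map contents and helper names here are hypothetical:

```python
import bisect

# Hypothetical cached tablet map for one table: non-overlapping key ranges,
# sorted by start key. A small table would be a single tablet on one server.
TABLETS = [
    (0,       1 << 20, "server-A"),   # keys [0, 2^20)
    (1 << 20, 1 << 40, "server-B"),   # keys [2^20, 2^40)
    (1 << 40, 1 << 64, "server-C"),   # keys [2^40, 2^64)
]
STARTS = [start for start, _, _ in TABLETS]

def locate(key):
    """Find the server owning `key` from the client-side cache; on a miss
    or stale entry a real client would re-fetch from the coordinator."""
    i = bisect.bisect_right(STARTS, key) - 1
    start, end, server = TABLETS[i]
    assert start <= key < end, "stale cache: ask the coordinator"
    return server

print(locate(42), locate(1 << 30))   # server-A server-B
```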
Log-structured Storage
• Logging approach
– Each master logs data in memory
• Log entries are forwarded to backup servers
– Backup servers buffer log entries
» Battery-backed
• Writes complete once all backup servers acknowledge
• A backup server flushes its buffer when full
– 8MB segments for logging, buffering, and I/Os
– Each server can handle 300K 100-byte writes/sec
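A minimal sketch of this write path, assuming the two disk replicas from the earlier slide; the classes are illustrative, not RAMCloud's code:

```python
SEGMENT_BYTES = 8 * 1024 * 1024     # 8MB segments, as on the slide

class Backup:
    """Buffers incoming log entries (battery-backed in the real system)
    and flushes a full 8MB segment to disk as one sequential I/O."""
    def __init__(self):
        self.buffer = bytearray()
        self.disk = []              # stand-in for on-disk segments

    def append(self, entry):
        self.buffer.extend(entry)
        if len(self.buffer) >= SEGMENT_BYTES:
            self.disk.append(bytes(self.buffer))
            self.buffer = bytearray()
        return True                 # acknowledge once buffered

class Master:
    """Keeps the only DRAM copy; a write completes only after every
    backup replica has acknowledged the log entry."""
    def __init__(self, backups):
        self.log = []
        self.backups = backups

    def write(self, entry):
        self.log.append(entry)
        assert all(b.append(entry) for b in self.backups)

m = Master([Backup(), Backup()])
m.write(b"key=value")               # returns once both backups have acked
```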
Recovery
• When a server crashes, its DRAM content
must be reconstructed
• 1-2 second recovery time is good enough
Using Scale
• Simple 3-replica approach
– Recovery limited by the speed of three disks
– 3.5 minutes to read 64GB of data
• Scattering segments over 1,000 disks
– Takes ~0.6 seconds to read 64GB
– But a centralized recovery master becomes the bottleneck
– A 10 Gbps network means ~1 min to transfer 64GB of data to the centralized master
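The arithmetic behind these numbers, assuming roughly 100 MB/s of sequential bandwidth per disk (the last line anticipates the next slide's 100 recovery masters):

```python
GB = 1e9
data = 64 * GB
disk_bw = 100e6                     # ~100 MB/s per disk (assumed)
net_bw = 10e9 / 8                   # 10 Gbps NIC, in bytes/sec

print(data / (3 * disk_bw))         # 3 disks:      ~213 s  (~3.5 minutes)
print(data / (1000 * disk_bw))      # 1,000 disks:  ~0.64 s
print(data / net_bw)                # one NIC:      ~51 s   (~1 minute)
print(data / 100 / net_bw)          # 100 recovery masters: ~0.5 s each
```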
RAMCloud
• Uses 100 recovery masters
– Cuts the time down to 1 second
Scattering Log Segments
• Ideally uniform, but several details matter (see the placement sketch below)
– Need to avoid correlated failures
– Need to account for heterogeneity of hardware
– Need to coordinate placement so buffers on individual machines do not overflow
– Need to account for changing server membership due to failures
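The paper handles these with a combination of randomization and refinement. A sketch of that idea, with illustrative field names and sample size:

```python
import random

def choose_backup(backups, used_racks, k=5):
    """Pick a backup for a new segment replica: sample k candidates at
    random, reject correlated failure domains and full buffers, then keep
    the candidate with the best expected disk speed (handles heterogeneous
    hardware). May return None; a real system would resample."""
    best = None
    for b in random.sample(backups, k):
        if b["rack"] in used_racks:             # avoid correlated failures
            continue
        if b["buffered"] >= b["buffer_cap"]:    # don't overflow its buffers
            continue
        if best is None or b["disk_mb_per_s"] > best["disk_mb_per_s"]:
            best = b
    return best

backups = [{"rack": i % 4, "buffered": 0, "buffer_cap": 8,
            "disk_mb_per_s": random.choice([80, 100, 140])}
           for i in range(20)]
print(choose_backup(backups, used_racks={0}))
```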
Failure Detection
• Periodic pings to random servers
– 99% chance of detecting a failed server within 5 rounds (a back-of-envelope check follows this list)
• Recovery proceeds in three phases
– Setup
– Replay
– Cleanup
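A back-of-envelope check of the 99%-in-5-rounds claim, assuming each live server pings one randomly chosen peer per round:

```python
import math

N = 1000        # cluster size; the answer barely depends on N once large

# Chance a crashed server is NOT pinged by any of the N-1 live servers
# in one round, each picking a random peer: (1 - 1/(N-1))**(N-1) ~ 1/e.
p_miss_round = (1 - 1 / (N - 1)) ** (N - 1)

for rounds in range(1, 6):
    print(rounds, 1 - p_miss_round ** rounds)
# After 5 rounds: ~0.993, i.e. the slide's ">99% within 5 rounds"
print(1 - math.exp(-5))   # limiting value as N grows: ~0.9933
```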
Setup
• Coordinator finds log segment replicas
– By querying all backup servers
• Detecting incomplete logs
– Logs are self-describing: each segment carries a digest of the log's segment IDs (sketched after this list)
• Starting partition recoveries
– Each master periodically uploads a "will" to the coordinator, specifying how its tablets should be partitioned among recovery masters in the event of its demise
– On a crash, the coordinator carries out the will
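A sketch of the completeness check this enables, assuming each segment's digest lists every segment ID in the log at the time it was written:

```python
def log_is_complete(found):
    """`found` maps segment_id -> that segment's digest (the IDs of all
    segments in the log when it was written), gathered by querying every
    backup. The newest segment's digest names the whole log, so the log is
    complete iff replicas of all of those segments were found. (If the head
    segment itself is missing, the real system relies on ordering rules for
    opening/closing segments, omitted here.)"""
    if not found:
        return False
    head = max(found)                    # highest-numbered segment found
    return set(found[head]) <= set(found)

print(log_is_complete({1: [1], 2: [1, 2], 3: [1, 2, 3]}))   # True
print(log_is_complete({1: [1], 3: [1, 2, 3]}))              # False: 2 lost
```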
Replay
• Parallel recovery
• Replay is pipelined in six stages
– At segment granularity
– Segments are processed in the same order at every stage to avoid pipeline stalls
• Only primary replicas are involved in recovery (toy illustration below)
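A toy illustration of segment-granularity pipelining; the stage names are placeholders rather than the paper's exact six stages. Each stage runs in its own thread, so different segments occupy different stages concurrently, and a single global segment order prevents stalls:

```python
import queue, threading

STAGES = ["read", "verify", "divide", "transfer", "replay", "rereplicate"]

def run_pipeline(segment_ids):
    qs = [queue.Queue() for _ in range(len(STAGES) + 1)]

    def stage(q_in, q_out):
        while (seg := q_in.get()) is not None:
            q_out.put(seg)              # a real stage would do work here
        q_out.put(None)                 # pass the shutdown marker along

    threads = [threading.Thread(target=stage, args=(qs[i], qs[i + 1]))
               for i in range(len(STAGES))]
    for t in threads:
        t.start()
    for seg in sorted(segment_ids):     # same segment order at every stage
        qs[0].put(seg)
    qs[0].put(None)
    for t in threads:
        t.join()
    return [qs[-1].get() for _ in segment_ids]

print(run_pipeline([5, 2, 9]))          # -> [2, 5, 9]
```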
Cleanup
• Get master online
• Free up segments from the previous crash
Consistency
• Exactly-once semantics
• Implementation not yet complete
• ZooKeeper handles coordinator failures
– Distributed configuration service
– With its own replication
Additional Failure Modes
• Current focus
– Recovering DRAM content after a single master failure
• Failed backup server
– Need to determine which segment replicas were lost with the server (sketch below)
– Re-replicate those segments across the remaining disks
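A minimal sketch of the first step, finding the segments that lost a replica on the failed backup; the map layout is hypothetical:

```python
def lost_replicas(failed_backup, replica_map):
    """replica_map: segment_id -> list of backups holding its replicas.
    Returns the segments that must be re-replicated (from the master's
    DRAM copy) onto the remaining disks."""
    return [seg for seg, backups in replica_map.items()
            if failed_backup in backups]

# Example: segment 7 had a replica on the failed backup "b2".
print(lost_replicas("b2", {7: ["b1", "b2"], 9: ["b1", "b3"]}))   # -> [7]
```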
Multiple Failures
• Multiple servers fail simultaneously
• Recover each failure independently
– Some will involve secondary replicas
• Based on projections
– With 5,000 servers, recovering 40 masters within a failed rack would take about 2 seconds
• Can't do much when many racks are blacked out
Cold Start
• Complete power outage
• Backups contact the coordinator as they reboot
• Needs a quorum of backups before starting to reconstruct masters
• The current implementation does not perform cold starts
Evaluation
• 60-node cluster
• Each node
– 16GB RAM, 1 disk
– Infiniband (25 Gbps)
• User-level apps can talk to the NICs directly, bypassing the kernel
Results
• Can recover lost data at 22 GB/s
• A crashed server with 35GB of data
– Can be recovered in 1.6 seconds
• Recovery time stays nearly flat from 1 to 20 recovery masters, each using 6 disks
• 60 recovery masters add only 10 ms of recovery time
Results
• Fast recovery significantly reduces the risk of
data loss
– Assume a recovery time of 1 second
– For a 100,000-node cluster, the risk of data loss is about 10⁻⁵ per year
– A 10x improvement in recovery time improves reliability by about 1,000x
• Assumes independent failures
Theoretical Recovery Speed Limit
• Hard to go faster than a few hundred msec
– 150 msec to detect failure
– 100 msec to contact every backup
– 100 msec to read a single segment from disk
Risks
• Scalability study based on a small (60-node) cluster
• May treat performance glitches as failures
– Triggering unnecessary recoveries
• Access patterns can change dynamically
– May lead to unbalanced load