Consistency Guarantees and Snapshot Isolation
Marcos Aguilera, Mahesh Balakrishnan, Rama Kotla, Vijayan Prabhakaran, Doug Terry
MSR Silicon Valley

Goals
Develop a cloud storage system featuring
1. multiple consistency levels
   – requires one API to learn, one system to administer
   – handles diversity of requirements within and across applications
2. read-write transactions
   – with snapshot isolation
   – on replicated and partitioned data
3. consistency-based SLAs

Geo-Replication
[Figure: a datacenter holding the primary and secondaries, and a remote datacenter holding remote secondaries, with arrows labeled Write and Read.]

Client API
– Get (key)
– Put (key, object)
– BeginTx (consistency)
– EndTx ()
– BeginSession (consistency)
– EndSession ()
[Figure labels: Puts/Gets, Transaction, Session]

Transaction Properties
• Conventional transaction model
  – BeginTx … EndTx
• Atomic updates to multiple objects
• Multi-object reads from snapshots
• Even across partitions

Partitioned Data for Scalability
• Data partitioned by key range
• Each partition has its own primary and secondary servers

Key-range   Primary   Secondaries
A-F         S1        S2, S4
G-P         S2        S4, S5
Q-Z         S3        S1, S4, S5

Write Operations
• Writes performed at primary server(s)
  – May have different primaries for different objects
• Propagate to secondary servers eventually
  – Any gossip or anti-entropy protocol will do
• Have a commit timestamp, i.e. global order
  – And deterministic outcomes
• No write conflicts
=> All replicas converge towards a mutually consistent state

Versioned Data Store
• Store version history for each object
[Figure: version timelines for Object A and Object B; labels: V1, V2, V1, V3, V4, V2; axis: time]
• Can perform writes as soon as commit timestamp is known
  – need not perform writes in commit order
• Can eventually prune old versions

Per-Replica State
• Datastore = set of <key, value, timestamp>
• High-time = timestamp of latest received write transaction
  – Assumes transactions are received in order
  – May receive periodic null transactions
• Low-time = timestamp of most recent discarded object version

Read Operations
• Single-key Gets go to one server
• Multi-partition transactions may read from multiple servers
• Server(s) selected based on desired consistency
  – E.g. read from nearby server when possible
• Alternative: Broadcast operation to all servers
  – Take first response that is consistent enough

Read-Only Transactions
• Transaction assigned a read timestamp
• Read from snapshot at that time
  – See all write transactions committed before this time, and only those writes
• Consistency guarantee places constraints on read timestamp

Reads on Versioned Data Store
• Allows reads at any timestamp
  – Without placing constraints on write propagation
[Figure: version timelines for Object A and Object B; labels: V1, V2, V1, V3, V4, V2; a read timestamp marked on the time axis]
• Assuming no future transaction could be assigned a commit timestamp before the read timestamp

Selecting Read Timestamp

Guarantee              Read timestamp
Strong Consistency     now (or time of last committed write)
Eventual Consistency   any time
Consistent Prefix      any time
Bounded Staleness      any time within bound
Monotonic Reads        any time later than or equal to that of the previous read transaction in this session
Read My Writes         any time later than or equal to that of the previous write transaction in this session

(assuming in-order delivery of writes)

Acceptable Read Timestamps
[Figure: acceptable read-timestamp ranges for strong, read-my-writes, monotonic, bounded, causal, and eventual; axis: time, with labels 0 and BeginTx]

Selecting Read Timestamp
[Figure: low-time to high-time intervals for node A, node B, and node C on a time axis, with a read timestamp marked]

Read-Write Transactions
• Transaction assigned a read timestamp and a commit timestamp
• Use optimistic concurrency control
  – Old read timestamps increase the chance of abort
• Read from snapshot at read timestamp
  – With selected consistency guarantee
• Batch writes until commit
  – No undo needed
• Validate transaction at commit timestamp

Transaction Lifetime
[Figure: timeline of a Transaction inside a Session: Get(x) … Put(x, value). Annotations, in order: select read timestamp and perform Get; buffer Put; get commit timestamp, validate, and perform Puts]

Committing Write Transactions
Snapshot isolation =>
• Check that no object being written has a version between the transaction’s read timestamp and commit timestamp
Serializability =>
• Check that no object being read or written has a version between the transaction’s read timestamp and commit timestamp
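
Sketch: snapshot reads and commit validation

The following is a minimal Python sketch of the versioned store, snapshot reads, and the two commit-time checks described above. It is an illustration under stated assumptions, not the system's actual implementation. The names `Replica`, `Transaction`, `can_serve`, and `has_version_between` are hypothetical. It assumes a single replica acting as primary, integer timestamps supplied by the caller (the slides do not say how timestamps are assigned), that a snapshot at a read timestamp includes versions committed at or before it, and that a replica can serve any read timestamp between its low-time and high-time.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Per-replica state: versioned datastore plus low-time and high-time."""
    versions: dict = field(default_factory=dict)  # key -> {commit_ts: value}
    low_time: int = 0   # timestamp of most recent discarded object version
    high_time: int = 0  # timestamp of latest received write transaction

    def apply_write(self, key, value, commit_ts):
        # A version can be installed as soon as its commit timestamp is known.
        self.versions.setdefault(key, {})[commit_ts] = value
        # Assumption: write transactions are received in timestamp order.
        self.high_time = max(self.high_time, commit_ts)

    def can_serve(self, read_ts):
        # Assumption: a snapshot is readable iff low_time <= read_ts <= high_time.
        return self.low_time <= read_ts <= self.high_time

    def read(self, key, read_ts):
        # Newest version with commit_ts <= read_ts, or None if there is none.
        visible = [ts for ts in self.versions.get(key, {}) if ts <= read_ts]
        return self.versions[key][max(visible)] if visible else None

    def has_version_between(self, key, read_ts, commit_ts):
        # True if some version was committed strictly between the two timestamps.
        return any(read_ts < ts < commit_ts for ts in self.versions.get(key, {}))


class Transaction:
    """Optimistic read-write transaction: snapshot reads, buffered writes."""

    def __init__(self, replica, read_ts, serializable=False):
        self.replica = replica
        self.read_ts = read_ts
        self.serializable = serializable
        self.read_set = set()
        self.writes = {}  # buffered Puts, applied only at commit (no undo needed)

    def get(self, key):
        self.read_set.add(key)
        if key in self.writes:  # assumption: a transaction sees its own Puts
            return self.writes[key]
        return self.replica.read(key, self.read_ts)

    def put(self, key, value):
        self.writes[key] = value

    def commit(self, commit_ts):
        # Snapshot isolation validates the write set only;
        # serializability validates the read set as well.
        keys = set(self.writes)
        if self.serializable:
            keys |= self.read_set
        if any(self.replica.has_version_between(k, self.read_ts, commit_ts)
               for k in keys):
            return False  # abort: a conflicting version exists
        for key, value in self.writes.items():
            self.replica.apply_write(key, value, commit_ts)
        return True
```

For example, a transaction that reads x at read timestamp 2 and writes a different key still commits under snapshot isolation if another writer installs a new version of x at timestamp 3, but the same transaction aborts when created with `serializable=True`, because x is in its read set.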