Consistency Guarantees and Snapshot Isolation
Marcos Aguilera, Mahesh Balakrishnan,
Rama Kotla, Vijayan Prabhakaran,
Doug Terry
MSR Silicon Valley
Goals
Develop a cloud storage system featuring
1. multiple consistency levels
– requires one API to learn, one system to administer
– handles diversity of requirements within and across
applications
2. read-write transactions
– with snapshot isolation
– on replicated and partitioned data
3. consistency-based SLAs
Geo-Replication
[Figure: a datacenter holding the primary and secondaries, and a remote datacenter holding remote secondaries, with Write and Read arrows.]
Client API
• Puts/Gets
– Get (key)
– Put (key, object)
• Transaction
– BeginTx (consistency)
– EndTx ()
• Session
– BeginSession (consistency)
– EndSession ()
Transaction Properties
• Conventional transaction model
– BeginTx … EndTx
• Atomic updates to multiple objects
• Multi-object reads from snapshots
• Even across partitions
Partitioned Data for Scalability
• Data partitioned by key range
• Each partition has its own primary and
secondary servers
Key-range   Primary   Secondaries
A-F         S1        S2, S4
G-P         S2        S4, S5
Q-Z         S3        S1, S4, S5
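As a rough illustration of key-range partitioning, the sketch below routes a key to its partition using the table above. The `Partition` class and `partition_for` function are hypothetical names, and routing on the first letter of the key is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    first: str          # first letter of the key range, inclusive
    last: str           # last letter of the key range, inclusive
    primary: str        # server that accepts writes for this range
    secondaries: tuple  # servers that receive writes eventually


# The three partitions from the table above.
PARTITIONS = [
    Partition("A", "F", "S1", ("S2", "S4")),
    Partition("G", "P", "S2", ("S4", "S5")),
    Partition("Q", "Z", "S3", ("S1", "S4", "S5")),
]


def partition_for(key: str) -> Partition:
    """Return the partition whose key range covers the key's first letter."""
    initial = key[:1].upper()
    for p in PARTITIONS:
        if p.first <= initial <= p.last:
            return p
    raise KeyError(f"no partition covers key {key!r}")
```

A write to `"grape"` would go to `partition_for("grape").primary`, while a read could go to the primary or to any of the listed secondaries, depending on the desired consistency.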
Write Operations
• Writes performed at primary server(s)
– May have different primaries for different objects
• Propagate to secondary servers eventually
– Any gossip or anti-entropy protocol will do
• Have a commit timestamp, i.e. global order
– And deterministic outcomes
• No write conflicts
=> All replicas converge towards a mutually
consistent state
Versioned Data Store
• Store version history for each object
[Figure: version histories of Object A and Object B laid out along a time axis.]
• Can perform writes as soon as commit
timestamp is known
– need not perform writes in commit order
• Can eventually prune old versions
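A minimal sketch of a per-object version history, assuming numeric commit timestamps. The class and method names are hypothetical. It shows that a write can be applied as soon as its commit timestamp is known, even if it arrives out of commit order, and that old versions can be pruned later. The pruning policy here (keep the version still visible at the prune time) is an assumption, not something the slides specify.

```python
import bisect


class VersionedObject:
    """Version history of one object, kept sorted by commit timestamp."""

    def __init__(self):
        self._timestamps = []  # commit timestamps, ascending
        self._values = []      # values, aligned with _timestamps

    def write(self, commit_ts, value):
        # Insert at the sorted position, so arrival order does not matter.
        i = bisect.bisect_left(self._timestamps, commit_ts)
        self._timestamps.insert(i, commit_ts)
        self._values.insert(i, value)

    def prune_before(self, ts):
        # Keep the newest version at or before ts (still needed by reads
        # at ts) and drop everything older than it.
        i = bisect.bisect_right(self._timestamps, ts)
        keep_from = max(i - 1, 0)
        del self._timestamps[:keep_from]
        del self._values[:keep_from]

    def versions(self):
        return list(zip(self._timestamps, self._values))
```

For example, writing versions with timestamps 30, 10, 20 in that order still yields a history ordered 10, 20, 30.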
Per-Replica State
• Datastore = set of <key, value, timestamp>
• High-time = timestamp of latest received write
transaction
– Assumes transactions are received in order
– May receive periodic null transactions
• Low-time = timestamp of most recent
discarded object version
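The per-replica state above could be tracked roughly as follows. This is a sketch under assumptions: `ReplicaState`, `receive`, and `discard` are hypothetical names, timestamps are positive numbers, and values are hashable so that the datastore can be a Python set.

```python
class ReplicaState:
    """State one replica keeps: datastore, high-time, and low-time."""

    def __init__(self):
        self.datastore = set()  # set of (key, value, timestamp) triples
        self.high_time = 0      # timestamp of latest received write transaction
        self.low_time = 0       # timestamp of most recent discarded version

    def receive(self, timestamp, writes=()):
        """Apply a write transaction; empty `writes` is a null transaction."""
        if timestamp <= self.high_time:
            # The slides assume transactions are received in order.
            raise ValueError("transactions must arrive in timestamp order")
        for key, value in writes:
            self.datastore.add((key, value, timestamp))
        self.high_time = timestamp

    def discard(self, key, value, timestamp):
        """Drop one old version and advance low-time if needed."""
        self.datastore.remove((key, value, timestamp))
        self.low_time = max(self.low_time, timestamp)
```

A null transaction (`receive(ts)` with no writes) advances high-time without changing the datastore, which lets an idle replica show that it has seen everything up to `ts`.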
Read Operations
• Single-key Gets go to one server
• Multi-partition transactions may read from
multiple servers
• Server(s) selected based on desired
consistency
– E.g. read from nearby server when possible
• Alternative: Broadcast operation to all servers
– Take first response that is consistent enough
Read-Only Transactions
• Transaction assigned a read timestamp
• Read from snapshot at that time
– See all write transactions committed before this
time, and only those writes
• Consistency guarantee places constraints on
read timestamp
Reads on Versioned Data Store
• Allows reads at any timestamp
– Without placing constraints on write propagation
[Figure: version histories of Object A and Object B along a time axis, with a read timestamp marked.]
• Assuming no future transaction could be assigned
a commit timestamp before the read timestamp
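A snapshot read on a versioned store can be sketched as a lookup of the latest version at or before the read timestamp. The function name `read_at` and the example timestamps are hypothetical, and treating a version committed exactly at the read timestamp as visible is an assumption.

```python
import bisect


def read_at(history, read_ts):
    """Return the value visible at read_ts, or None if none existed yet.

    `history` is a list of (commit_ts, value) pairs sorted by commit_ts.
    """
    timestamps = [ts for ts, _ in history]
    # Index of the first version committed strictly after read_ts.
    i = bisect.bisect_right(timestamps, read_ts)
    return history[i - 1][1] if i else None


# Hypothetical histories for two objects.
object_a = [(10, "A.V1"), (40, "A.V2")]
object_b = [(20, "B.V1"), (30, "B.V2")]

# Both reads use the same timestamp, so they come from one snapshot.
snapshot = (read_at(object_a, 35), read_at(object_b, 35))
```

Because both objects are read at the same timestamp, the result reflects exactly the write transactions committed up to that time, regardless of the order in which the versions reached the replica.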
Selecting Read Timestamp
Guarantee              Read timestamp
Strong Consistency     now (or time of last committed write)
Eventual Consistency   any time
Consistent Prefix      any time
Bounded Staleness      any time within bound
Monotonic Reads        any time later or equal to that of previous read transaction in this session
Read My Writes         any time later or equal to that of previous write transaction in this session
(assuming in-order delivery of writes)
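The table can be read as a mapping from a guarantee to a range of acceptable read timestamps. The sketch below encodes that reading; the function name, the guarantee strings, and the parameters (`last_commit`, `bound`, `last_read`, `last_write`) are hypothetical, and timestamps are assumed to be non-negative numbers with 0 as the start of time.

```python
def acceptable_read_range(guarantee, now, *, last_commit=None, bound=None,
                          last_read=0, last_write=0):
    """Return (earliest, latest) acceptable read timestamps, inclusive."""
    if guarantee == "strong":
        # now, or the time of the last committed write if it is known.
        start = now if last_commit is None else last_commit
        return (start, now)
    if guarantee in ("eventual", "consistent_prefix"):
        return (0, now)                      # any time
    if guarantee == "bounded":
        return (max(0, now - bound), now)    # any time within the bound
    if guarantee == "monotonic_reads":
        return (last_read, now)              # not before the previous read
    if guarantee == "read_my_writes":
        return (last_write, now)             # not before the previous write
    raise ValueError(f"unknown guarantee: {guarantee}")
```

For instance, with `now=100` and a staleness bound of 30, bounded staleness accepts any read timestamp from 70 to 100, while strong consistency accepts only 100 unless the time of the last committed write is known.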
Acceptable Read Timestamps
[Figure: ranges of acceptable read timestamps on a time axis running from 0 to BeginTx, one range per guarantee: strong, read-my-writes, monotonic, bounded, causal, eventual.]
Selecting Read Timestamp
[Figure: nodes A, B, and C, each shown with its low and high timestamps on a time axis, with the selected read timestamp marked.]
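One way to read the figure: a node can serve a read timestamp that lies between its low-time and high-time, so the client picks a node whose interval overlaps the acceptable range. The sketch below follows that reading. `pick_node` is a hypothetical name, the exact boundary conditions (strictly after low, up to and including high) are an assumption, and preferring the freshest feasible timestamp is a design choice made for the example.

```python
def pick_node(nodes, earliest, latest):
    """Choose a node and read timestamp within [earliest, latest].

    `nodes` maps a node name to its (low, high) timestamps.
    Returns (name, read_ts) with the largest feasible read_ts, or None.
    """
    best = None
    for name, (low, high) in nodes.items():
        ts = min(latest, high)  # freshest timestamp this node could serve
        if ts >= earliest and ts > low:
            if best is None or ts > best[1]:
                best = (name, ts)
    return best


# Hypothetical low/high timestamps for three nodes.
nodes = {"A": (10, 50), "B": (30, 90), "C": (60, 120)}
```

With an acceptable range of 70 to 100, node A is too stale, node B could serve timestamp 90, and node C could serve 100, so C is chosen. If no node overlaps the range, the function returns `None` and the caller would have to wait or fall back to another server.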
Read-Write Transactions
• Transaction assigned a read timestamp and a
commit timestamp
• Use optimistic concurrency control
– Old read timestamps increase the chance of abort
• Read from snapshot at read timestamp
– With selected consistency guarantee
• Batch writes until commit
– No undo needed
• Validate transaction at commit timestamp
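The steps above can be sketched with a small in-memory store. Everything here is an illustrative assumption: the `Store` and `Transaction` classes, the integer logical clock, and a single-process store standing in for replicated servers. Validation uses the snapshot isolation rule (no written object has a version between the read and commit timestamps).

```python
class Store:
    """In-memory versioned store with a logical clock (illustrative only)."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), in commit order
        self.clock = 0      # last commit timestamp handed out

    def next_timestamp(self):
        self.clock += 1
        return self.clock

    def read(self, key, read_ts):
        visible = [v for ts, v in self.versions.get(key, []) if ts <= read_ts]
        return visible[-1] if visible else None


class Transaction:
    def __init__(self, store, read_ts):
        self.store = store
        self.read_ts = read_ts
        self.writes = {}  # buffered Puts, applied only on commit

    def get(self, key):
        if key in self.writes:  # see the transaction's own buffered writes
            return self.writes[key]
        return self.store.read(key, self.read_ts)

    def put(self, key, value):
        self.writes[key] = value  # buffer only; nothing to undo on abort

    def commit(self):
        commit_ts = self.store.next_timestamp()
        for key in self.writes:
            for ts, _ in self.store.versions.get(key, []):
                if self.read_ts < ts < commit_ts:
                    return False  # conflict: abort, dropping buffered writes
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True
```

If two transactions start from the same read timestamp and both write `x`, the first to commit succeeds and the second aborts, because a version of `x` now exists between its read and commit timestamps. An older read timestamp widens that window, which is why it raises the chance of abort.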
Transaction Lifetime
[Figure: timeline of a transaction within a session, annotated as follows.]
• Get(x): select read timestamp and perform Get
• Put(x, value): buffer Put
• Commit: get commit timestamp, validate, and perform Puts
Committing Write Transactions
Snapshot isolation =>
• Check that no object being written has a
version between the transaction’s read
timestamp and commit timestamp
Serializability =>
• Check that no object being read or written has
a version between the transaction’s read
timestamp and commit timestamp
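The two checks differ only in which objects are examined. A minimal sketch, assuming each key maps to the commit timestamps of its existing versions; the function name `validate` and its parameters are hypothetical.

```python
def validate(versions, read_set, write_set, read_ts, commit_ts,
             serializable=False):
    """Return True if the transaction may commit.

    `versions` maps a key to the commit timestamps of its versions.
    Snapshot isolation checks only written keys; serializability
    checks read keys as well.
    """
    keys = set(write_set)
    if serializable:
        keys |= set(read_set)
    for key in keys:
        for ts in versions.get(key, ()):
            if read_ts < ts < commit_ts:
                return False  # a conflicting version appeared in between
    return True


# Hypothetical example: the transaction reads y and writes x.
versions = {"x": [5, 20], "y": [8, 15]}
si_ok = validate(versions, {"y"}, {"x"}, read_ts=10, commit_ts=18)
ser_ok = validate(versions, {"y"}, {"x"}, read_ts=10, commit_ts=18,
                  serializable=True)
```

In this example the transaction passes under snapshot isolation, since `x` has no version between timestamps 10 and 18, but fails under serializability, since `y` was rewritten at timestamp 15 after the transaction read it.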