Causal Consistency Without Dependency Check Messages
Willy Zwaenepoel

INTRODUCTION

Geo-replicated data stores
• Geo-replicated data centers
• Full replication between data centers
• Data in each data center partitioned
[Figure: five geo-replicated data centers]

The Consistency Dilemma
Strong Consistency
• synchronous replication
• all replicas share the same consistent view
• sacrifice availability
Causal Consistency
• asynchronous replication
• all replicas eventually converge
• sacrifice consistency, but …
• … replication respects causality
Eventual Consistency
• asynchronous replication
• all replicas eventually converge
• sacrifice consistency

The cost of causal consistency
[Figure: throughput (Kops/sec, 0-700) vs. number of partitions (1-6) for eventual and causal consistency]

Can we close the throughput gap?
• The answer is: yes, but there is a price

STATE OF THE ART: WHY THE GAP?

How is causality enforced?
• Each update has associated dependencies
• What are dependencies?
– metadata to establish causality relations between operations
– used only for data replication
• Internal dependencies
– previous updates of the same client session
• External dependencies
– read updates of other client sessions

Internal dependencies
[Figure: Alice at the US data center, Bob at the Europe data center; example of 2 users performing operations at different data centers on the same partition]

External dependencies
[Figure: Alice, Bob, and Charlie at the US and Europe data centers; example of 3 users performing operations at data centers on the same partition]

How dependencies are tracked & checked
• In current implementations
– COPS [SOSP ’11], ChainReaction [Eurosys ’13], Eiger [NSDI ’13], Orbe [SOCC ’13]
• DepCheck(A) – “Do you have A installed yet?”
[Figure: a client issues Read(B); partitions 0 … N of the US and Europe data centers exchange DepCheck(A) and DepCheck(B) messages before the replicated updates become visible]
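To make this workflow concrete, here is a minimal Go sketch of conventional dependency tracking and checking. It is not the actual COPS, ChainReaction, Eiger, or Orbe code: the Dep, Update, and Partition types, the DepCheck method, and the key-to-partition mapping are illustrative assumptions. It shows the receiving data center holding back a replicated update until every listed dependency is confirmed installed at the partition that owns it.

```go
package main

import "fmt"

// Dep identifies a causally preceding update: a key and its version.
type Dep struct {
	Key     string
	Version uint64
}

// Update is what the origin data center ships to a remote replica.
type Update struct {
	Key     string
	Value   string
	Version uint64
	Deps    []Dep // internal + external dependencies gathered by the client session
}

// Partition is one shard of the receiving data center.
type Partition struct {
	installed map[Dep]bool
}

// DepCheck answers "do you have this update installed yet?".
// In a real system this is a remote message to another partition.
func (p *Partition) DepCheck(d Dep) bool {
	return p.installed[d]
}

// applyReplicated makes a replicated update visible only after every
// dependency is confirmed by the partition that owns it: one dependency
// check message per dependency.
func applyReplicated(u Update, local *Partition, all []*Partition) {
	for _, d := range u.Deps {
		owner := all[partitionFor(d.Key, len(all))]
		for !owner.DepCheck(d) {
			// A real system would block or retry until the dependency
			// has been installed; the loop is only illustrative.
		}
	}
	local.installed[Dep{Key: u.Key, Version: u.Version}] = true
	fmt.Printf("update %s@%d visible after %d dependency check(s)\n",
		u.Key, u.Version, len(u.Deps))
}

// partitionFor stands in for the data store's key-to-partition mapping.
func partitionFor(key string, n int) int {
	h := 0
	for _, c := range key {
		h = (h*31 + int(c)) % n
	}
	return h
}

func main() {
	all := []*Partition{
		{installed: map[Dep]bool{}},
		{installed: map[Dep]bool{}},
	}
	// Dependency A has already been installed at its partition.
	a := Dep{Key: "A", Version: 1}
	all[partitionFor("A", len(all))].installed[a] = true

	// Update B depends on A; the DepCheck succeeds, so B becomes visible.
	b := Update{Key: "B", Value: "v", Version: 1, Deps: []Dep{a}}
	applyReplicated(b, all[partitionFor("B", len(all))], all)
}
```

One DepCheck message is needed per dependency attached to an update, which is exactly the per-update messaging cost that the following slides quantify.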
Encoding of dependencies
• COPS [SOSP ’11], ChainReaction [Eurosys ’13], Eiger [NSDI ’13]
– “direct” dependencies
– Worst case: O( reads before a write )
• Orbe [SOCC ’13]
– dependency matrix
– Worst case: O( partitions )

The main issues
• Metadata size is considerable
– for both storage and communication
• Remote dependency checks are expensive
– multiple partitions are queried for each update

The cost of dependency check messages
[Figure: throughput (Kops/sec, 0-700) vs. number of partitions (1-6) for eventual consistency, causal consistency, and causal consistency without dependency check messages]

CAUSAL CONSISTENCY WITH 0/1 DEPENDENCY CHECK MESSAGES

Getting rid of external dependencies
• Partitions serve only fully replicated updates
– Replication Confirmation messages are broadcast periodically
• External dependencies are removed
– replication information implies dependency installation
• Internal dependencies are minimized
– we only track the previous write
– requires at most one remote check: zero if the write is local, one if it is remote

The new replication workflow
[Figure: Alice and Bob operating at the US, Europe, and Asia data centers; Replication Confirmation messages are broadcast periodically; example of 2 users performing operations at different data centers on the same partition]

Reading your own writes
• Clients need not wait for the replication confirmation
– they can see their own updates immediately
– other clients’ updates are visible once they are fully replicated
• Multiple logical update spaces
– Alice’s update space (visible to Alice)
– Bob’s update space (visible to Bob)
– …
– Replication update space (not yet visible)
– Global update space (fully visible)

The cost of causal consistency
[Figure: throughput (Kops/sec, 0-700) vs. number of partitions (1-6) for eventual consistency, 01-msg causal consistency, and conventional causal consistency]

The price paid: update visibility increased
• With the new implementation:
– ~ max( network latency from origin to furthest replica + network latency from furthest replica to destination + replication confirmation broadcast interval )
• With the conventional implementation:
– ~ network latency from origin to destination

CAUSAL CONSISTENCY WITHOUT DEPENDENCY CHECK MESSAGES ?!

Is it possible?
• Only make an update visible when one can locally determine that no causally preceding update will become visible later at another partition

How to do that?
• Encode causality by means of a Lamport clock
• Each partition maintains its own Lamport clock
• Each update is timestamped with the Lamport clock
• Update visible once update.timestamp ≤ min( Lamport clocks )
• Periodically compute the minimum (see the first sketch at the end of this section)

0-msg causal consistency: throughput
[Figure: throughput (Kops/sec, 0-900) vs. number of partitions (1-16) for eventual consistency, 0-msg causal consistency, and conventional causal consistency]

The price paid: update visibility increased
• With the new implementation:
– ~ max( network latency from origin to furthest replica + network latency from furthest replica to destination + interval of the minimum computation )
• With the conventional implementation:
– ~ network latency from origin to destination

How to deal with a “stagnant” Lamport clock?
• The Lamport clock stagnates if a partition receives no updates
• Combine
– the Lamport clock
– a loosely synchronized physical clock
– (easy to do; see the second sketch at the end of this section)

More on loosely synchronized physical clocks
• Periodically broadcast the clock
• Reduces update visibility latency to
– network latency from furthest replica to destination + maximum clock skew + clock broadcast interval
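To make the visibility rule from the "How to do that?" slide concrete, here is a minimal Go sketch. The type and function names (TimestampedUpdate, receiveReplicated, makeVisible) are assumptions for illustration, not the system's actual code: each partition keeps a Lamport clock, replicated updates are buffered, and a periodically computed minimum over the partitions' clocks decides which buffered updates may become visible, with no dependency check messages.

```go
package main

import "fmt"

// TimestampedUpdate carries a Lamport timestamp assigned at its origin partition.
type TimestampedUpdate struct {
	Key       string
	Value     string
	Timestamp uint64
}

// Partition holds its Lamport clock plus buffered (pending) and visible updates.
type Partition struct {
	clock   uint64
	pending []TimestampedUpdate // replicated but not yet visible
	visible []TimestampedUpdate
}

// localWrite timestamps a local update and advances the Lamport clock.
func (p *Partition) localWrite(key, value string) TimestampedUpdate {
	p.clock++
	u := TimestampedUpdate{Key: key, Value: value, Timestamp: p.clock}
	p.pending = append(p.pending, u)
	return u
}

// receiveReplicated buffers a remote update and moves the Lamport clock
// past its timestamp, as Lamport clocks require on message receipt.
func (p *Partition) receiveReplicated(u TimestampedUpdate) {
	if u.Timestamp > p.clock {
		p.clock = u.Timestamp
	}
	p.pending = append(p.pending, u)
}

// minClock is the periodic minimum computation over all partitions' clocks.
func minClock(ps []*Partition) uint64 {
	m := ps[0].clock
	for _, p := range ps[1:] {
		if p.clock < m {
			m = p.clock
		}
	}
	return m
}

// makeVisible applies the slide's rule: an update becomes visible once
// update.Timestamp <= min(Lamport clocks), computed periodically.
func makeVisible(ps []*Partition) {
	stable := minClock(ps)
	for _, p := range ps {
		var still []TimestampedUpdate
		for _, u := range p.pending {
			if u.Timestamp <= stable {
				p.visible = append(p.visible, u)
			} else {
				still = append(still, u)
			}
		}
		p.pending = still
	}
}

func main() {
	// Two partitions of one data center receive updates replicated from elsewhere.
	dc := []*Partition{{clock: 5}, {clock: 3}}
	dc[0].receiveReplicated(TimestampedUpdate{Key: "x", Value: "1", Timestamp: 4})
	dc[1].receiveReplicated(TimestampedUpdate{Key: "y", Value: "2", Timestamp: 7})

	makeVisible(dc)
	// min(Lamport clocks) = 5, so x@4 becomes visible while y@7 stays pending.
	fmt.Println("visible at partition 0:", dc[0].visible)
	fmt.Println("pending at partition 1:", dc[1].pending)
}
```

The visibility check is purely local: no partition asks any other partition whether a dependency has been installed.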
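And here is one possible way, again a hedged Go sketch rather than the system's actual implementation, to combine the Lamport clock with a loosely synchronized physical clock so that an idle partition does not hold the minimum (and hence update visibility) back. The HybridClock type and the physicalNow helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// HybridClock never runs behind either the local physical clock or any
// timestamp it has observed, so it keeps advancing even without updates.
type HybridClock struct {
	latest uint64 // last value handed out or observed
}

// physicalNow stands in for the loosely synchronized physical clock
// (e.g. NTP-disciplined wall time in microseconds).
func physicalNow() uint64 {
	return uint64(time.Now().UnixMicro())
}

// Now returns the current clock value without assigning a timestamp.
func (c *HybridClock) Now() uint64 {
	if p := physicalNow(); p > c.latest {
		c.latest = p
	}
	return c.latest
}

// Tick assigns a timestamp to a local update: strictly larger than anything
// seen so far, and at least the physical time.
func (c *HybridClock) Tick() uint64 {
	c.latest = maxU64(c.latest+1, physicalNow())
	return c.latest
}

// Observe merges a remote timestamp, as a Lamport clock does on receive.
func (c *HybridClock) Observe(remote uint64) {
	if remote > c.latest {
		c.latest = remote
	}
}

func maxU64(a, b uint64) uint64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	var idle, busy HybridClock
	t := busy.Tick() // busy partition timestamps an update
	idle.Observe(t)  // idle partition only observes it
	time.Sleep(2 * time.Millisecond)
	// Even without further updates, the idle partition's clock keeps moving,
	// so the periodic minimum (and hence update visibility) is not blocked.
	fmt.Println("idle partition's clock still advanced:", idle.Now() > t)
}
```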
Can we close the throughput gap?
• The answer is: yes, but there is a price
• The price is increased update visibility

Conclusion: Throughput, messages, latency

Method                   Throughput           # of dep. check messages                        Update visibility
Conventional             < Evt. consistency   O( reads since last write ) or O( partitions )  D
01-msg                   ~ Evt. consistency   0 or 1                                          ~ 2 Dmax
0-msg                    ~ Evt. consistency   0                                               ~ 2 Dmax
0-msg + physical clock   ~ Evt. consistency   0                                               ~ Dmax

(D: network latency from origin to destination; Dmax: network latency to the furthest replica)