Improving the Efficiency of Fault-Tolerant Distributed Shared-Memory Algorithms
Eli Sadovnik and Steven Homberg
Second Annual MIT PRIMES Conference, May 19-20, 2012

Introduction
• Shared memory supports concurrent access
  – Read & write interface
• Memory models: single writer, multiple reader (SWMR) and multiple writer, multiple reader (MWMR)
  – Consistency is important
• Strong consistency provides useful semantics
• Abstraction for message-passing networks
  – Shared memory can be emulated
  – Difficult to do, but solutions exist
  – Example: Internet applications such as Dropbox

Our Research Project
THE RAMBO PROJECT
• Framework for emulating shared memory
  – Introduced by Lynch and Shvartsman, extended by Gilbert
  – Implements the MWMR model with strong consistency
  – Designed for dynamic distributed message-passing settings
OUR GOAL
• RAMBO is elegant but not always efficient
• Extend RAMBO with intelligent data management

Consistency & Atomicity
• There are many consistency models
• We are interested in atomicity
[Figure: timelines of a write(8) and overlapping reads, starting from initial value 0; panels labeled Atomicity, Safety, and Regularity, with violations marked.]

Emulating Shared Memory
[Diagram: User 2 (writer) and Users 1 and 3 (readers) access a single server (Status: WORKING, Data: 5); both readers obtain 5.]

Weakness of the Centralized Approach
[Diagram: the single server has Status: FAILED; the readers receive an error.]

Replication in Distributed Setting
[Diagram: the data is replicated on three servers, one with Status: FAILED and two with Status: WORKING and Data: 5; both readers still obtain 5.]

The ABD Algorithm
Hagit Attiya, Amotz Bar-Noy, Danny Dolev
A SWMR algorithm
• Operation-level wait-freedom
  – Termination unaffected by concurrency
• Designed for a message-passing setting
  – Allows limited failures
  – Communication is reliable
  – Messages can be delayed

Quorum Systems and ABD
• ABD is a quorum-based algorithm
  – A quorum system is a collection of intersecting sets
• For example, a voting majority quorum system
• Data is replicated in a quorum system
  – Quorum system members are networked servers
• Guarantee of atomicity
  – Quorum intersection and read/write protocols
• Reads must write! (… sometimes, as we will see later)
  – A reader must write the latest data
  – The writer cannot be trusted to complete

Phased Read/Write Protocols
[Diagram: three WORKING servers grouped into quorums Q1 and Q2; two of the servers hold 5.]
User 2 writes its data, a 5, to quorum Q1.

Phased Read/Write Protocols
[Diagram: all three WORKING servers now hold 5.]
User 1 queries quorum Q2, sees that the latest data is a 5, and writes that value back to the server that does not yet have it.

Data Versions & Timestamps
[Diagram: the three WORKING servers hold 5,t=1; 7,t=2; and 7,t=2.]
Timestamps allow us to distinguish among different versions of the data.

Data Versions & Timestamps
[Diagram: all three WORKING servers and all users hold 7,t=2.]

Quorum Viability
[Diagram: the writer and the readers receive errors.]
A weakness of the ABD algorithm is that it depends on a quorum of servers always being viable. When no quorum is available, operations are blocked.
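The phased protocol, the timestamps, and the viability weakness described above can be sketched in a few lines of Python. This is an illustrative in-memory model, not the ABD authors' code: the names `Replica`, `reachable_majority`, `write`, and `read` are hypothetical, a majority quorum system is assumed, and the `reachable` flag stands in for both crashes and delayed messages.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    value: int = 0
    ts: int = 0             # timestamp (version) of the stored value
    reachable: bool = True  # False models a crashed server or a delayed message


class QuorumUnavailable(Exception):
    """Raised when fewer than a majority of replicas respond."""


def reachable_majority(replicas):
    """Return the responding replicas, or raise if they are not a majority."""
    live = [r for r in replicas if r.reachable]
    if len(live) <= len(replicas) // 2:
        raise QuorumUnavailable("no majority quorum is reachable")
    return live


def write(replicas, value, ts):
    """Single-writer write: store (value, ts) on a majority."""
    for r in reachable_majority(replicas):
        if ts > r.ts:
            r.value, r.ts = value, ts


def read(replicas):
    """Two-phase read: query a majority, then write back the latest pair."""
    quorum = reachable_majority(replicas)
    latest = max(quorum, key=lambda r: r.ts)   # query phase
    value, ts = latest.value, latest.ts
    for r in quorum:                           # propagation phase ("reads must write")
        if r.ts < ts:
            r.value, r.ts = value, ts
    return value, ts


servers = [Replica(value=5, ts=1) for _ in range(3)]
servers[0].reachable = False   # server 0 misses the write
write(servers, 7, 2)
servers[0].reachable = True
servers[2].reachable = False   # the reader sees a different quorum
print(read(servers))           # the reader still obtains the pair (7, 2)
```

Because any two majorities share at least one server, the reader's quorum always contains a replica that saw the latest write, and the write-back brings server 0 up to date. If two of the three servers become unreachable, both `read` and `write` raise `QuorumUnavailable`, which is the blocking behavior noted on the Quorum Viability slide.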
The RAMBO Framework
(Reconfigurable Atomic Memory for Basic Objects)
Seth Gilbert, Nancy Lynch, Alexander Shvartsman

Quorum Reconfiguration
[Diagram: an old configuration with one FAILED server and two WORKING servers, beside a new configuration of three WORKING servers; each configuration has quorums Q1 and Q2, and the data is 7,t=2.]
RAMBO uses quorum reconfiguration to ensure service longevity. A new quorum system (a new set of servers) is installed to replace the old one, allowing progress in spite of failures.

Replica Transfer
[Diagram: the value 7,t=2 is copied from the old configuration to the servers of the new configuration.]
Newly installed servers start with no information, so the replica information (copies of the data) must be transferred to the new configuration.

Garbage Collection
[Diagram: the old and new configurations, with both holding 7,t=2.]
After the information is transferred to the new servers, the old servers are phased out of use. This process is called 'garbage collection'. The mechanism for garbage collection has two phases and is analogous to read/write operations (introduced in the next slides).

Read/Write Operations
Multi-Configuration Access
[Diagram: User 1 (reader, Data: 7,t=2) contacts both the old configuration and the new configuration.]
What if reads and writes occur during reconfiguration? Concurrent operations contact all existing configurations to ensure that the latest information is accessed.
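The multi-configuration rule above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not RAMBO's actual code: each configuration is a list of `(timestamp, value)` pairs with `None` marking a failed server, majority quorums are assumed, and the function names `query_config`, `query_all`, `propagate_all`, and `read` are hypothetical.

```python
def query_config(members):
    """Return the newest (ts, value) pair among a majority of one configuration."""
    needed = len(members) // 2 + 1
    replies = [m for m in members if m is not None][:needed]  # None = failed server
    if len(replies) < needed:
        raise RuntimeError("configuration has no live majority")
    return max(replies)  # tuples compare by timestamp first


def query_all(active_configs):
    """Query phase: contact every active configuration, keep the newest pair."""
    return max(query_config(members) for members in active_configs.values())


def propagate_all(active_configs, pair):
    """Propagation phase: push the pair to the live servers of every configuration.

    For simplicity this updates all live servers, which includes a majority.
    """
    for members in active_configs.values():
        for i, m in enumerate(members):
            if m is not None and m < pair:
                members[i] = pair


def read(active_configs):
    pair = query_all(active_configs)
    propagate_all(active_configs, pair)
    return pair[1]


configs = {
    "old": [None, (2, 7), (2, 7)],       # one failed server, data 7,t=2
    "new": [(0, 0), (0, 0), (0, 0)],     # installed, replica transfer not yet done
}
print(query_config(configs["new"]))  # the new configuration alone is stale
print(read(configs))                 # querying both configurations finds 7
```

Querying only the new configuration would return the initial value, which is why an operation that overlaps a reconfiguration has to consult every configuration it knows to be active.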
Read/Write Operations
Garbage Collection
[Diagram: User 1 (reader, Data: 7,t=2) with the old configuration (one FAILED server) and the new configuration, both holding 7,t=2.]
Old configurations need to be removed from use. Ongoing read/write operations use their existing configuration knowledge. New operations ignore the old configuration.

Research Questions
Q1: Can a reader (respectively, writer) avoid contacting configurations that it has learned are marked as garbage collected?
Q2: When can a reader avoid its second phase, and can a reader propagate selectively?
Q3: Can we propagate to the most recent configuration only?

Concurrent Garbage Collection (Q1)
[Diagram: User 1 (reader) exchanges messages, numbered 1 through 7, with two configurations of WORKING servers whose replicas hold 5,t=1; 7,t=2; and 0,t=0; the reader returns 7.]
We believe that the garbage-collected configuration can in fact be ignored, because the reader learns of the configuration's information regardless.

Improved Configuration Management (Q1)
• The authors of RAMBO conjecture that operations must contact all configurations that are discovered during the query (respectively, propagate) phase.
• Communicating with configurations learned to be garbage collected mid-operation is unnecessary
  – The reader discovers mid-operation, from another server, that a configuration has been garbage collected
  – That server knows a tag at least as recent as any known in the old configurations
• IMPACT: improves operation liveness

Improved Bookkeeping (Q2)
[Diagram: User 1 (reader) queries three WORKING servers, all of which report 7,t=2.]
After querying, the reader learns that a majority of nodes already has the up-to-date information, which makes propagation unnecessary.
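The bookkeeping idea above can be sketched as follows. This is a hypothetical illustration of the proposal, not code from the RAMBO implementation: servers are modeled as a dictionary from a server name to a `(timestamp, value)` pair, a majority quorum system is assumed, and `semi_fast_read` and its `replied` parameter are invented names.

```python
def semi_fast_read(servers, replied):
    """Read that propagates only when, and only where, it is needed.

    servers: dict mapping server name -> (ts, value)
    replied: names of the servers that answered the query phase
    Returns (value, number_of_propagation_messages_sent).
    """
    if len(replied) <= len(servers) // 2:
        raise ValueError("query phase needs replies from a majority")

    # Query phase: record the pair reported by each responding server.
    seen = {name: servers[name] for name in replied}
    latest = max(seen.values())
    stale = [name for name, pair in seen.items() if pair < latest]
    fresh = len(seen) - len(stale)

    # If a majority of all servers already holds the latest pair,
    # the propagation phase can be skipped entirely.
    if fresh > len(servers) // 2:
        return latest[1], 0

    # Otherwise contact only the servers known to be out of date.
    for name in stale:
        servers[name] = latest
    return latest[1], len(stale)


up_to_date = {"a": (2, 7), "b": (2, 7), "c": (1, 5)}
print(semi_fast_read(up_to_date, ["a", "b", "c"]))   # no propagation needed

lagging = {"a": (2, 7), "b": (1, 5), "c": (1, 5)}
print(semi_fast_read(lagging, ["a", "b"]))           # one selective message
```

In the first call a majority already stores the latest pair, so the read finishes after one phase. In the second call the reader updates only server `b`, after which a majority holds the latest pair.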
Semi-Fast Read Operations (Q2)
• Read operations currently always propagate
  – Regardless of the actual replica dissemination
  – Redundant messages and slow operation
• The proposed solution
  – During the query phase, the reader records the latest timestamps of the servers with which it communicated
  – The reader contacts only the servers that are not up-to-date
  – Sometimes this allows omitting the propagation phase entirely ('semi-fast' read operations)
• IMPACT: improves operation latency and reduces communication costs

Overly Extensive Propagation (Q3)
[Diagram: User 1 (writer, Data: 7,t=2) with an old configuration containing one FAILED server and a new configuration of WORKING servers, both holding 7,t=2.]
Currently, RAMBO both queries and propagates to all active configurations. In fact, a query phase that covers all active configurations is sufficient for atomicity.

Propagate to the Latest Configuration (Q3)
• We believe it is not necessary to propagate to any configuration but the last active configuration.
• Properties of configuration information
• All configurations are totally ordered.
• Configurations have a forward link.
• Discovery is faster than reconfiguration
• Operations query all active configurations
• IMPACT: reduces communication cost

Summary
• Algorithmic optimizations
• Opportunistic benefits
  – A clear advantage when
    • Servers gossip, and
    • Configurations have members in common
• Changes are minimally intrusive
  – Modest increase in bookkeeping and in the size of messages

Future Work
• Formal reasoning
  – Use the Input/Output Automata framework to demonstrate that the new changes preserve the consistency guarantees of RAMBO
• Simulation
  – Use the TEMPO toolkit to simulate RAMBO executions and build confidence in our proofs
• Empirical experiments
  – Augment the existing implementations of RAMBO and collect behavior data on PlanetLab

Special Thanks to:
The MIT PRIMES Program
Supervisor: Prof. Nancy Lynch
Mentor: Dr. Peter Musial