Improving the Efficiency of
Fault-Tolerant Distributed
Shared-Memory Algorithms
Eli Sadovnik and Steven Homberg
Second Annual MIT PRIMES
Conference, May 19-20, 2012
Introduction
• Shared memory supports concurrent access
– Read & write interface
• Memory models: single writer, multiple reader (SWMR)
and multiple writer, multiple reader (MWMR)
– Consistency is important
• Strong consistency provides useful semantics
• Abstraction for message-passing networks
– Shared memory can be emulated
– Difficult to do, but solutions exist
– Example: Internet applications such as Dropbox
Our Research Project
THE RAMBO PROJECT
• Framework for emulating shared memory
– Introduced by Lynch and Shvartsman, extended by Gilbert
– Implements the MWMR model with strong consistency
– Designed for dynamic distributed message-passing settings
OUR GOAL
• RAMBO is elegant but not always efficient
• Extend RAMBO with intelligent data management
Consistency & Atomicity
• There are many consistency models
• We are interested in atomicity
[Figure: three timelines for a register that initially holds 0, each with one write(8) and several reads. The timelines are labeled Atomicity, Violation (Safety), and Violation (Regularity). Read results shown include read(8), read(0), and read(3).]
Emulating Shared Memory
[Figure: one server (Status: WORKING, Data: 5) is shared by User 2 (writer) and Users 1 and 3 (readers). Every user's data is 5.]
Weakness of the Centralized Approach
[Figure: the same single server with Status: FAILED and no data. User 2 (writer) and Users 1 and 3 (readers) all show "error".]
Replication in Distributed Setting
[Figure: three replica servers, one FAILED and two WORKING with Data: 5. User 2 (writer) and Users 1 and 3 (readers) all show 5.]
The ABD Algorithm
Hagit Attiya, Amotz Bar-Noy, Danny Dolev
An SWMR algorithm
• Operation-level wait-freedom
– Termination unaffected by concurrency
• Designed for a message-passing setting
– Allows limited failures
– Communication is reliable
– Messages can be delayed
Quorum Systems and ABD
• ABD is a quorum-based algorithm
– A quorum system is a collection of sets that pairwise intersect
• For example, a majority (voting) quorum system
• Data is replicated in a quorum system
– Quorum system members are networked servers
• Guarantee of atomicity
– Quorum intersection and read/write protocols
• Reads must write! (… sometimes, as we will see later)
– A reader must write the latest data
– Writer cannot be trusted to complete
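The intersection property behind majority quorums can be checked with a minimal Python sketch. The function names and the server names `s1`, `s2`, `s3` are illustrative assumptions, not part of ABD or RAMBO.

```python
from itertools import combinations


def majority_quorums(servers):
    """Return every subset of `servers` that holds a strict majority."""
    servers = sorted(servers)
    smallest = len(servers) // 2 + 1
    return [
        frozenset(group)
        for size in range(smallest, len(servers) + 1)
        for group in combinations(servers, size)
    ]


def all_pairs_intersect(quorums):
    """Check the defining property of a quorum system."""
    return all(a & b for a, b in combinations(quorums, 2))


quorums = majority_quorums({"s1", "s2", "s3"})
print(len(quorums), all_pairs_intersect(quorums))  # prints: 4 True
```

Because any two majorities share at least one server, a reader that contacts one majority always meets a server that took part in the last completed write.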
Phased Read/Write Protocols
[Figure: three WORKING servers covered by quorums Q1 and Q2; two of them hold 5. Users: User 2 (writer), Users 1 and 3 (readers).]
User 2 writes its data, a 5, to quorum Q1.
Phased Read/Write Protocols
[Figure: the same three WORKING servers in quorums Q1 and Q2; all three now hold 5.]
User 1 queries quorum Q2, sees the latest data is a 5, and writes that back to the computer that does not have the latest data.
Data Versions & Timestamps
[Figure: three WORKING servers in quorums Q1 and Q2 hold 5,t=1; 7,t=2; and 7,t=2. Some users still show the older version 5,t=1.]
Timestamps allow us to distinguish among different versions of the data.
Data Versions & Timestamps
[Figure: all three WORKING servers and all three users now hold 7,t=2.]
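The timestamped write and the two-phase read shown on these slides can be sketched in a few lines of Python. This is a simplified, single-process illustration under stated assumptions: the `Server` class and the `write` and `read` functions are hypothetical names, messages are replaced by direct method calls, and failures and concurrency are not modeled.

```python
class Server:
    """One replica holding a single (timestamp, value) pair."""

    def __init__(self):
        self.timestamp, self.value = 0, 0

    def query(self):
        return self.timestamp, self.value

    def update(self, timestamp, value):
        # Adopt a version only if it is newer than the stored one.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value


def write(quorum, timestamp, value):
    """Single-writer write: send the new version to every server in a quorum."""
    for server in quorum:
        server.update(timestamp, value)


def read(quorum):
    """Two-phase read: query a quorum, then propagate the newest version."""
    timestamp, value = max(server.query() for server in quorum)
    for server in quorum:
        server.update(timestamp, value)
    return value


s1, s2, s3 = Server(), Server(), Server()
write([s1, s2], 1, 5)   # s1 and s2 hold 5,t=1
write([s2, s3], 2, 7)   # s2 and s3 hold 7,t=2; s1 is stale
print(read([s1, s2]))   # prints 7
print(s1.query())       # prints (2, 7)
```

The read returns the version with the highest timestamp in its quorum and, as a side effect, brings the stale server up to date, which is the "reads must write" step.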
Quorum Viability
[Figure: servers in quorums Q1 and Q2, several marked FAILED; the stored data is 7,t=2, but User 2 (writer) and Users 1 and 3 (readers) all show "error".]
A weakness of the ABD algorithm is that it depends on a quorum of servers always being viable. When no quorum is available, operations are blocked.
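For a majority quorum system, the viability condition reduces to a simple count. The sketch below assumes majority quorums and a hypothetical `status` mapping from server name to the "WORKING" or "FAILED" labels used in the figures.

```python
def has_live_majority(status):
    """True if strictly more than half of the servers are WORKING."""
    working = sum(1 for state in status.values() if state == "WORKING")
    return working > len(status) // 2


print(has_live_majority({"s1": "FAILED", "s2": "WORKING", "s3": "WORKING"}))  # prints True
print(has_live_majority({"s1": "FAILED", "s2": "FAILED", "s3": "WORKING"}))   # prints False
```

In the second case no quorum can be assembled, so reads and writes block until servers recover or the set of servers is changed.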
The RAMBO Framework
(Reconfigurable Atomic Memory
for Basic Objects)
Seth Gilbert
Nancy Lynch
Alexander Shvartsman
Quorum Reconfiguration
[Figure: an old configuration with one FAILED and two WORKING servers next to a new configuration with three WORKING servers; each configuration has quorums Q1 and Q2. Stored data shown: 7,t=2.]
RAMBO uses quorum reconfiguration to ensure service longevity.
A new quorum system (a new set of servers) is installed to replace the old one, allowing
progress in spite of failures.
Replica Transfer
[Figure: the old configuration (one FAILED, two WORKING servers) and the new configuration (three WORKING servers), each with quorums Q1 and Q2. Copies of 7,t=2 move from the old servers to the new ones.]
After a new set of servers is installed, these servers do not have any information.
The replica information (copies of data) must be transferred to the new configuration.
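A minimal sketch of replica transfer, assuming each configuration is modeled as a plain mapping from server name to a (timestamp, value) pair. The function name and the server names are hypothetical, and the sketch copies to every new server rather than to a quorum.

```python
def transfer_replicas(old_replicas, new_servers):
    """Seed a new configuration with the newest version held by the old one.

    `old_replicas` maps each reachable old server to its (timestamp, value).
    Returns a mapping from every new server to that newest version.
    """
    newest = max(old_replicas.values())
    return {server: newest for server in new_servers}


old = {"a": (2, 7), "b": (2, 7)}          # the third old server has failed
new = transfer_replicas(old, ["c", "d", "e"])
print(new["c"])                            # prints (2, 7)
```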
Garbage Collection
[Figure: the old configuration (one FAILED, two WORKING servers) and the new configuration (three WORKING servers), each with quorums Q1 and Q2. Servers in both hold 7,t=2.]
After information is transferred to the new servers, the old servers are phased out of use.
This process is called "garbage collection".
The mechanism for garbage collection has two phases and is analogous to read/write
operations (introduced in the next slides).
Read/Write Operations
Multi-Configuration Access
[Figure: User 1 (reader, data 7,t=2) with the old configuration (one FAILED, two WORKING servers) and the new configuration (three WORKING servers), each with quorums Q1 and Q2. Servers hold 7,t=2.]
What if reads and writes occur during reconfiguration?
Concurrent operations contact all existing configurations to ensure the
latest information is accessed.
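Why an operation has to look at every existing configuration can be seen in a small sketch. It assumes each configuration is a mapping from server name to (timestamp, value); `query_all` is a hypothetical helper that reads every server instead of a quorum per configuration.

```python
def query_all(active_configurations):
    """Return the newest (timestamp, value) found in any active configuration."""
    return max(
        version
        for replicas in active_configurations
        for version in replicas.values()
    )


old = {"a": (2, 7), "b": (1, 5)}
new = {"c": (0, 0), "d": (0, 0), "e": (0, 0)}   # replica transfer not finished
print(query_all([old, new]))   # prints (2, 7)
print(query_all([new]))        # prints (0, 0): the latest write is missed
```

Querying only the new configuration before replica transfer completes would return stale data, which is why concurrent operations contact both.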
Read/Write Operations
Garbage Collection
[Figure: User 1 (reader, data 7,t=2) with the old configuration (one FAILED, two WORKING servers) and the new configuration (three WORKING servers), each with quorums Q1 and Q2. Servers hold 7,t=2.]
Old configurations need to be removed from use.
Ongoing read/write operations use their existing configuration knowledge. New
operations ignore the old configuration.
Research Questions
Q1: Can a reader (respectively writer) avoid contacting
configurations that it learned have been marked as
garbage collected?
Q2: When can a reader avoid its second phase, and can
a reader propagate selectively?
Q3: Can we propagate to the most recent configuration
only?
Concurrent Garbage Collection (Q1)
[Figure: a read by User 1 shown as numbered steps 1 to 7 across two configurations of three WORKING servers each (quorums Q1 and Q2). Replicas hold a mix of 0,t=0; 5,t=1; and 7,t=2. The read returns 7.]
We believe that the garbage collected configuration can in fact
be ignored because the reader learns of the configuration's
information regardless.
Improved Configuration Management (Q1)
• The authors of RAMBO conjecture that operations must
contact all configurations that are discovered during
the query (respectively propagate) phase.
• Communicating with configurations learned to be
garbage collected mid-operation is unnecessary
– The operation learns of the garbage-collected configurations
mid-operation from another server
– That server knows a tag at least as recent as any known in
the old configurations
• IMPACT: improves operation liveness
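The proposed rule can be illustrated with a rough sketch. The data layout is an assumption made for illustration only: each configuration is a dict with its `replicas` and a `collected` set naming the configurations its servers know to be garbage collected. The sketch shows the idea and is not a proof that atomicity is preserved.

```python
def query_phase(configs, start):
    """Query configurations in order, skipping ones reported as garbage collected.

    `configs` maps a configuration id to a dict with:
      "replicas":  {server: (timestamp, value)}
      "collected": ids of configurations its servers know to be removed
    Returns the newest version seen and the list of contacted configurations.
    """
    pending = list(start)
    removed = set()
    newest = (0, 0)
    contacted = []
    while pending:
        cid = pending.pop(0)
        if cid in removed:
            continue  # learned mid-operation that cid was garbage collected
        contacted.append(cid)
        config = configs[cid]
        newest = max(newest, max(config["replicas"].values()))
        removed |= config["collected"]
    return newest, contacted


configs = {
    1: {"replicas": {"a": (1, 5)}, "collected": set()},
    2: {"replicas": {"c": (2, 7)}, "collected": {1}},
}
print(query_phase(configs, [2, 1]))   # prints ((2, 7), [2])
print(query_phase(configs, [1, 2]))   # prints ((2, 7), [1, 2])
```

Both orders return the same version, because the configuration that reports the removal already holds a tag at least as recent as anything in the removed one.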
Improved Bookkeeping (Q2)
[Figure: User 1 (reader) queries three WORKING servers in quorums Q1 and Q2; every server and every reply carries 7,t=2.]
After querying, the reader learns that a majority of
nodes have the up-to-date information, which makes
propagation needless.
Semi-Fast Read Operations (Q2)
• Read operations always propagate
– Regardless of the actual replica dissemination
– Redundant messages and slow operation
• The proposed solution
– During the query phase, the reader records the latest
timestamp of each server with which it communicated
– The reader then contacts only the servers that are not up-to-date
– Sometimes this allows omitting the propagation phase
entirely ("semi-fast" read operations)
• IMPACT: improves operation latency and reduces
communication costs
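The bookkeeping described above can be sketched as follows, assuming majority quorums. The function name `semi_fast_read` and its parameters are hypothetical: `replies` maps each server that answered the query to its (timestamp, value), and `total_servers` is the size of the configuration.

```python
def semi_fast_read(replies, total_servers):
    """Return the value read and the set of servers that still need an update.

    An empty set means the propagation phase can be skipped entirely.
    """
    newest = max(replies.values())
    up_to_date = {server for server, version in replies.items() if version == newest}
    if len(up_to_date) > total_servers // 2:
        return newest[1], set()          # a majority already holds the newest version
    return newest[1], set(replies) - up_to_date


print(semi_fast_read({"s1": (2, 7), "s2": (2, 7), "s3": (1, 5)}, 3))  # prints (7, set())
print(semi_fast_read({"s1": (1, 5), "s2": (2, 7)}, 3))                # prints (7, {'s1'})
```

In the first call the read finishes after one phase; in the second the reader propagates only to `s1` rather than to the whole quorum.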
Overly Extensive Propagation (Q3)
[Figure: User 1 (writer, data 7,t=2) with the old configuration (one FAILED, two WORKING servers) and the new configuration (three WORKING servers), each with quorums Q1 and Q2. Servers hold 7,t=2.]
Currently, RAMBO both queries and propagates to all active
configurations. In fact, just the query phase covering all active
configurations is sufficient for atomicity.
Propagate to the Latest
Configuration (Q3)
• We believe it is not necessary to propagate to any
configuration but the last active configuration.
• Properties of configuration information
– All configurations are totally ordered
– Configurations have a forward link
– Discovery is faster than reconfiguration
– Operations query all active configurations
• IMPACT: reduces communication cost
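A sketch of the proposed write path, under simplifying assumptions: configurations are passed oldest first, each is a mapping from server name to (timestamp, value), and plain integer timestamps stand in for the tags a multi-writer setting would need. `write_latest_only` is a hypothetical name.

```python
def write_latest_only(active_configurations, value):
    """Write `value`, propagating only to the newest active configuration.

    `active_configurations` is ordered oldest first; each entry maps a
    server name to its (timestamp, value) pair and is updated in place.
    """
    # Query phase: still covers every active configuration.
    highest = max(
        timestamp
        for replicas in active_configurations
        for timestamp, _ in replicas.values()
    )
    new_version = (highest + 1, value)
    # Propagation phase: only the last configuration in the total order.
    latest = active_configurations[-1]
    for server in latest:
        latest[server] = new_version
    return new_version


old = {"a": (2, 7), "b": (2, 7)}
new = {"c": (2, 7), "d": (0, 0), "e": (0, 0)}
print(write_latest_only([old, new], 9))  # prints (3, 9)
print(old["a"], new["d"])                # prints (2, 7) (3, 9)
```

The old configuration receives no propagation messages, which is where the saving in communication cost would come from.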
Summary
• Algorithmic optimizations
• Opportunistic benefits
– A clear advantage when
• Servers gossip, and
• Configurations have members in common
• Changes are minimally intrusive
– Modest increase in bookkeeping and the size of
messages
Future Work
• Formal reasoning
– Use the Input/Output Automata framework to
demonstrate that the new changes preserve consistency
guarantees of RAMBO
• Simulation
– Use the TEMPO toolkit to simulate RAMBO executions and
build confidence in our proofs
• Empirical experiments
– Augment the existing implementations of RAMBO and
collect behavior data on PlanetLab
Special Thanks to:
The MIT PRIMES Program
Supervisor Prof. Nancy Lynch
Mentor Dr. Peter Musial