Coding for Atomic Shared Memory Emulation
Viveck R. Cadambe (MIT)
Joint work with Prof. Nancy Lynch (MIT), Prof. Muriel Médard (MIT), and Dr. Peter Musial (EMC)
Erasure Coding for Distributed Storage
• Locality, repair bandwidth, caching and content distribution
  – [Gopalan et al. 2011, Dimakis-Godfrey-Wu-Wainwright 2010, Wu-Dimakis 2009, Niesen-Ali 2012]
• Queuing theory
  – [Ferner-Medard-Soljanin 2012, Joshi-Liu-Soljanin 2012, Shah-Lee-Ramchandran 2012]
This talk: theory of distributed computing
Considerations for storing data that changes: failure tolerance, low storage costs, fast reads and writes.
Consistency: even as the value changes, reads should return the "latest" version.
Shared Memory Emulation - History
Atomic (consistent) shared memory
• [Lamport 1986]
• Cornerstone of distributed computing and multi-processor programming
Emulation over distributed storage systems
• "ABD" algorithm [Attiya-Bar-Noy-Dolev 1995], 2011 Dijkstra Prize
• Amazon Dynamo key-value store [DeCandia et al. 2008]
• Replication-based
Costs of emulation (this talk)
• Low-cost coding-based algorithm
• Communication and storage costs
• [C-Lynch-Medard-Musial 2014], preprint available
Atomicity [Lamport 86], a.k.a. linearizability [Herlihy, Wing 90]
[Figure: timelines of overlapping write and read operations, contrasting an execution that is atomic with one that is not atomic.]
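To make the figure's atomic / not-atomic distinction concrete, here is a brute-force checker for a single read/write register: an execution is atomic exactly when the operations admit a total order that respects real-time precedence and in which every read returns the latest preceding write. This is an illustrative sketch (the operation encoding and the two example histories are made up here), not the talk's formal definition.

```python
# Brute-force atomicity (linearizability) check for one read/write register.
from itertools import permutations

# Each operation: (kind, value, start_time, end_time). Reads carry the value
# they returned; writes carry the value they wrote.

def is_atomic(ops, initial=None):
    """True if some total order of the operations (1) respects real-time
    order and (2) has every read return the latest preceding write
    (or the initial value if there is none)."""
    n = len(ops)
    for order in permutations(range(n)):
        pos = {idx: p for p, idx in enumerate(order)}
        # (1) If op a finished before op b started, a must precede b.
        if any(ops[a][3] < ops[b][2] and pos[a] > pos[b]
               for a in range(n) for b in range(n)):
            continue
        # (2) Reads must see the most recent write in the chosen order.
        current, ok = initial, True
        for idx in order:
            kind, value, _, _ = ops[idx]
            if kind == "write":
                current = value
            elif value != current:
                ok = False
                break
        if ok:
            return True
    return False

# A write of 1 overlapping a read: the read may return old or new value.
print(is_atomic([("write", 1, 0, 10), ("read", 1, 5, 15)], initial=0))  # True
# A read that starts after the write completed must not go "back in time".
print(is_atomic([("write", 1, 0, 4), ("read", 0, 5, 6)], initial=0))    # False
```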
Distributed Storage Model
[Figure: write clients and read clients connected to a set of servers over point-to-point links.]
• Client-server architecture; nodes can fail (the number of server failures is limited)
• Point-to-point reliable links (arbitrary delay)
• Nodes do not know whether other nodes have failed
• An operation should not have to wait for other operations to complete
Requirements and cost measures
Design write, read, and server protocols such that
• operations are atomic;
• operations can run concurrently, with no waiting.
Communication overhead: number of bits sent over the links.
Storage overhead: (worst-case) server storage cost.
The ABD algorithm (sketch)
Quorum set: every majority of server nodes.
Any two quorum sets intersect in at least one node.
The algorithm works as long as at least one quorum set is available.
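A minimal sketch of the property this quorum system provides, with an illustrative N = 5 servers: any two majorities share at least one node, so a reader's quorum always overlaps the quorum that acknowledged the latest completed write.

```python
# Check that every pair of majority quorums intersects. N is illustrative.
from itertools import combinations

N = 5
majorities = [set(q) for q in combinations(range(N), N // 2 + 1)]

# Two majorities of N nodes always share at least one node.
assert all(q1 & q2 for q1 in majorities for q2 in majorities)
print(f"all pairs of the {len(majorities)} majorities intersect")
```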
The ABD algorithm (sketch)
Write:
Send the time-stamped value to every server; return after receiving acks from a quorum.
Read:
Send a read query; wait for responses from a quorum; send the latest (time-stamp, value) pair to the servers;
return the latest value after receiving acks from a quorum.
Servers:
Store the value if its time-stamp is the latest seen; send an ack.
Respond to a read query with the stored time-stamp and value.
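The following is a minimal, single-process sketch of the read/write pattern described above, intended only to make the two phases concrete. Real ABD runs over asynchronous message passing and proceeds as soon as any quorum responds; here every server replies immediately (so the "quorum" is simply all servers), a single writer draws time-stamps from a local counter, and all class and method names are illustrative rather than the paper's pseudocode.

```python
class Server:
    """Keeps the value with the highest time-stamp (tag) seen so far."""
    def __init__(self):
        self.tag, self.value = 0, None

    def store(self, tag, value):        # used by writes and by read write-back
        if tag > self.tag:
            self.tag, self.value = tag, value
        return "ack"

    def query(self):                    # first phase of a read
        return self.tag, self.value


class Writer:
    def __init__(self, servers):
        self.servers, self.counter = servers, 0

    def write(self, value):
        # Send the time-stamped value to every server; return once a quorum
        # has acknowledged (here: all servers, for brevity).
        self.counter += 1
        for s in self.servers:
            s.store(self.counter, value)


class Reader:
    def __init__(self, servers):
        self.servers = servers

    def read(self):
        # Phase 1: query a quorum and pick the pair with the highest tag.
        tag, value = max((s.query() for s in self.servers),
                         key=lambda tv: tv[0])
        # Phase 2 (write-back): propagate the latest pair to a quorum before
        # returning, so a later read can never return an older value.
        for s in self.servers:
            s.store(tag, value)
        return value


servers = [Server() for _ in range(5)]
Writer(servers).write("v1")
print(Reader(servers).read())           # -> v1
```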
The ABD algorithm (summary)
• The ABD algorithm ensures atomic operations.
• Termination of operations is ensured as long as a majority of the nodes do not fail.
• Implication: a networked distributed storage system can be used as a shared memory.
• Replication is used to ensure failure tolerance.
Performance Analysis
[Table: storage, write-communication, and read-communication costs of the ABD algorithm.]
• f represents the number of failures
• A lower-communication-cost algorithm appears in [Fan-Lynch 03]
Shared Memory Emulation – Erasure Coding
• [Hendricks-Ganger-Reiter 07, Dutta-Guerraoui-Levy 08, Dobre et al. 13, Androulaki et al. 14]
• New algorithm, with a formal analysis of costs
• Outperforms previous algorithms in certain aspects:
  – previous algorithms incur infinite worst-case storage costs
  – previous algorithms incur large communication costs
Erasure Coded Shared Memory
Smaller packets, smaller overheads.
Example: a (6,4) MDS code
• The value is recoverable from any 4 coded packets
• Each coded packet is ¼ the size of the value
• New constraint: a reader needs 4 coded packets with the same time-stamp
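A small sketch of what the (6,4) MDS code example promises, using a Reed-Solomon-style construction over the prime field GF(257); the field and evaluation points are assumptions made for illustration, not necessarily the construction used in the talk. Four data symbols are encoded into six coded symbols, and any four of them recover the value.

```python
P = 257                          # prime field size (illustrative choice)
N, K = 6, 4                      # (N, K) = (6, 4) MDS code
POINTS = list(range(1, N + 1))   # distinct evaluation points

def encode(data):
    """K data symbols -> N coded symbols (polynomial evaluations)."""
    assert len(data) == K
    return [sum(d * pow(x, i, P) for i, d in enumerate(data)) % P
            for x in POINTS]

def decode(chunks):
    """Any K pairs (server index, coded symbol) -> the K data symbols.
    Solves the K x K Vandermonde system by Gauss-Jordan elimination mod P."""
    assert len(chunks) == K
    rows = [[pow(POINTS[j], i, P) for i in range(K)] + [y] for j, y in chunks]
    for col in range(K):
        piv = next(r for r in range(col, K) if rows[r][col])
        rows[col], rows[piv] = rows[piv], rows[col]
        inv = pow(rows[col][col], P - 2, P)          # modular inverse
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(K):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [(a - f * b) % P for a, b in zip(rows[r], rows[col])]
    return [rows[i][K] for i in range(K)]

value = [10, 20, 30, 40]                 # the "value", split into 4 symbols
coded = encode(value)                    # 6 coded symbols, one per server
assert decode(list(enumerate(coded))[:4]) == value   # any 4 symbols suffice
assert decode(list(enumerate(coded))[2:]) == value
print("decoded from two different 4-subsets of the 6 coded symbols")
```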
Coded Shared Memory – Quorum set-up
Quorum set: every subset of 5 server nodes.
Any two quorum sets intersect in at least 4 nodes.
The algorithm works as long as at least one quorum set is available.
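A small check of the stated intersection property, using the (6,4) example's parameters for illustration: with 6 servers and quorums of size 5, any two quorums share at least 4 servers, so a reader's quorum is guaranteed to contain enough coded symbols with a common time-stamp to decode.

```python
from itertools import combinations

N, K, QUORUM_SIZE = 6, 4, 5
quorums = [set(q) for q in combinations(range(N), QUORUM_SIZE)]

# Any two 5-subsets of 6 servers overlap in at least K = 4 servers.
assert all(len(q1 & q2) >= K for q1 in quorums for q2 in quorums)
print(f"all pairs of the {len(quorums)} quorums intersect in >= {K} servers")
```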
Coded Shared Memory – Why is it challenging?
Servers store multiple versions of the value.
Challenges: reveal a version to readers only when enough of its coded symbols have propagated;
discard old versions safely.
Solutions: write in multiple phases;
store all the write-versions concurrent with a read.
Coded Shared Memory – Protocol overview
Write:
Send a time-stamped coded symbol to each server; send a finalize message after getting acks
from a quorum; return after receiving acks from a quorum.
Read:
Send a read query; wait for time-stamps from a quorum;
send a request with the latest time-stamp to the servers;
decode and return the value after receiving acks/symbols from a quorum.
Servers:
Store the coded symbol; keep the latest δ codeword symbols and delete older ones; send an ack.
Set the finalize flag for the time-stamp on receiving a finalize message; send an ack.
Respond to a read query with the latest finalized time-stamp.
Finalize the requested time-stamp; respond to the read request with the codeword symbol if it
exists, else send an ack.
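A minimal sketch of the server-side state implied by the protocol description above: at most the latest δ coded symbols are retained, finalize flags are tracked per time-stamp, and the three kinds of client messages are answered as described. This is one illustrative reading of the slide text (the names and the choice δ = 2 are assumptions), not the paper's pseudocode, and it omits the clients and the network.

```python
DELTA = 2   # bound on the number of writes concurrent with a read (assumed)

class CodedServer:
    def __init__(self):
        self.symbols = {}        # time-stamp -> coded symbol (latest DELTA kept)
        self.finalized = set()   # time-stamps whose finalize message arrived

    def write_symbol(self, ts, symbol):
        """Store the coded symbol; keep only the latest DELTA time-stamps."""
        self.symbols[ts] = symbol
        for old in sorted(self.symbols)[:-DELTA]:
            del self.symbols[old]          # discard older versions
        return "ack"

    def finalize(self, ts):
        """Mark the time-stamp as finalized (its write reached a quorum)."""
        self.finalized.add(ts)
        return "ack"

    def read_query(self):
        """Return the latest finalized time-stamp, if any."""
        return max(self.finalized, default=None)

    def read_request(self, ts):
        """Finalize the requested time-stamp; return its coded symbol if the
        server still holds it, otherwise just acknowledge."""
        self.finalized.add(ts)
        return self.symbols.get(ts, "ack")

s = CodedServer()
s.write_symbol(1, "c1"); s.finalize(1)
s.write_symbol(2, "c2"); s.write_symbol(3, "c3"); s.write_symbol(4, "c4")
print(sorted(s.symbols))      # [3, 4]: only the latest DELTA = 2 versions remain
print(s.read_query())         # 1: the latest finalized time-stamp
```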
Coded Shared Memory – Protocol summary
• Uses an (N,k) MDS code, where N is the number of servers
• Ensures atomic operations
• Termination of operations is ensured as long as
  o the number of failed nodes is at most (N-k)/2
  o the number of writes concurrent with a read is smaller than δ
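A tiny numeric sketch of the stated parameter relation: the failure bound f ≤ (N − k)/2 allows k as large as N − 2f, and each coded symbol is then 1/k of the value size, matching the (6,4) example with N = 6 and f = 1. The helper name is made up for illustration.

```python
def choose_k(n_servers, f_failures):
    """Largest MDS dimension k allowed by the bound f <= (N - k)/2."""
    k = n_servers - 2 * f_failures
    assert k >= 1, "too many failures for this many servers"
    return k

N, f = 6, 1
k = choose_k(N, f)
print(f"(N, k) = ({N}, {k}); each coded symbol is 1/{k} of the value size")
```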
Performance comparisons
[Table: storage, write-communication, and read-communication costs of ABD vs. our algorithm.]
• N represents the number of nodes, f the number of failures
• δ represents the maximum number of writes concurrent with a read
Proof Steps
• After every operation terminates,
  - there is a quorum of servers with the codeword symbol,
  - there is a quorum of servers with the finalize label,
  - and because any two quorums intersect in at least k servers, readers can decode the value.
• When a codeword symbol is deleted at a server,
  - every operation that wants that time-stamp has terminated
  - (or the concurrency bound is violated).
Main Insights
• Significant savings in network traffic overheads
  - reflects the classical gain of erasure coding over replication
• (New insight) Storage overhead depends on client activity
  - the storage overhead is proportional to the number of writes concurrent with a read
  - better than classical techniques for moderate client activity
Future Work – Many open questions
Refinements of our algorithm
- (Ongoing) more robustness to client node failures
Information-theoretic bounds on costs
- new coding schemes
Finer network models
- erasure channels, different topologies, wireless channels
Finer source models
- correlations across versions
Dynamic networks
- an interesting replication-based algorithm in [Gilbert-Lynch-Shvartsman 03]
- study of costs in terms of network dynamics
Storage costs
[Plot: storage overhead of our algorithm vs. ABD as a function of the number of writes concurrent with a read (1 to 10). What is the fundamental cost curve?]