Dynamo: Amazon’s Highly
Available Key-value Store
Giuseppe DeCandia et al.,
SOSP ‘07
Introduction
• Dynamo: used to manage applications that
require only primary-key access to data
• Dynamo applications need scalability, high
availability, fault tolerance, but don’t need
the complexity of a relational DB
– ACID properties => little parallelism, low
availability
Assumptions:
• Applications perform simple read/write operations
on single, small (< 1 MB) data objects, each
identified by a unique key.
– Example: the shopping cart
• Replace ACID properties with weaker
guarantees: eventual consistency, no isolation
promises
• Services must operate efficiently on commodity
hardware
• Used only by internal services, so security isn’t
an issue
Service Level Agreements (SLA)
• Clients and servers negotiate SLAs to
establish the kind of service and the
expected performance
• Amazon expects the guarantees to apply
to 99.9% of requests
– Claim that most industry systems express
SLAs in terms of “average”, “median”, and
“expected variance” – much weaker than
Amazon’s requirements
Design Considerations
• Services control properties such as durability
and consistency, evaluate tradeoffs (cost v
performance, for example)
• Replicated databases cannot guarantee
strong consistency and high availability at the
same time
– Optimistic replication updates replicas as a
background process to get eventual consistency
Design Considerations:
Resolving Conflicting Updates
• When
– Since Dynamo targets services that require
“always writeable” data storage (e.g., users
must always be able to add to or delete from
the shopping cart), conflicts are resolved
during reads, not writes
• By Whom
– Let each application decide for itself
– But … default is “last write wins”.
Other Key Design Principles
• Incremental scalability: adding a single
node should not affect the system
significantly
• Symmetry: all nodes have the same
responsibilities
• Decentralization: favor P2P techniques
over centralized control
• Heterogeneity: take advantage of
differences in server capabilities.
Comparison to Other Systems
• Peer-to-Peer (Freenet, Chord, …)
– Structured v unstructured: access times
– Conflict resolution for concurrent updates
without wide-area file locking
• Distributed File Systems and Databases
(Google, Bayou, Coda, …)
– Treatment of system partitions
– Conflict resolution, eventual consistency
– Strong consistency v eventual consistency
Dynamo v Other Decentralized
Storage Systems
• “Always writeable”:
– updates won’t be rejected because of failure
or concurrent updates
• One administrative domain; nodes are
assumed to be trustworthy
• Don’t require hierarchical name spaces or
relational schema
• Operations must be performed within a
few hundred milliseconds.
System Architecture
• The Dynamo data storage system
contains items that are associated with a
single key
• Two operations are implemented: get( ) and put( )
– get(key)
– put(key, context, object), where context refers
to various kinds of system metadata
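A minimal sketch of this interface, assuming a hypothetical DynamoClient class (the class name and its single-machine internals are illustrative stand-ins, not Amazon's actual client API):

```python
# Sketch of Dynamo's client-facing interface. The paper specifies only
# the get()/put() semantics; this in-memory class is a stand-in.

class DynamoClient:
    def __init__(self):
        self.store = {}  # stand-in for the distributed store

    def get(self, key):
        """Return (versions, context): any causally unrelated versions
        of the object, plus opaque metadata such as vector clocks."""
        versions, context = self.store.get(key, ([], None))
        return versions, context

    def put(self, key, context, obj):
        """Store obj under key; context carries version metadata from a
        preceding get(), letting the system order this write."""
        self.store[key] = ([obj], context)

client = DynamoClient()
_, ctx = client.get("cart:alice")
client.put("cart:alice", ctx, {"items": ["book"]})
```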
Problem                        | Technique                              | Advantage
Partitioning                   | Consistent hashing                     | Incremental scalability
High availability for writes   | Vector clocks, reconciled during reads | Version size is decoupled from update rates
Temporary failures             | Sloppy quorum, hinted handoff          | Provides high availability and durability guarantee when some of the replicas are not available
Permanent failures             | Anti-entropy using Merkle trees        | Synchronizes divergent replicas in the background
Membership & failure detection | Gossip-based protocol                  | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information
Table 1: Summary of techniques used in Dynamo and their advantages
Partitioning Algorithm
• Partitioning = dividing data storage across
all nodes. Supports scalability
• Very similar to Chord-based schemes
• Consistent hashing scheme distributes
content across multiple nodes
– In consistent hashing the effect of adding a
node is localized – on average, K/n objects
must be remapped (K = # of keys, n = # of
nodes)
Partitioning Algorithm
• Hash function produces an m-bit number
which defines a circular name space (like
Chord)
• Nodes are assigned numbers randomly in
the name space
• Hash(data key) and assign to node using
successor function like Chord
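A minimal sketch of this scheme, assuming MD5 as the hash function and illustrative node names (both are assumptions; the paper does not fix these details):

```python
# Chord-style consistent hashing: nodes and keys hash into a circular
# space; a key is stored at its successor (first node clockwise).
import bisect
import hashlib

def h(s: str) -> int:
    # 128-bit ring position derived from MD5 (an assumption here)
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes):
        # (position, node) pairs sorted by position on the ring
        self.tokens = sorted((h(n), n) for n in nodes)

    def successor(self, key: str) -> str:
        pos = h(key)
        i = bisect.bisect_right(self.tokens, (pos,))  # first token >= pos
        return self.tokens[i % len(self.tokens)][1]   # wrap around zero

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.successor("user:42"))  # the node that stores this key
```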
Load Distribution
• Random assignment of node to position in
ring may produce non-uniform distribution
of data.
• Solution: virtual nodes
– Assign several random numbers to each
physical node; now it is responsible for itself
and data that would be stored on the virtual
nodes, if they existed
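A sketch of virtual nodes under the same assumptions as above; the token count per node (VNODES) is a tunable chosen here for illustration:

```python
# Virtual nodes: each physical node claims several random ring
# positions ("tokens"), which smooths the key distribution.
import bisect
import hashlib
from collections import Counter

VNODES = 32  # tokens per physical node (illustrative value)

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

# Token "node-a#7" hashes to its own ring position but maps back to
# the physical node "node-a".
tokens = sorted(
    (h(f"{node}#{v}"), node)
    for node in ("node-a", "node-b", "node-c")
    for v in range(VNODES)
)

def owner(key: str) -> str:
    i = bisect.bisect_right(tokens, (h(key),))
    return tokens[i % len(tokens)][1]

# With enough tokens per node, 10,000 keys split roughly three ways.
print(Counter(owner(f"key-{i}") for i in range(10_000)))
```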
Replication
• Data is replicated at N nodes
• Succ(key) = coordinator node
– The coordinator replicates the object at the next N-1
successor nodes in the ring, skipping positions that
map to already-included physical nodes so the replicas
land on distinct machines (increasing fault tolerance)
– Preference list: the list of nodes that store a
particular key
– There are actually > N nodes on the preference
list, in order to ensure N “healthy” nodes at all
times.
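A sketch of preference-list construction, reusing the virtual-node ring from the previous sketch (node names and token counts are again illustrative):

```python
# Build a preference list: walk the ring clockwise from the key's
# position, collecting N *distinct physical* nodes so that several
# virtual nodes of one machine never count as separate replicas.
import bisect
import hashlib

N = 3  # replication factor

def h(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

tokens = sorted(
    (h(f"{node}#{v}"), node)
    for node in ("node-a", "node-b", "node-c", "node-d")
    for v in range(8)
)

def preference_list(key: str):
    i = bisect.bisect_right(tokens, (h(key),))
    nodes, seen = [], set()
    for j in range(len(tokens)):             # walk clockwise, wrapping
        node = tokens[(i + j) % len(tokens)][1]
        if node not in seen:                 # skip repeat physical nodes
            seen.add(node)
            nodes.append(node)
        if len(nodes) == N:
            break
    return nodes  # first entry is the coordinator

print(preference_list("cart:alice"))
```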
Data Versioning
• Updates can be propagated to replicas
asynchronously – the put( ) call may return
before all updates have been applied.
– Implication: a subsequent get( ) may return
stale data.
• Barring failure, most updates are applied
within bounded time, but server or network
failure can delay updates “for an extended
period of time”.
Data Versioning
• Some applications can be designed to work in
this environment; e.g., the “add-to/delete-from
cart” operations.
– It’s okay to add to an old cart, as long as all
versions of the cart are eventually reconciled
• Dynamo treats each modification as a new
(& immutable) version of the object.
– Multiple versions can exist at the same time
Reconciliation
• Usually, new versions contain the old
versions – no problem
• Sometimes concurrent updates and
failures generate conflicting versions
• Typically this is handled by merging
– For add-to-cart operations, nothing is lost
– For delete-from cart, deleted items might
reappear after the reconciliation
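A toy illustration of that anomaly, modeling each cart version as a plain set of items (a simplification; the paper does not specify the cart's data structure):

```python
# Reconciling divergent cart versions by unioning their item sets.
# A delete is just an absent item, so an item deleted in one branch
# but still present in another "reappears" after the merge.
cart_v1 = {"book", "phone"}          # common ancestor
branch_a = cart_v1 | {"camera"}      # one replica: camera added
branch_b = cart_v1 - {"phone"}       # another replica: phone deleted

merged = branch_a | branch_b
print(merged)  # {'book', 'phone', 'camera'}: the deleted phone is back
```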
Parallel Version Branches
• There may be multiple versions of the same
data, each coming from a different path (e.g., if
there’s been a network partition)
• Vector clocks are used to identify causally
related versions and parallel (concurrent)
versions
– For causally related versions, accept the final version
as the “true” version
– For parallel (concurrent) versions, use some
reconciliation technique to resolve the conflict
Execution of get( ) and put( )
• Operations can originate at any node in the
system.
• Clients may
– Route request through a load-balancing coordinator
node
– Use client software that routes the request directly to
the coordinator for that object
• The coordinator contacts R nodes for reading
and W nodes for writing, where R + W > N
“Sloppy Quorum”
• put( ): the coordinator writes to the first N healthy
nodes on the preference list. If W writes succeed,
the write is considered to be successful
• get( ): coordinator reads from N nodes; waits for R
responses.
– If they agree, return value.
– If they disagree, but are causally related, return the
most recent value
– If they are causally unrelated apply reconciliation
techniques and write back the corrected version
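A sketch of the quorum arithmetic with the common configuration N=3, R=2, W=2 (so R + W > N); the failure simulation and in-memory replicas are stand-ins:

```python
# Quorum logic: a put succeeds once W replicas ack; a get gathers R
# responses, which by R + W > N must overlap the written set.
import random

N, R, W = 3, 2, 2
replicas = [{} for _ in range(N)]  # stand-ins for N healthy nodes

def put(key, value) -> bool:
    acks = 0
    for rep in replicas:
        if random.random() < 0.9:  # simulate an occasional failed write
            rep[key] = value
            acks += 1
    return acks >= W               # success iff at least W writes landed

def get(key):
    # Take the first R responses; with versioning, causally related
    # values collapse to the newest, and causally unrelated ones would
    # all be returned to the client for reconciliation.
    return [rep[key] for rep in replicas if key in rep][:R]

if put("cart:alice", ["book"]):
    print(get("cart:alice"))
```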
Hinted Handoff
• What if a write operation can’t reach some of the
nodes on the preference list?
• To preserve availability and durability, store the
replica temporarily on another node,
accompanied by a metadata “hint” that
remembers where the replica should be stored.
• Hinted handoff ensures that read and write
operations don’t fail because of network
partitioning or node failures.
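A sketch of the mechanism, with a global hint list and dict-per-node stores standing in for the real machinery:

```python
# Hinted handoff: if an intended replica is down, the write goes to a
# fallback node tagged with a "hint" naming the intended home; the
# fallback returns the replica once the home node recovers.

hints = []  # (holding_node, intended_node, key, value)

def write_replica(intended, fallback, key, value, alive):
    if intended in alive:
        alive[intended][key] = value
    else:
        alive[fallback][key] = value
        hints.append((fallback, intended, key, value))  # remember home

def handoff(alive):
    for fallback, intended, key, value in list(hints):
        if intended in alive:                  # home node is back up
            alive[intended][key] = value
            del alive[fallback][key]
            hints.remove((fallback, intended, key, value))

nodes = {"node-b": {}, "node-c": {}}           # node-a is down
write_replica("node-a", "node-b", "k", "v", nodes)
nodes["node-a"] = {}                           # node-a recovers
handoff(nodes)
print(nodes)  # the replica of "k" now lives on node-a
```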
Handling Permanent Failures
• Hinted replicas may be lost before they
can be returned to the original node. Other
problems may cause replicas to be lost or
fall out of agreement
• Merkle trees allow two nodes to compare
a set of replicas and determine fairly easily
– Whether or not they are consistent
– Where the inconsistencies are
Handling Permanent Failures
• Merkle trees have leaves whose values are
hashes of the values associated with keys (one
key/leaf)
– Parent nodes contain hashes of their children
– Eventually, root contains a hash that represents
everything in that replica
• To detect inconsistency between two sets of
replicas, compare the roots
– Source of inconsistency can be detected by looking at
internal nodes
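A small sketch of this comparison, assuming SHA-256 and a power-of-two number of keys for brevity (the paper does not fix either):

```python
# Merkle tree over a key range: leaves hash individual values, parents
# hash their children. Equal roots mean the replicas agree; differing
# leaves localize exactly which keys must be synchronized.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(values):
    """Return the tree as a list of levels, leaves first, root last."""
    level = [h(v.encode()) for v in values]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

replica1 = ["a", "b", "c", "d"]
replica2 = ["a", "b", "x", "d"]     # one divergent key
t1, t2 = build(replica1), build(replica2)
print(t1[-1] == t2[-1])             # False: roots differ, so walk down
diff = [i for i, (x, y) in enumerate(zip(t1[0], t2[0])) if x != y]
print(diff)                         # [2]: only key index 2 needs syncing
```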
Failures
• Like Google, Amazon has a number of
data centers, each with many commodity
machines.
– Individual machines fail regularly
– Sometimes entire data centers fail due to
power outages, network partitions, tornados,
etc.
• To handle failure of entire centers, replicas
are spread across multiple data centers.
Membership and Failure Detection
• Temporary failures or accidental additions
of nodes are possible but shouldn’t cause
load re-balancing.
• Additions and deletions of nodes are
explicitly executed by an administrator.
• A gossip-based protocol is used to ensure
that every node eventually has a
consistent view of the membership list.
Gossip-based Protocol
• Periodically, each node contacts another
node in the network, randomly selected.
• Nodes compare their membership
histories and reconcile them.
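A sketch of one such round, modeling each node's membership view as a dict of per-member version numbers (a simplification of the paper's history reconciliation):

```python
# One gossip round: each node pulls the freshest entry per member from
# a randomly chosen peer, so all views converge over repeated rounds.
import random

# view: {member: version}; a higher version is more recent information
views = {
    "node-a": {"node-a": 3, "node-b": 1},
    "node-b": {"node-b": 2, "node-c": 5},
    "node-c": {"node-c": 5},
}

def gossip_round():
    for node, view in views.items():
        peer = random.choice([n for n in views if n != node])
        for member, version in views[peer].items():
            if version > view.get(member, -1):   # keep the newer entry
                view[member] = version

for _ in range(5):   # a few rounds suffice at this toy scale
    gossip_round()
print(views)         # every node converges on the same membership view
```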
Load Balancing for
Additions and Deletions
• When a node is added, it acquires key values
from other nodes in the network.
– Existing nodes learn of the addition through the
gossip protocol and offer the new node the keys it
now owns; the keys are transferred once the offer is
accepted
– When a node is removed, a similar process happens
in reverse
• Experience has shown that this approach leads
to a relatively uniform distribution of key/value
pairs across the system
Summary
• Experience with Dynamo indicates that it meets
the requirements of scalability and availability.
• Service owners are able to customize their
storage system to emphasize performance,
durability, or consistency. The primary
parameters are N, R, and W.
• The developers conclude that decentralization
and eventual consistency can provide a
satisfactory platform for hosting highly-available
applications.