Dynamo: Amazon’s Highly Available Key-value Store
Presented By: Devarsh Patel
CS5204 – Operating Systems

Introduction

 Amazon’s e-commerce platform
     Requires performance, reliability, and efficiency
     To support continuous growth, the platform needs to be highly scalable
 Dynamo – a highly available and scalable distributed data store built for Amazon’s platform
 Dynamo is used to manage services that have very high reliability requirements and need tight control over the tradeoffs between availability, consistency, cost-effectiveness, and performance
 Dynamo provides a simple primary-key-only interface to meet the requirements of applications such as best-seller lists, shopping carts, customer preferences, and session management
 A completely decentralized system with minimal need for manual administration
System Assumptions and Requirements

 Simple key-value interface
 Each service that uses Dynamo runs its own Dynamo instances
 Used only by Amazon’s internal services
     Non-hostile environment
     No security requirements such as authentication and authorization
 Highly available
 Efficient in resource usage
 Simple scale-out scheme to address growth in data set size or request rates
 Targets applications that operate with weaker consistency in favor of high availability
 Service level agreements (SLAs)
     Measured at the 99.9th percentile of the distribution
     Key factor: service latency at a given request rate
     Example: a response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second
     State management is the main component of a service’s SLA
Design Considerations

 Designed to be an eventually consistent data store
 An “always writeable” data store
 Consistency vs. availability
     To achieve a given level of consistency, replication algorithms are forced to trade off the availability of the data under certain failure scenarios
     To improve availability, Dynamo uses a weaker form of consistency (eventual consistency)
     Allows optimistic replication techniques
         Can lead to conflicting changes, which must be detected and resolved
         The data store or the application performs conflict resolution on reads
 Other key principles
     Incremental scalability – scale out one storage node at a time
     Symmetry – every node has the same set of responsibilities
     Decentralization – favor decentralized peer-to-peer techniques over centralized control
     Heterogeneity – work distribution must be proportional to the capabilities of the individual servers
System Architecture

 Core distributed system techniques used in Dynamo:
     Partitioning, replication, versioning, membership, failure handling, and scaling
System Interface

 Two operations: get() and put()
 get(key) – locates the object replicas associated with the key in the storage system and returns a single object, or a list of objects with conflicting versions along with a context
 put(key, context, object) – determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk
 context – encodes system metadata about the object, such as its version
 An MD5 hash of the key yields a 128-bit identifier that is used to determine the storage nodes responsible for serving the key (see the sketch below)
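
A minimal sketch of this hashing step in Python (illustrative names, not Amazon’s internal API):

```python
import hashlib

def ring_position(key: str) -> int:
    """Hash a key with MD5 to a 128-bit integer position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, byteorder="big")  # 0 .. 2**128 - 1

# Every placement decision (which nodes serve the key) starts from this value.
print(hex(ring_position("cart:customer-42")))
```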
Partitioning Algorithm

 Consistent hashing
     The output range of the hash function is treated as a fixed circular space or “ring”
 Advantage
     Departure or arrival of a node only affects its immediate neighbors
 Issues
     Non-uniform data and load distribution
 Dynamo uses a variant of consistent hashing based on the concept of “virtual nodes” (see the sketch below)
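
A minimal sketch of a consistent-hash ring with virtual nodes, under assumed parameters (8 tokens per physical node); it illustrates the technique, not Dynamo’s production implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """A toy consistent-hash ring in which each physical node owns
    several tokens ("virtual nodes") on the ring."""

    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self.ring = []  # sorted (position, physical node) pairs

    @staticmethod
    def _hash(value: str) -> int:
        # Same MD5-to-128-bit mapping as the interface sketch above.
        return int.from_bytes(hashlib.md5(value.encode()).digest(), "big")

    def add_node(self, node: str) -> None:
        # Multiple tokens per node even out otherwise non-uniform load.
        for i in range(self.vnodes_per_node):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def successor(self, key: str) -> str:
        # Walk clockwise from the key's position to the first token.
        pos = self._hash(key)
        idx = bisect.bisect_right(self.ring, (pos, "\uffff"))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for n in ("A", "B", "C"):
    ring.add_node(n)
print(ring.successor("cart:customer-42"))
```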
Replication

 Data is replicated on multiple hosts
     Reason – to achieve high availability and durability
     Each data item is replicated at N hosts, where N is a parameter configured “per-instance”
 Preference list – the list of nodes responsible for storing a particular key (see the sketch below)

Figure 1: Partitioning and replication of keys in the Dynamo ring.
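
Building on the ring sketch above (it reuses ConsistentHashRing and the ring instance), a preference list can be derived by walking clockwise from the key and collecting n distinct physical nodes; n is an assumed parameter here:

```python
import bisect

def preference_list(ring: ConsistentHashRing, key: str, n: int = 3) -> list:
    """Collect the first n distinct physical nodes clockwise from the
    key's position, skipping extra virtual nodes of the same host."""
    pos = ring._hash(key)
    start = bisect.bisect_right(ring.ring, (pos, "\uffff"))
    nodes = []
    for i in range(len(ring.ring)):
        node = ring.ring[(start + i) % len(ring.ring)][1]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes

print(preference_list(ring, "cart:customer-42"))  # e.g. ['B', 'C', 'A']
```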
Data Versioning

 Dynamo treats the result of each modification as a new and immutable version of the data
 Allows multiple versions of an object to be present in the system at the same time
 Problem
     Version branching: failures combined with concurrent updates can result in conflicting versions of an object
     Updates in the presence of network partitions and node failures can leave an object with distinct version sub-histories
Data Versioning

 Uses vector clocks – a list of (node, counter) pairs attached to each version
 Determines whether two versions of an object are on parallel branches or have a causal ordering (see the sketch below)
 A conflict requires reconciliation
 Conflicting versions are passed to the application as the output of a get() operation
 The application resolves the conflicts and puts back a new (reconciled) version
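
A minimal sketch of the vector-clock comparison rule, representing a clock as a dict mapping node to counter:

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a causally descends from clock b, i.e. every
    counter in b is <= the matching counter in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "identical"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict: parallel branches, reconcile at the application"

# Two updates that raced on different coordinator nodes Sx/Sy/Sz:
v1 = {"Sx": 2, "Sy": 1}
v2 = {"Sx": 2, "Sz": 1}
print(compare(v1, v2))   # conflict: parallel branches, ...
```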
Data Versioning

Figure: Version evolution of an object over time
Execution of get/put operations

 Two strategies for a client to select a node:
     Route the request through a load balancer
     Send the request directly to the coordinator nodes
 Coordinator – the node handling a read or write operation
     Typically the first among the top N nodes in the preference list
 Quorum system (see the sketch below)
     Two key configurable values: R and W
     R – the minimum number of nodes that must participate in a successful read operation
     W – the minimum number of nodes that must participate in a successful write operation
     A quorum-like system requires R + W > N
     (N, R, W) can be chosen to achieve the desired tradeoff
     R and W are usually configured to be less than N to provide better latency
     A write succeeds if at least W-1 nodes respond to the put() request (the coordinator itself writes first)
     A read succeeds if at least R nodes respond to the get() request
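
A minimal sketch of the quorum counting logic under the (3, 2, 2) configuration, with a toy in-memory Replica standing in for a real storage node; it omits failure handling and hinted handoff:

```python
N, R, W = 3, 2, 2        # a common Dynamo configuration

class Replica:
    """Toy in-memory stand-in for a storage node."""
    def __init__(self):
        self.store = {}
    def put(self, key, value):
        self.store[key] = value
        return True                      # acknowledge the write
    def get(self, key):
        return self.store.get(key)

def quorum_write(replicas, key, value) -> bool:
    """The coordinator (replicas[0]) writes locally, then the write
    succeeds once W-1 of the remaining replicas acknowledge."""
    replicas[0].put(key, value)
    acks = 1                             # the coordinator's local write
    for node in replicas[1:N]:
        if node.put(key, value):
            acks += 1
        if acks >= W:
            return True
    return acks >= W

def quorum_read(replicas, key):
    """Collect responses until R replicas answer; conflicting versions
    would be returned together for reconciliation."""
    responses = []
    for node in replicas[:N]:
        value = node.get(key)
        if value is not None:
            responses.append(value)
        if len(responses) >= R:
            return responses
    return None                          # read quorum not met

nodes = [Replica() for _ in range(N)]
print(quorum_write(nodes, "k1", "v1"))   # True
print(quorum_read(nodes, "k1"))          # ['v1', 'v1']
```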
Hinted Handoff

 “Sloppy quorum”
     All read and write operations are performed on the first N healthy nodes in the preference list
     The coordinator is the first node in this group
 A replica sent to a substitute node carries a “hint” in its metadata indicating the original node that should hold it
 Hinted replicas are stored by the available node and forwarded when the original node recovers (see the sketch below)
 Ensures that read and write operations do not fail because of temporary node or network failures
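
A toy sketch of the hinted-handoff bookkeeping (hypothetical helper names; in Dynamo the hinted replicas live in a separate local database that is scanned periodically):

```python
import collections

# hinted[holder] maps a substitute node to the writes it is holding,
# each tagged with the intended (original) node.
hinted = collections.defaultdict(list)

def write_with_hint(substitute: str, original: str, key: str, value) -> None:
    """Store the replica on a substitute node, remembering who owns it."""
    hinted[substitute].append({"intended": original, "key": key, "value": value})

def on_node_recovered(substitute: str, original: str, deliver) -> None:
    """When the original node comes back, forward its hinted replicas."""
    keep = []
    for h in hinted[substitute]:
        if h["intended"] == original:
            deliver(original, h["key"], h["value"])   # hand the write back
        else:
            keep.append(h)
    hinted[substitute] = keep

write_with_hint("D", "A", "k1", "v1")     # A was down, so D held the write
on_node_recovered("D", "A", lambda n, k, v: print(f"forward {k} to {n}"))
```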
Replica synchronization

 Uses Merkle trees to detect inconsistencies between replicas quickly and to minimize the amount of data transferred (see the sketch below)
 A separate tree is maintained by each node for each key range it hosts
 Advantage:
     Each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set
 Disadvantage:
     Adds overhead to recompute the Merkle trees when a node joins or leaves the system and key ranges change
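
A toy sketch of the comparison idea: if the roots of two replicas’ trees match, the key range is in sync; otherwise descend to find the differing leaves:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves: list) -> list:
    """Build a Merkle tree bottom-up; levels[0] holds leaf hashes and
    levels[-1][0] is the root."""
    levels = [[h(leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) if i + 1 < len(prev)
                       else h(prev[i])
                       for i in range(0, len(prev), 2)])
    return levels

# Two replicas of the same key range, differing in one key's value:
a = build_tree([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
b = build_tree([b"k1=v1", b"k2=XX", b"k3=v3", b"k4=v4"])
print(a[-1][0] == b[-1][0])   # False: roots differ, so descend
print([i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y])  # [1]
```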
Membership and Failure Detection

 Ring Membership
     An explicit mechanism adds or removes a node from the ring
     Done by an administrator using a command-line tool or a browser
     A gossip-based protocol propagates membership, partitioning, and placement information via periodic exchanges (see the sketch below)
     Nodes eventually know the key ranges of their peers and can forward requests to them
 External Discovery
     To prevent logical partitions, some nodes play the role of seeds
     “Seed” nodes are discovered via an external mechanism and are known to all nodes
 Failure Detection
     Node failures are detected by lack of responsiveness; recovery is detected by periodic retry
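
A toy sketch of one style of gossip exchange (not Dynamo’s actual wire protocol): each node keeps a versioned membership map and reconciles it with a random peer once per round:

```python
import random

views = {
    "A": {"A": 3, "B": 1},
    "B": {"B": 2, "C": 1},
    "C": {"C": 1},
}

def gossip_round() -> None:
    for node in views:
        peer = random.choice([n for n in views if n != node])
        for mine, theirs in ((views[node], views[peer]),
                             (views[peer], views[node])):
            # Keep the highest version seen for every known member.
            for member, version in theirs.items():
                if version > mine.get(member, -1):
                    mine[member] = version

for _ in range(5):
    gossip_round()
print(views["A"])   # after a few rounds each map covers all members
```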
Experiences & Lessons Learned

 Main patterns in which Dynamo is used:
     Business-logic-specific reconciliation
     Timestamp-based reconciliation
     High-performance read engine
 Client applications can tune the values of N, R, and W
 A common (N, R, W) configuration used by several instances of Dynamo is (3, 2, 2)
 Balancing performance and durability
 Ensuring uniform load distribution
Partitioning & Placement Strategies

Figure: Partitioning and placement of keys in the three strategies. A, B, and C depict the three unique nodes that form the preference list for the key k1 on the consistent hashing ring (N=3). The shaded area indicates the key range for which nodes A, B, and C form the preference list. Dark arrows indicate the token locations for the various nodes.
Partitioning & Placement Strategies

 Strategy 1: T random tokens per node, partition by token value
     A new node must “steal” its key ranges from other nodes
     Bootstrapping of a new node is lengthy
         Other nodes process the scanning and transmission of key ranges for the new node as background activities
     Disadvantages:
         Numerous nodes have to adjust their Merkle trees when a node joins or leaves the system
         Archiving the entire key space is highly inefficient
Partitioning & Placement Strategies

 Strategy 2: T random tokens per node, equal-sized partitions
     The hash space is divided into Q equally sized partitions
     Q >> N and Q >> S*T, where S is the number of nodes in the system
     Advantages:
         Decouples partitioning from partition placement
         Allows the placement scheme to change at run time
 Strategy 3: Q/S tokens per node, equal-sized partitions (see the sketch below)
     Also decouples partitioning from placement
     Advantages:
         Faster bootstrapping/recovery
         Ease of archival
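
A toy sketch of the decoupling in strategies 2 and 3, with assumed values Q=16 and S=4: the partitions are fixed and equal-sized, while placement is a separate mapping that can change independently of the hash function:

```python
import hashlib

Q = 16                        # number of fixed, equal-sized partitions (toy value)
nodes = ["A", "B", "C", "D"]  # S = 4 nodes, so each owns Q/S = 4 partitions

# Round-robin placement of the Q partitions onto nodes -- a stand-in for
# real placement, which must also keep N distinct replica holders per partition.
placement = {p: nodes[p % len(nodes)] for p in range(Q)}

def partition_of(key: str) -> int:
    """Map a key's 128-bit MD5 hash onto one of the Q fixed partitions."""
    digest = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    return digest * Q >> 128

p = partition_of("cart:customer-42")
print(p, placement[p])
```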
Partitioning & Placement Strategies

 The strategies have different tuning parameters
 A fair way to compare them is to evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
 Strategy 3 achieves the best load-balancing efficiency
Client-driven or Server-driven Coordination

 Any node can coordinate read requests; write requests are handled by a coordinator node
 The state machine for coordination can live in a load-balancing server or be incorporated into the client
 Client-driven coordination has lower latency because it avoids the extra network hop of redirection
Thank You