Dynamo: Highly Available Key-Value Store
Dennis Kafura – CS5204 – Operating Systems

Dynamo Context
- Core e-commerce services need scalable and reliable storage for massive amounts of data
  - n x 100 services
  - n x 100,000 concurrent sessions on key services
- Size and scalability require a storage architecture that is
  - highly decentralized
  - built from a high count of commodity-hardware components
- The high component count creates reliability problems ("treats failure handling as the normal case")
- Reliability problems are addressed by replication
- Replication raises issues of
  - consistency (replicas differ after a failure)
  - performance
  - when to enforce consistency (on read, on write)
  - who enforces consistency (client, storage system)

Dynamo System Elements
- Maintains the state of services that are
  - used only internally
  - subject to high reliability requirements
  - latency-sensitive
  - able to control the tradeoff between consistency and performance
- Can leverage the characteristics of the services and workloads
  - non-hostile environment (no security requirements)
- Simple key-value interface
  - applications do not require more complicated (e.g., database) semantics or a hierarchical name space
  - the key is a unique identifier for a data item; the value is a binary object (blob)
  - no operations span multiple data items
- Adopts a weaker model of consistency (eventual consistency) in favor of higher availability
- Service level agreements (SLAs)
  - measured at the 99.9th percentile
  - key factor: service latency at a given request rate
  - example: response time of 300 ms for 99.9% of requests at a peak client load of 500 requests per second
  - state-management (storage) efficiency is a key factor in meeting SLAs

Dynamo Design Considerations
- Consistency vs. availability
  - strict consistency means that data is unavailable when one of the replicas fails
  - to improve availability, use a weaker form of consistency (eventual consistency) and allow optimistic updates (changes propagate in the background)
- Conflicts
  - optimistic updates can lead to conflicting changes, which must be detected and resolved
  - Dynamo applications require "always writeable" storage, so conflict detection/resolution is performed on reads
- Other factors
  - incremental scalability
  - symmetry/decentralization (P2P organization and control)
  - heterogeneity (not all servers are the same)

Dynamo Design Overview
- (overview figure)

Dynamo Partitioning
- Interface
  - get(key): returns a context and either a single object or a list of conflicting objects
  - put(key, context, object): the context comes from a previous read
- Object placement/replication
  - an MD5 hash of the key yields a 128-bit identifier
  - consistent hashing determines the preference list of nodes responsible for the key (see the sketch below)
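As a rough illustration of the partitioning scheme, the sketch below hashes a key with MD5 onto a ring of node tokens and walks clockwise to collect a preference list of N distinct nodes. It is a minimal sketch, not Dynamo's implementation: the `Ring` class, the single token per node, and the node names are assumptions made for brevity (Dynamo assigns many tokens, i.e. virtual nodes, per physical host).

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hashing ring; one token per node for simplicity."""

    def __init__(self, nodes):
        # Place each node at the MD5 hash of its name.
        self._tokens = sorted((self._hash(name), name) for name in nodes)

    @staticmethod
    def _hash(data):
        # MD5 yields a 128-bit identifier, as in the slides.
        return int(hashlib.md5(data.encode()).hexdigest(), 16)

    def preference_list(self, key, n=3):
        """Walk clockwise from hash(key) and collect the first n distinct nodes."""
        start = bisect.bisect(self._tokens, (self._hash(key), ""))
        nodes = []
        for i in range(len(self._tokens)):
            node = self._tokens[(start + i) % len(self._tokens)][1]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = Ring(["node-A", "node-B", "node-C", "node-D", "node-E"])
print(ring.preference_list("cart:12345"))  # the N nodes that replicate this key
```

The first node in the returned list acts as the coordinator for reads and writes on that key, as the later slides on get/put handling describe.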
Dynamo Versioning
- Failure-free operation: a put is propagated to all replicas of the key
- What should happen when some replicas cannot be reached?
- Object content is treated as immutable; an update operation creates a new version
- Versioning can lead to inconsistency
  - due to network partitioning: some replicas still hold v1 while others have accepted v2
  - due to concurrent updates: put_a and put_b each derive a new version (v2a, v2b) from v1, and different replicas see different ones

Dynamo Object Resolution
- Uses vector clocks
- Conflicting versions are passed to the application as the output of a get operation (see the sketch below)
- The application resolves the conflict and puts a new (consistent) version
- Inconsistent versions are rare: 99.94% of get operations saw exactly one version
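To make the resolution step concrete, here is a minimal vector-clock sketch; the names `descends`, `reconcile`, and the per-coordinator counters `Sx`/`Sy` are illustrative assumptions, not Dynamo's actual API. A version whose clock is dominated by another is superseded and dropped; versions with incomparable clocks are concurrent and are returned together for the application to reconcile.

```python
def descends(a, b):
    """True if clock a dominates clock b, i.e. version a already reflects version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile(versions):
    """Keep only versions no other version descends from: one winner, or a set of conflicts."""
    survivors = []
    for clock, value in versions:
        superseded = any(
            other != clock and descends(other, clock) for other, _ in versions
        )
        if not superseded:
            survivors.append((clock, value))
    return survivors

# v2 descends from v1; v3 was produced concurrently by a different coordinator (Sy).
v1 = ({"Sx": 1}, "cart=[book]")
v2 = ({"Sx": 2}, "cart=[book, pen]")
v3 = ({"Sx": 1, "Sy": 1}, "cart=[book, mug]")

print(reconcile([v1, v2, v3]))  # v1 is dropped; v2 and v3 come back as conflicting versions
```

In this scheme a get would return the surviving versions together with their clocks as the context; the application merges them and issues a put with that context, allowing the coordinator to generate a new clock that descends from both branches.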
Dynamo Handling of get/put Operations
- Operations are handled by a coordinator: the first among the top N nodes in the preference list
- The coordinator is located by either
  - a call to a load balancer (no Dynamo-specific code needed in the application, but possibly an extra level of indirection)
  - a direct call to the coordinator (via a Dynamo-specific client library)
- Quorum voting
  - R nodes must agree to a get operation
  - W nodes must agree to a put operation
  - R + W > N
  - (N, R, W) can be chosen to achieve the desired tradeoff; a common configuration is (3, 2, 2)
- "Sloppy quorum"
  - uses the top N' healthy nodes in the preference list; the coordinator is the first node in this group
  - a replica sent to a stand-in node carries a "hint" naming the (unavailable) node that should hold it
  - hinted replicas are stored by the available node and forwarded when the original node recovers

Dynamo Replica Synchronization
- Detection of inconsistent replicas is accelerated using Merkle trees
- A separate tree is maintained by each node for each key range it hosts
- Maintaining the Merkle trees adds overhead

Dynamo Ring Membership
- Nodes are explicitly added to and removed from the ring
- Membership, partitioning, and placement information propagates via periodic pairwise exchanges (a gossip protocol)
- Existing nodes transfer key ranges to a newly added node, or receive key ranges from exiting nodes
- Each node eventually knows the key ranges of its peers and can forward requests to them
- Some "seed" nodes are well known to all nodes
- Node failures are detected by lack of responsiveness; recovery is detected by periodic retry

Dynamo Partition/Placement Strategies
  Strategy   Placement                  Partition
  1          T random tokens per node   consecutive tokens delimit a partition
  2          T random tokens per node   Q equal-sized partitions
  3          Q/S tokens per node        Q equal-sized partitions
  (Q = number of partitions, S = number of nodes)

Dynamo Strategy Performance Factors
- Strategy 1
  - bootstrapping a new node is lengthy
    - it must acquire its key ranges from other nodes
    - those nodes scan and transmit the key ranges as background activities
    - this has taken a full day during peak periods
  - numerous nodes may have to adjust their Merkle trees when a node joins or leaves the system
  - the archival process is difficult
    - key ranges may be in transit
    - there is no obvious synchronization/checkpointing structure
- Strategy 2
  - decouples partitioning and placement
  - allows the placement scheme to be changed at run time
- Strategy 3
  - decouples partitioning and placement
  - faster bootstrapping/recovery and easier archiving, because key ranges can be segregated into different files that can be shared or archived separately

Dynamo Partition Strategies: Performance
- The strategies have different tuning parameters
- Fair comparison: evaluate the skew in their load distributions for a fixed amount of space used to maintain membership information
- Strategy 3 is superior

Dynamo Client- vs. Server-Side Coordination
- Any node can coordinate read requests; write requests are handled by the key's coordinator
- The coordination state machine can run in a load-balancing server or be incorporated into the client
- Client-driven coordination has lower latency because it avoids the extra network hop of redirection (see the sketch below)
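The coordination path can be sketched in a few lines, assuming the (N, R, W) = (3, 2, 2) configuration mentioned earlier. Everything here is a simplified illustration rather than Dynamo's implementation: the `Node` class, `coordinate_put`/`coordinate_get`, and the use of exceptions to model unreachable replicas are assumptions, and versioning context plus hinted-handoff bookkeeping are omitted.

```python
class Node:
    """Toy replica: an in-memory store that can be marked unavailable."""
    def __init__(self, name):
        self.name, self.data, self.up = name, {}, True

    def store(self, key, value):
        if not self.up:
            raise ConnectionError(self.name)
        self.data[key] = value

    def fetch(self, key):
        if not self.up:
            raise ConnectionError(self.name)
        return self.data.get(key)

def coordinate_put(key, value, preference_list, w=2):
    """Send the write to healthy nodes in preference order; succeed once W acknowledge."""
    acks = 0
    for node in preference_list:          # sloppy quorum: simply walk past failed nodes
        try:
            node.store(key, value)        # a stand-in node would also record a hint here
            acks += 1
        except ConnectionError:
            continue
        if acks >= w:
            return True
    return False                          # fewer than W replicas written

def coordinate_get(key, preference_list, r=2):
    """Collect replies until R replicas answer; the caller reconciles divergent versions."""
    replies = []
    for node in preference_list:
        try:
            replies.append(node.fetch(key))
        except ConnectionError:
            continue
        if len(replies) >= r:
            return replies
    return None                           # read quorum not met

replicas = [Node("A"), Node("B"), Node("C")]             # top N = 3 nodes for this key
replicas[1].up = False                                   # one replica is unreachable
print(coordinate_put("cart:12345", "[book]", replicas))  # True: A and C provide W = 2 acks
print(coordinate_get("cart:12345", replicas))            # ['[book]', '[book]'] from A and C
```

Whether these functions run inside a Dynamo node reached through the load balancer or inside a client library is exactly the client- vs. server-side coordination choice discussed on the last slide.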