DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.

AMAZON'S KEY-VALUE STORE: DYNAMO
Adapted from Amazon's Dynamo presentation (UCSB CS271)

Motivation
• Reliability at a massive scale: even the slightest outage has significant financial consequences.
• High write availability.
• Amazon's platform: tens of thousands of servers and network components, geographically dispersed.
• Provide persistent storage in spite of failures.
• Sacrifice consistency to achieve performance, reliability, and scalability.

Dynamo Design Rationale
• Most services need only key-based access: best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on.
• The prevalent application design based on RDBMS technology would be catastrophic at this scale.
• Dynamo therefore provides a primary-key-only interface.

Dynamo Design Overview
• Data partitioning using consistent hashing
• Data replication
• Consistency via version vectors
• Replica synchronization via a quorum protocol
• Gossip-based failure detection and membership protocol

System Requirements
• Data and query model:
  – Read/write operations via primary key
  – No relational schema: <key, value> objects
  – Object size typically < 1 MB
• Consistency guarantees:
  – Weak
  – Only single-key updates
  – Not clear whether read-modify-write sequences are isolated
• Efficiency:
  – SLAs stated at the 99.9th percentile of operations
• Notes:
  – Commodity hardware
  – Minimal security measures, since the system is for internal use

Service Level Agreements (SLAs)
• An application must deliver its functionality in a bounded time, so every dependency in the platform needs to deliver its functionality with even tighter bounds.
• Example SLA: a service guarantees a response within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.

System Interface
• Two basic operations:
  – get(key):
    • Locates the object's replicas
    • Returns the object together with a context (metadata that includes the version)
  – put(key, context, object):
    • Writes the replicas to disk
    • Context: the version (a vector timestamp)
• Hash(key) produces a 128-bit identifier.

Partition Algorithm
• Consistent hashing: the output range of the hash function is treated as a fixed circular space or "ring", a la Chord.
• "Virtual nodes": each physical node can be responsible for more than one virtual node, to deal with non-uniform data and load distribution.

Virtual Nodes (figure)

Advantages of Using Virtual Nodes
• The number of virtual nodes a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
• A real node's load is distributed across the ring, ensuring that a hot spot is not targeted at a single node.
• If a node becomes unavailable, the load it handled is evenly dispersed across the remaining available nodes.
• When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes.

Replication
• Each data item is replicated at N hosts.
• Preference list: the list of nodes responsible for storing a particular key.
• Some fine-tuning is needed to account for virtual nodes (see the ring sketch below).

Preference Lists
• The list of nodes responsible for storing a particular key.
• To tolerate failures, the preference list contains more than N nodes.
• Because of virtual nodes, the preference list skips ring positions as needed so that it contains only distinct physical nodes.

Data Versioning
• A put() call may return to its caller before the update has been applied at all the replicas.
• A get() call may therefore return many versions of the same object.
• Challenge: an object may have several distinct versions at once.
• Solution: use vector clocks to capture causality between different versions of the same object.

Vector Clock
• A vector clock is a list of (node, counter) pairs.
• Every version of every object is associated with one vector clock.
• If all the counters on the first object's clock are less than or equal to all of the counters in the second clock, then the first is an ancestor of the second and can be forgotten.
• Otherwise the versions are concurrent: the application reconciles the divergent versions and collapses them into a single new version.

Vector Clock Example (figure)
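To make the ancestor test and reconciliation step above concrete, here is a minimal sketch (illustrative Python, not Dynamo's code) in which a clock is a dict from node name to counter; the Sx/Sy/Sz node names and cart versions are made up in the style of the vector clock example.

```python
# Minimal vector-clock sketch for the versioning rule above: a clock is a
# dict {node: counter}; one version is an ancestor of another (and can be
# discarded) iff every one of its counters is <= the other's.
def descends(a: dict, b: dict) -> bool:
    """True if clock a dominates clock b, i.e. b is an ancestor of a."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())


def reconcile(versions):
    """Drop versions that are ancestors of some other version; whatever
    remains is a set of concurrent siblings the application must merge."""
    survivors = []
    for value, clock in versions:
        if not any(descends(other, clock) and clock != other
                   for _, other in versions):
            survivors.append((value, clock))
    return survivors


v1 = ("cart-v1", {"Sx": 1})
v2 = ("cart-v2", {"Sx": 2})              # overwrites v1
v3 = ("cart-v3", {"Sx": 2, "Sy": 1})     # descends from v2
v4 = ("cart-v4", {"Sx": 2, "Sz": 1})     # concurrent with v3
print(reconcile([v1, v2, v3, v4]))       # leaves the concurrent siblings v3 and v4
```

Only the two concurrent siblings survive; a subsequent put() that carries the context of both lets the application collapse them into a single new version.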
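Returning to the partitioning scheme: the sketch below shows consistent hashing with virtual nodes and preference-list construction as described earlier, under assumed details (MD5 as the 128-bit hash, an invented Ring class with tokens_per_node and n_replicas parameters). It is not Dynamo's actual implementation.

```python
# Sketch of consistent hashing with virtual nodes and preference lists.
import bisect
import hashlib


def hash_to_int(key: str) -> int:
    """Map a key (or virtual-node token) onto a 128-bit ring via MD5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, tokens_per_node=8, n_replicas=3):
        self.n_replicas = n_replicas
        # Each physical node owns several virtual nodes (tokens) on the ring.
        self.tokens = sorted(
            (hash_to_int(f"{node}#vnode{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self.positions = [pos for pos, _ in self.tokens]

    def preference_list(self, key: str):
        """Walk the ring clockwise from hash(key), skipping virtual nodes whose
        physical node was already chosen, until N distinct nodes are found."""
        start = bisect.bisect_right(self.positions, hash_to_int(key))
        chosen = []
        for i in range(len(self.tokens)):
            _, node = self.tokens[(start + i) % len(self.tokens)]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == self.n_replicas:
                break
        return chosen


ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("shopping-cart-42"))   # e.g. ['C', 'A', 'E']
```

Because consecutive positions on the ring can belong to the same physical node, the walk skips duplicates until N distinct physical nodes have been collected, which is the fine-tuning of preference lists mentioned above.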
Routing Requests
• Option 1: route the request through a generic load balancer that selects a node based on load information.
• Option 2: use a partition-aware client library that routes requests directly to the relevant node.
• A gossip protocol propagates membership changes: every second each node contacts a peer chosen at random, and the two nodes reconcile their membership change histories.

Sloppy Quorum
• R (respectively W) is the minimum number of nodes that must participate in a successful read (respectively write) operation.
• Setting R + W > N yields a quorum-like system.
• In this model the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.

Highlights of Dynamo
• High write availability.
• Optimistic approach: vector clocks for conflict resolution.
• Consistent hashing (as in Chord), but in a controlled environment.
• Quorums for relaxed consistency.

Lakshman and Malik: Cassandra - A Decentralized Structured Storage System. LADIS 2009.

CASSANDRA (FACEBOOK)

Data Model
• A key-value store, but with a data model closer to Bigtable's.
• Essentially a distributed multi-dimensional map indexed by a key.
• The value is structured into columns, which are grouped into column families: simple and super (a column family within a column family).
• An operation is atomic on a single row.
• API: insert, get, and delete.

System Architecture
• Like Dynamo (and Chord).
• Uses an order-preserving hash function on a fixed circular space; the node responsible for a key is called its coordinator.
• Order-preserving hashing can produce non-uniform data distribution: the system keeps track of the distribution and reorganizes if necessary.

Replication
• Each item is replicated at N hosts.
• Replica placement can be: rack unaware; rack aware (within a data center); or datacenter aware.
• The system has an elected leader.
• When a node joins the system, the leader assigns it a range of data items and replicas.
• Each node is aware of every other node in the system and the range it is responsible for.

Membership and Failure Detection
• A gossip-based mechanism maintains cluster membership.
• Each node determines which nodes are up and down using a failure detector.
• The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node.
• If a node declares A faulty when Φ = 1, 2, or 3, the likelihood that the suspicion is a mistake is roughly 10%, 1%, and 0.1%, respectively.
• Every node maintains a sliding window of interarrival times of gossip messages from other nodes, uses it to estimate the interarrival-time distribution, and then calculates Φ; the distribution is approximated with an exponential distribution.
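A compact sketch of the accrual detector just described, using the exponential approximation: Φ = -log10 P(no gossip message for the elapsed time), which for an exponential distribution with mean μ reduces to elapsed / (μ ln 10). The window size and the suspicion threshold below are illustrative choices, not values from the paper.

```python
# Sketch of a phi accrual failure detector with an exponential model of
# gossip interarrival times. Illustrative only.
import math
from collections import deque


class PhiAccrualDetector:
    def __init__(self, window_size=100, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # sliding window of interarrival times
        self.last_heartbeat = None
        self.threshold = threshold                  # illustrative suspicion cutoff

    def heartbeat(self, now: float):
        """Record the arrival of a gossip message from the monitored node."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Phi = 1, 2, 3 means roughly a 10%, 1%, 0.1% chance of a false suspicion."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(interval > elapsed) = exp(-elapsed / mean),
        # so -log10 of that probability is elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))

    def suspect(self, now: float) -> bool:
        return self.phi(now) > self.threshold


d = PhiAccrualDetector()
for t in (0.0, 1.0, 2.1, 3.0, 4.1):        # gossip messages roughly every second
    d.heartbeat(t)
print(d.phi(5.0), d.suspect(5.0))          # small phi: node still looks alive
print(d.phi(25.0), d.suspect(25.0))        # ~20 s of silence: phi is high, node suspected
```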
Operations
• Use quorums: R and W.
• If R + W > N, then a read returns the latest value.
  – Read operations return the value with the highest timestamp among the replicas contacted, so with weaker settings they may return older versions.
  – Read repair: with every read, send the newest version to any out-of-date replicas.
  – Anti-entropy: compute a Merkle tree over the data to catch any out-of-sync replicas (expensive).
• Each write goes first into a persistent commit log and then into an in-memory data structure.
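The quorum read and read-repair bullets above can be sketched as follows; the Replica class and quorum_read function are illustrative stand-ins rather than Cassandra's API, and a real system would typically perform the repair writes asynchronously.

```python
# Sketch of a timestamp-based quorum read with read repair: contact R
# replicas, return the value with the highest timestamp, and push that
# value back to any stale replica.
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}                     # key -> (timestamp, value)

    def read(self, key):
        return self.store.get(key)

    def write(self, key, timestamp, value):
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)


def quorum_read(replicas, key, r):
    """Read from R replicas (here simply the first R in the list),
    pick the newest version, and read-repair the stale ones."""
    responses = [(rep, rep.read(key)) for rep in replicas[:r]]
    newest = max((resp for _, resp in responses if resp is not None),
                 default=None, key=lambda resp: resp[0])
    if newest is None:
        return None
    timestamp, value = newest
    for rep, resp in responses:
        if resp is None or resp[0] < timestamp:
            rep.write(key, timestamp, value)    # read repair
    return value


replicas = [Replica("r1"), Replica("r2"), Replica("r3")]
replicas[0].write("k", 1, "old")
replicas[1].write("k", 2, "new")            # r3 missed both writes
print(quorum_read(replicas, "k", r=3))      # 'new'; r1 and r3 are repaired
print(replicas[2].read("k"))                # (2, 'new') after read repair
```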
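The anti-entropy step can be illustrated with a toy Merkle tree: each replica hashes its sorted key range into a tree, and two replicas compare subtree hashes top-down so that only the keys under differing hashes need to be exchanged. This sketch assumes both replicas hold the same key set, which real implementations do not require.

```python
# Toy Merkle-tree comparison for anti-entropy between two replicas.
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle(items):
    """items: sorted list of (key, value). Returns (hash, node), where a node
    is either ("leaf", (key, value)) or ("inner", left_subtree, right_subtree)."""
    if len(items) == 1:
        key, value = items[0]
        return h(f"{key}={value}".encode()), ("leaf", items[0])
    mid = len(items) // 2
    left, right = merkle(items[:mid]), merkle(items[mid:])
    return h(left[0] + right[0]), ("inner", left, right)


def diff(a, b):
    """Recursively compare two equal-shaped trees; yield keys that disagree."""
    if a[0] == b[0]:                 # identical subtree hashes: nothing to sync
        return
    if a[1][0] == "leaf":
        yield a[1][1][0]
    else:
        yield from diff(a[1][1], b[1][1])
        yield from diff(a[1][2], b[1][2])


replica_a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")]
replica_b = [("k1", "v1"), ("k2", "v2-stale"), ("k3", "v3"), ("k4", "v4")]
print(list(diff(merkle(replica_a), merkle(replica_b))))   # ['k2']
```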