DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-value Store. SOSP 2007
AMAZON’S KEY-VALUE STORE:
DYNAMO
Adapted from Amazon’s Dynamo Presentation
Motivation
• Reliability at a massive scale
• Slightest outage → significant financial consequences
• High write availability
• Amazon's platform: 10s of thousands of servers and network components, geographically dispersed
• Provide persistent storage in spite of failures
• Sacrifice consistency to achieve performance, reliability, and scalability
Dynamo Design rationale
• Most services need key-based access:
– Best-seller lists, shopping carts, customer
preferences, session management, sales rank,
product catalog, and so on.
• The prevalent application design based on RDBMS technology would be a poor fit: it would limit scale and availability.
• Dynamo therefore provides a primary-key-only interface.
Dynamo Design Overview
• Data partitioning using consistent hashing
• Data replication
• Consistency via version vectors
• Replica synchronization via quorum protocol
• Gossip-based failure-detection and membership protocol
System Requirements
• Data & Query Model:
– Read/write operations via primary key
– No relational schema: use <key, value> object
– Object size < 1 MB, typically.
• Consistency guarantees:
– Weak
– Only single key updates
– Unclear whether read-modify-write operations are isolated
• Efficiency:
– SLAs measured at the 99.9th percentile of operations
• Notes:
– Commodity hardware
– Minimal security measures, since it is for internal use
Service Level Agreements (SLA)
• An application can deliver its functionality in a bounded time only if every dependency in the platform delivers its functionality with even tighter bounds.
• Example SLA: service guaranteeing that it will
provide a response within 300ms for 99.9% of its
requests for a peak client load of 500 requests per
second.
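A rough, illustrative check of such an SLA (not from the slides; the function name and the percentile arithmetic are my own) against a window of measured request latencies, in Python:

import math

def meets_sla(latencies_ms, bound_ms=300.0, pct=99.9):
    """True if at least pct% of the requests completed within bound_ms."""
    ordered = sorted(latencies_ms)
    # Index of the last request that still has to be within the bound.
    idx = max(0, math.ceil(len(ordered) * pct / 100.0) - 1)
    return ordered[idx] <= bound_ms

# e.g. with 10,000 samples, the 9,990th-fastest request must finish within 300 ms.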
System Interface
• Two basic operations:
– Get(key):
• Locates replicas
• Returns the object + context (encodes metadata, including version)
– Put(key, context, object):
• Writes the object's replicas to disk
• Context: version (vector timestamp)
• Hash(key) → 128-bit identifier
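A minimal sketch of this interface, assuming MD5 as the hash (as in the Dynamo paper); the function name is illustrative:

import hashlib

def key_id(key: str) -> int:
    """Hash the key to the 128-bit identifier that places it on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

# get(key)               -> (list of object versions, opaque context)
# put(key, context, obj) -> writes obj to the key's replicas under that context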
Partition Algorithm
• Consistent hashing: the
output range of a hash function
is treated as a fixed circular
space or “ring” a la Chord.
• “Virtual Nodes”: Each node
can be responsible for more
than one virtual node (to deal
with non-uniform data and load
distribution)
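A minimal sketch, assuming MD5 positions and illustrative names, of a consistent-hash ring in which each physical node owns several virtual-node positions:

import bisect
import hashlib

def ring_pos(s: str) -> int:
    """Map a string to a position on the 128-bit ring."""
    return int.from_bytes(hashlib.md5(s.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node is hashed to several positions ("virtual nodes").
        self.ring = sorted(
            (ring_pos(f"{n}#{i}"), n) for n in nodes for i in range(vnodes_per_node)
        )
        self.positions = [pos for pos, _ in self.ring]

    def coordinator(self, key: str) -> str:
        """First node met walking clockwise from the key's position."""
        i = bisect.bisect_right(self.positions, ring_pos(key)) % len(self.ring)
        return self.ring[i][1]

# Ring(["A", "B", "C"]).coordinator("cart:42") -> one of "A", "B", "C"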
Virtual Nodes (figure)
Advantages of using virtual nodes
• The number of virtual nodes a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
• A physical node's load is spread across the ring, so a hot spot does not land on a single node.
• If a node becomes unavailable the load
handled by this node is evenly dispersed
across the remaining available nodes.
• When a node becomes available again,
the newly available node accepts a
roughly equivalent amount of load from
each of the other available nodes.
Replication
• Each data item is
replicated at N hosts.
• preference list: the list of nodes responsible for storing a particular key.
• Some fine-tuning to
account for virtual nodes
Preference Lists
• List of nodes responsible for storing a
particular key.
• To tolerate failures, the preference list contains more than N nodes.
• Because of virtual nodes, the preference list skips ring positions so that it contains only distinct physical nodes.
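Continuing the ring sketch from the partitioning slide (illustrative, not Dynamo's actual code; it reuses bisect, ring_pos, and Ring from that sketch), a preference list walks clockwise and skips virtual nodes whose physical node is already on the list:

def preference_list(ring, key, n):
    """First n distinct physical nodes clockwise from the key's position."""
    start = bisect.bisect_right(ring.positions, ring_pos(key))
    nodes = []
    for j in range(len(ring.ring)):
        node = ring.ring[(start + j) % len(ring.ring)][1]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes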
Data Versioning
• A put() call may return to its caller before the
update has been applied at all the replicas
• A get() call may return many versions of the
same object.
• Challenge: an object may have distinct versions
• Solution: use vector clocks in order to capture
causality between different versions of same object.
Vector Clock
• A vector clock is a list of (node, counter) pairs.
• Every version of every object is associated
with one vector clock.
• If every counter in the first object's clock is less than or equal to the corresponding counter in the second clock, then the first is an ancestor of the second and can be forgotten.
• Application reconciles divergent versions and
collapses into a single new version.
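A small sketch of that ancestor test, representing a vector clock as a map from node id to counter (illustrative, not Dynamo's code):

def descends(vc_a, vc_b):
    """True if vc_a is an ancestor of (or equal to) vc_b: every counter in
    vc_a is <= the corresponding counter in vc_b."""
    return all(vc_b.get(node, 0) >= counter for node, counter in vc_a.items())

a = {"Sx": 2}
b = {"Sx": 2, "Sy": 1}
assert descends(a, b)        # a can be forgotten
assert not descends(b, a)    # b is not an ancestor of a
# If neither clock descends from the other, the versions are divergent and
# the application must reconcile them into a single new version.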
Vector clock example (figure)
Routing requests
• Route request through a generic load
balancer that will select a node based on
load information.
• Use a partition-aware client library that routes requests directly to the relevant node.
• A gossip protocol propagates membership
changes. Each node contacts a peer
chosen at random every second and the
two nodes reconcile their membership
change histories.
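A rough sketch (illustrative data layout, not Dynamo's actual protocol code) of one such reconciliation, where each node keeps a versioned membership map and the higher version wins:

def gossip_round(local, remote):
    """Merge two membership maps: node -> (version, status)."""
    for node, (version, status) in remote.items():
        if node not in local or local[node][0] < version:
            local[node] = (version, status)
    for node, (version, status) in local.items():
        if node not in remote or remote[node][0] < version:
            remote[node] = (version, status)

# Every second each node runs this against a randomly chosen peer's view.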
Sloppy Quorum
• R and W are the minimum numbers of nodes that must participate in a successful read or write operation, respectively.
• Setting R + W > N yields a quorum-like system.
• In this model, the latency of a get (or put)
operation is dictated by the slowest of the R
(or W) replicas. For this reason, R and W are
usually configured to be less than N, to
provide better latency and availability.
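For example, with N = 3 a common configuration is R = W = 2: R + W = 4 > 3, so every read quorum overlaps every write quorum. A tiny illustrative check:

def quorums_intersect(n, r, w):
    """R + W > N guarantees that read and write quorums share a replica."""
    return r + w > n

assert quorums_intersect(3, 2, 2)        # typical setting
assert not quorums_intersect(3, 1, 1)    # fast, but reads may miss the latest write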
Highlights of Dynamo
• High write availability
• Optimistic: vector clocks for resolution
• Consistent hashing (Chord) in a controlled environment
• Quorums for relaxed consistency.
Lakshman, Malik: Cassandra: A Decentralized Structured Storage System. LADIS 2009
CASSANDRA (FACEBOOK)
Data Model
• A key-value store, though its data model is closer to Bigtable's.
• Basically, a distributed multi-dimensional map
indexed by a key.
• Value is structured into Columns, which are
grouped into Column Families: simple and
super (column family within a column family).
• An operation is atomic on a single row.
• API: insert, get and delete.
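A rough sketch of that data model as nested maps (the keys and values here are illustrative): row key → column family → column → value.

# keyspace[row_key][column_family][column] -> (value, timestamp)
keyspace = {
    "user:42": {
        "profile": {                      # simple column family
            "name": ("Alice", 1700000000),
            "email": ("alice@example.com", 1700000005),
        },
        "posts": {                        # super column family:
            "2024-01-01": {               # super column -> columns
                "title": ("Hello", 1700000100),
            },
        },
    },
}
# insert/get/delete address a single row ("user:42"), which is why a
# single-row operation can be made atomic.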
System Architecture
• Like Dynamo (and Chord).
• Uses an order-preserving hash function on a fixed circular space. The node responsible for a key is called the coordinator.
• Non-uniform data distribution: keep track of
data distribution and reorganize if necessary.
Replication
• Each item is replicated at N hosts.
• Replicas can be: Rack Unaware; Rack Aware
(within a data center); Datacenter Aware.
• System has an elected leader.
• When a node joins the system, the leader
assigns it a range of data items and replicas.
• Each node is aware of every other node in the
system and the range they are responsible for.
Membership and Failure Detection
• Gossip-based mechanism to maintain cluster membership.
• A node determines which nodes are up and down using a
failure detector.
• The Φ accrual failure detector returns a suspicion level, Φ,
for each monitored node.
• If a node suspects A when Φ = 1, 2, or 3, the likelihood of that suspicion being a mistake is roughly 10%, 1%, and 0.1%, respectively.
• Every node maintains a sliding window of inter-arrival times of gossip messages from other nodes, estimates their distribution, and computes Φ from it; the distribution is approximated as exponential.
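With the exponential approximation, Φ has a simple closed form: if the mean inter-arrival time is m and the last gossip message arrived t seconds ago, the probability of still hearing nothing is e^(-t/m), so Φ = -log10(e^(-t/m)) = t / (m ln 10). An illustrative sketch:

import math

def phi(time_since_last, interarrival_window):
    """Accrual suspicion level under an exponential inter-arrival model."""
    mean = sum(interarrival_window) / len(interarrival_window)
    return time_since_last / (mean * math.log(10))

# A mean gossip gap of ~1 s and 3 s of silence give phi ~= 1.3;
# suspecting the node at phi >= 1 would be wrong about 10% of the time.
print(round(phi(3.0, [0.9, 1.0, 1.1]), 2))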
Operations
• Use quorums: R and W
• If R + W > N, then a read will return the latest value.
– Read operations return the value with the highest timestamp among the replicas contacted, so with weaker settings they may return older versions
– Read Repair: with every read, send the newest version to any out-of-date replicas.
– Anti-Entropy: compare Merkle trees to catch any out-of-sync data (expensive)
• Each write goes first into a persistent commit log, then into an in-memory data structure.
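A minimal sketch (illustrative, not Cassandra's actual code) of that write path: append to a durable commit log, then update the in-memory table.

import json
import os
import time

class WritePath:
    def __init__(self, commit_log_path):
        self.log = open(commit_log_path, "a")    # persistent commit log
        self.memtable = {}                       # in-memory structure

    def write(self, key, value):
        record = {"key": key, "value": value, "ts": time.time()}
        # 1. Append to the commit log and flush/fsync so the write survives a crash.
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then apply the write to the in-memory structure.
        self.memtable[key] = (value, record["ts"])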