OceanStore: An Architecture for Global-Scale Persistent Storage

Introduction
 Vision: ubiquitous computing devices
 Goal: transparency




Where to store persistent information?
How to protect against system failures?
How to upgrade components without losing
configuration info?
How to manage consistency?
Introduction
 Requirements
 Intermittent connectivity
 Secure from theft and denial-of-service
 Durable information


Information divorced from location




Automatic and reliable archival services
Geographically distributed servers
Caching close to clients
Information can migrate to wherever it is needed
Scale: 10^10 users, each with 10,000 files
OceanStore: A True Data Utility
 Utility model: consumers pay a monthly fee
in exchange for access to persistent storage





Highly available data from anywhere
Automatic replication for disaster recovery
Strong security
Providers would buy and sell capacity among
themselves for mobile users
Deep archival storage: use excess of storage
space to ease data management
Two Unique Goals
 Use untrusted infrastructure



May crash without warning
Encrypted information in the infrastructure
Responsible party is financially responsible for
the integrity of data
 Support nomadic data


Data can be cached anywhere, anytime
Continuous introspective monitoring to
manage caching and locality
System Overview
 The fundamental unit in OceanStore: a
persistent object



Named by a globally unique identifier (GUID)
Replicated and stored on multiple servers
Independent of the server (floating replicas)
 Two mechanisms to locate a replica


Probabilistically probe neighboring machines
Slower deterministic algorithm
OceanStore Updates
 Each update (or groups of updates) to an
object creates a new version
 Consistency is based on versioning


No need for backup
Pointers are permanent
OceanStore Objects
 An active object is the latest version of its
data
 An archival object is a permanent, read-only
version of the object


Encoded with an erasure code
Any m out of n fragments can reconstruct the
original data
 Can support either weak or strong
consistency models
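The "any m out of n fragments" property can be illustrated with a toy code. The sketch below uses simple XOR parity with m = 2 and n = 3; it is only an illustration, not the erasure code OceanStore uses, and the function names `encode` and `decode` are made up for this example.

```python
def _xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes) -> list:
    """Split data into 2 halves plus 1 XOR parity fragment (m=2, n=3)."""
    if len(data) % 2:
        data += b"\x00"  # toy padding; a real scheme would record the length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, _xor(a, b)]

def decode(fragments: dict) -> bytes:
    """Rebuild the data from any 2 of the 3 fragments, keyed by index 0..2."""
    if len(fragments) < 2:
        raise ValueError("need at least 2 of the 3 fragments")
    if 0 in fragments and 1 in fragments:
        a, b = fragments[0], fragments[1]
    elif 0 in fragments:
        a = fragments[0]
        b = _xor(a, fragments[2])  # recover second half from parity
    else:
        b = fragments[1]
        a = _xor(b, fragments[2])  # recover first half from parity
    return a + b

frags = encode(b"OceanStore")
recovered = decode({0: frags[0], 2: frags[2]})  # fragment 1 is lost
```

Losing any single fragment still leaves enough information to rebuild the original bytes, which is the property the archival form relies on.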
Applications
 Groupware: calendar, email, contact lists,
distributed design tools


Allow concurrent updates
Provide ways to merge information and detect
conflicts
Applications
 Digital libraries




Require massive quantities of storage
Replication for durability and availability
Deep archival storage to survive disaster
Seamless migration of data to where it is
needed
 Sensor data aggregation and dissemination
Naming
 GUID: pseudo-random fixed-length bit string
 Naming facility



Decentralized
Self-certifying path names
GUID = hash(user key, file name)
 Multiple roots in OceanStore
 GUID of a server is a secure hash of its key
 GUID of a data fragment is a secure hash of
the data content
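A minimal sketch of the two naming rules above, assuming SHA-1 as the secure hash and plain concatenation of key and name. Both choices, and the function names `object_guid` and `fragment_guid`, are assumptions made for illustration.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> str:
    """GUID = hash(user key, file name); SHA-1 is an assumed hash choice."""
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()

def fragment_guid(content: bytes) -> str:
    """GUID of a data fragment: a secure hash of the data content."""
    return hashlib.sha1(content).hexdigest()

# Hypothetical key bytes and file name, for illustration only.
guid = object_guid(b"example-public-key", "notes/todo.txt")
```

Because a fragment's GUID is derived from its content, anyone who retrieves the fragment can recompute the hash and check that the data was not altered.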
Access Control
 Reader restriction


Encrypt all data
Revocation



Delete all replicas
Encrypt all replicas with a new key
A server can use old keys to access cached old
data
Access Control
 Writer restriction

Writes are signed
 Reads are restricted at clients
 Writes are restricted at servers
Data Location and Routing
 Objects can reside on any of the OceanStore
servers
 Use query routing to locate objects
Distributed Routing in OceanStore
 Every object is identified by one or more
GUIDs
 Different replicas of the same object have the
same GUID
 OceanStore messages are labeled with



A destination GUID (built on top of IP)
A random number
A small predicate
Bloom Filters
 Based on the idea of hill-climbing
 If a query cannot be satisfied by a server,
local information is used to route the query to
a likely neighbor

Via a modified version of a Bloom filter
Bloom Filter
 A Bloom filter



Represents a set S = {S1, …, Sn}
Is stored as an m-bit array, filter[m], initially all 0
Uses r independent hash functions h1…hr
 for i = 1…n
for j = 1…r
filter[hj(Si)] = 1
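The construction loop above can be sketched as a small class. This is an illustrative implementation only: deriving h1…hr by salting SHA-256 with the index j is an assumption, and the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit array probed by r salted hash functions."""

    def __init__(self, m: int, r: int) -> None:
        self.m = m
        self.r = r
        self.bits = [0] * m

    def _positions(self, item: str):
        # Assumption: h_j(item) is SHA-256 of "j:item", reduced modulo m.
        for j in range(self.r):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set filter[h_j(item)] = 1 for j = 1..r.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True only if every probed bit is set; may be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, r=3)
for word in ["alpha", "beta"]:
    bf.add(word)
```

Inserted items are always reported as present; items that were never inserted are usually, but not always, reported as absent.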
Insertion Example
 m = 6, r = 3
 To insert word x




h1(x) = 0
h2(x) = 3
h3(x) = 5
filter[] = {1, 0, 0, 1, 0, 1}
Insertion Example
 m = 6, r = 3
 To insert word y




h1(y) = 1
h2(y) = 3
h3(y) = 5
filter[] = {1, 1, 0, 1, 0, 1}
Testing Example
 filter[] = {1, 1, 0, 1, 0, 1}
 Does x belong to the set?



filter[h1(x)] = filter[0] = 1
filter[h2(x)] = filter[3] = 1
filter[h3(x)] = filter[5] = 1
 Does z belong to the set?



filter[h1(z)] = filter[2] = 0  no
filter[h2(z)] = filter[3] = 1
filter[h3(z)] = filter[5] = 1
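The insertion and testing examples can be replayed directly. The sketch below hard-codes the hash values given on the slides for x, y, and z instead of computing real hashes; the helper names are illustrative.

```python
M = 6  # array size from the example (m = 6, r = 3)

# Hash values h1, h2, h3 for each word, taken from the slides.
HASHES = {
    "x": (0, 3, 5),
    "y": (1, 3, 5),
    "z": (2, 3, 5),
}

def insert(filt: list, word: str) -> None:
    """Set the r bits selected by the word's hash values."""
    for pos in HASHES[word]:
        filt[pos] = 1

def might_contain(filt: list, word: str) -> bool:
    """A word can be in the set only if all of its bits are 1."""
    return all(filt[pos] == 1 for pos in HASHES[word])

filt = [0] * M
insert(filt, "x")
after_x = list(filt)   # [1, 0, 0, 1, 0, 1]
insert(filt, "y")
after_y = list(filt)   # [1, 1, 0, 1, 0, 1]
```

For z, the first probe hits filter[2] = 0, so z is rejected even though its other two bits are set.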
False Positives
 If any probed bit filter[hj(x)] = 0, x is not in S
 If every probed bit filter[hj(x)] = 1, x is probably in S
 False positive rate depends on



Number of hash functions
Array size
Number of unique elements in S
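The three factors above combine in a commonly used approximation, (1 - e^(-rn/m))^r, which assumes the r hash functions are independent and uniform. The formula is not given on the slides; the sketch below only evaluates it, and the function name is illustrative.

```python
import math

def false_positive_rate(m: int, r: int, n: int) -> float:
    """Approximate false positive rate for m bits, r hashes, n elements."""
    return (1.0 - math.exp(-r * n / m)) ** r

# The earlier example: m = 6 bits, r = 3 hashes, n = 2 inserted words.
rate_small = false_positive_rate(6, 3, 2)    # roughly 0.25
rate_larger = false_positive_rate(60, 3, 2)  # a larger array lowers the rate
```

With such a small array the approximation is rough, but it shows the trend: growing m, or inserting fewer elements, reduces false positives.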
Attenuated Bloom Filters
 An attenuated Bloom filter of depth D is an
array of D normal Bloom filters
 ith Bloom filter is the union of all the Bloom
filters for all of the nodes at a distance i
 One filter per network edge
Attenuated Bloom Filters
 Lookup 11010
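A rough sketch of how a lookup might use attenuated filters, assuming each filter is stored as an integer bit mask and the query is routed along the edge whose filters match at the smallest distance. The neighbor names and filter contents are invented for illustration; only the query pattern 11010 comes from the slide.

```python
def first_matching_depth(levels: list, query: int):
    """Return the smallest distance i whose filter has every query bit set."""
    for depth, filt in enumerate(levels, start=1):
        if (filt & query) == query:
            return depth
    return None

def choose_next_hop(edges: dict, query: int):
    """Pick the outgoing edge whose filters match at the smallest distance."""
    best = None
    for neighbor, levels in edges.items():
        depth = first_matching_depth(levels, query)
        if depth is not None and (best is None or depth < best[1]):
            best = (neighbor, depth)
    return best

query = 0b11010
# One attenuated filter per edge; index 0 holds the distance-1 filter.
edges = {
    "n1": [0b10011, 0b11011],
    "n2": [0b01100, 0b11100, 0b11110],
}
hop = choose_next_hop(edges, query)  # ("n1", 2)
```

A match is only a hint, since each level can produce false positives; if no edge matches at any depth, the probabilistic search gives up and the slower deterministic algorithm takes over.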
The Global Algorithm: Wide-Scale
Distributed Data Location
 Plaxton’s randomized hierarchical distributed
data structure
 Resolve one digit of the node id at a time
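Digit-by-digit resolution can be sketched as follows, assuming IDs are hex strings matched from the last digit backwards and that a single function can see all nodes. A real deployment would instead use per-node neighbor tables; the node IDs here are illustrative.

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(nodes: list, start: str, target: str) -> list:
    """Hop toward target, matching at least one more suffix digit per hop."""
    path = [start]
    current = start
    while current != target:
        have = shared_suffix_len(current, target)
        candidates = [n for n in nodes if shared_suffix_len(n, target) > have]
        if not candidates:
            break  # no node is closer to the target ID
        # Prefer the smallest improvement, ideally exactly one more digit.
        current = min(candidates, key=lambda n: (shared_suffix_len(n, target), n))
        path.append(current)
    return path

nodes = ["0325", "B4F8", "9098", "7598", "4598"]
path = route(nodes, "0325", "4598")
```

Each hop fixes at least one more digit, so the path length is bounded by the number of digits in an ID.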
Achieving Locality
 Each new replica only needs to traverse
O(log(n)) hops to reach the root, where n is
the number of the servers
Achieving Fault Tolerance
 Avoid failures at roots
 Each root GUID is hashed with a small
number of different salt values
 Make it difficult to target a single GUID for
DoS attacks
 If failures are detected, just jump to any node
to reach the root
 OceanStore continually monitors and repairs
broken pointers
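The salting step can be sketched as hashing the GUID together with each salt value so that one object maps to several root IDs. SHA-1, one-byte salts, and the function name are assumptions made for illustration.

```python
import hashlib

def salted_root_ids(guid: bytes, salts=(0, 1, 2)) -> list:
    """Hash the GUID with each salt to get several independent root IDs."""
    return [hashlib.sha1(guid + bytes([salt])).hexdigest() for salt in salts]

# Hypothetical GUID bytes, for illustration only.
roots = salted_root_ids(b"example-object-guid")
```

Since the resulting IDs are unrelated to each other, the roots are likely to land on different nodes, so an attacker has to take down all of them to make the object unreachable.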
Advantages of Distributed Information
 Redundant paths to roots
 Scalable with a combination of probabilistic
and global algorithms
 Easy to locate and recover failed components
 Plaxton links form a natural substrate for
admission controls and multicasts
Achieving Maintenance-Free
Operation
 Recursive node insertion and removal
 Replicated roots
 Use beacons to detect faults
 Time-to-live fields to update routes
 Second-chance algorithm to avoid false
diagnoses of failed components

Avoid the cost of recovering lost nodes
 Automatic reconstruction of data for failed
servers
Update Model
 Conflict resolution update model
 Challenge:


Untrusted infrastructure
Access only to ciphertext
Update Format and Semantics
 An update: a list of predicates associated
with actions
 If any of the predicates evaluates to true,
the actions associated with the earliest true
predicate are applied atomically
 Everything is logged
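The update semantics above can be sketched in a few lines. The state layout, the `apply_update` function, and the log format are hypothetical; atomicity is modeled by running the actions on a copy and returning it only if all of them succeed.

```python
import copy

def apply_update(state: dict, update: list, log: list) -> dict:
    """Apply the actions of the earliest true predicate; return the new state.

    `update` is a list of (predicate, actions) pairs. Actions run on a copy,
    so the caller sees either all of their effects or none.
    """
    for index, (predicate, actions) in enumerate(update):
        if predicate(state):
            draft = copy.deepcopy(state)
            for action in actions:
                action(draft)
            log.append(("commit", index))
            return draft
    log.append(("abort", None))  # no predicate held; still logged
    return state

def append_block(block: str):
    """Build an action that appends a block and bumps the version."""
    def action(s: dict) -> None:
        s["blocks"].append(block)
        s["version"] += 1
    return action

state = {"version": 3, "blocks": ["a", "b"]}
log = []
update = [
    (lambda s: s["version"] == 2, [append_block("stale")]),
    (lambda s: s["version"] == 3, [append_block("c")]),
]
state = apply_update(state, update, log)
```

Here the first predicate is false, so the second pair commits; an update whose predicates are all false leaves the object unchanged and is recorded as an abort.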
Extending the Model to Work over
Ciphertext
 Supported predicates
Compare version (unencrypted metadata)
Compare size (unencrypted metadata)
Compare block: compares a hash of the encrypted block
Search: returns only yes/no, and cannot be initiated by the server
 Supported actions
Replace/insert/delete/append block
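A compare-block predicate over ciphertext might look like the sketch below: the client sends the hash of the ciphertext block it expects, and the server checks it against what it stores without ever seeing plaintext. SHA-256, the function names, and the sample bytes are all assumptions.

```python
import hashlib

def block_digest(ciphertext_block: bytes) -> str:
    """Hash of an encrypted block; computable without the decryption key."""
    return hashlib.sha256(ciphertext_block).hexdigest()

def compare_block(stored_blocks: list, index: int, expected_digest: str) -> bool:
    """Server side: true only if the stored block has the expected hash."""
    return block_digest(stored_blocks[index]) == expected_digest

# Made-up ciphertext bytes held by an untrusted server.
stored = [b"\x8f\x02cipher-0", b"\x11\xa9cipher-1"]
# The client's cached copy of block 1, still encrypted.
client_view = b"\x11\xa9cipher-1"
matches = compare_block(stored, 1, block_digest(client_view))
```

The server learns only whether the block matched, which is enough to guard an update against concurrent changes to that block.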
Serializing Updates in an Untrusted
Infrastructure
 Use a small primary tier of replicas to
serialize updates

Minimize communication
 Meanwhile, a secondary tier of replicas
optimistically propagate updates among
themselves
 Final ordering from the primary tier is multicast
to secondary replicas
A Direct Path to Clients and Archival
Storage
 Updates flow directly from a client to the
primary tier, where they are serialized and
then multicast to the secondary servers
 Updates are tightly coupled with archival

Archival fragments are generated at
serialization time and distributed with updates
Efficiency of the Consistency Protocol
 For updates > 4Kbytes, network overhead <
100%
 Approximate latency per update < 1 second
Deep Archival Storage
 Erasure encoded block fragments
 Use small and widely distributed fragments to
increase reliability
 Administrative domains are ranked by their
reliability and trustworthiness
 Avoid locations with correlated failures
The OceanStore API
 Session: a sequence of reads and writes to
potentially different objects
 Session guarantees: define the level of
consistency
 Updates
 Callback: for user defined events (commit)
 Façade: an interface to the conventional API

UNIX file system, transactional databases,
WWW gateways
Introspection
 Observation modules monitor the activity of a
running system and track system behavior
 Optimization modules adjust the computation
(Figure: introspection cycle linking computation, observation, and optimization)
Uses of Introspection
 Cluster recognition

Identify related files
 Replica management


Adjust replication factors
Migrate floating replicas
Related Work
 Space/time trade-offs in hash coding with
allowable errors. In Communications of the
ACM, 13(7), pp. 422-426, July 1970
 The Bayou architecture: Support for data
sharing among mobile users. In Proc. of
IEEE Workshop on Mobile Computing
Systems and Applications, Dec 1994
Related Work
 A tutorial on Reed-Solomon coding for
fault-tolerance in RAID-like systems. Software
Practice and Experience, 27(9), pp. 995-1012,
September 1997
 Accessing nearby copies of replicated objects
in a distributed environment. In Proc. of ACM
SPAA, June 1997
 Search on encrypted data. IEEE SRSP, May
2000