OceanStore: An Architecture for Global-Scale Persistent Storage

 Vision: ubiquitous computing devices
 Goal: transparency
Where to store persistent information?
How to protect against system failures?
How to upgrade components without losing
configuration info?
How to manage consistency?
 Requirements
 Intermittent connectivity
 Secure from theft and denial-of-service
 Durable information
Information divorced from location
Automatic and reliable archival services
Geographically distributed servers
Caching close to clients
Information can migrate to wherever it is needed
Scale: 10^10 users, each with 10,000 files
OceanStore: A True Data Utility
 Utility model: consumers pay a monthly fee
in exchange for access to persistent storage
Highly available data from anywhere
Automatic replication for disaster recovery
Strong security
Providers would buy and sell capacity among
themselves for mobile users
Deep archival storage: use excess storage
space to ease data management
Two Unique Goals
 Use untrusted infrastructure
May crash without warning
Encrypted information in the infrastructure
Responsible party is financially responsible for
the integrity of data
 Support nomadic data
Data can be cached anywhere, anytime
Continuous introspective monitoring to
manage caching and locality
System Overview
 The fundamental unit in OceanStore: a
persistent object
Named by a globally unique identifier (GUID)
Replicated and stored on multiple servers
Independent of the server (floating replicas)
 Two mechanisms to locate a replica
Probabilistically probe neighboring machines
Slower deterministic algorithm
OceanStore Updates
 Each update (or group of updates) to an
object creates a new version
 Consistency is based on versioning
No need for backup
Pointers are permanent
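The versioning idea on this slide can be illustrated with a minimal sketch. The class name `VersionedObject` and its methods are hypothetical and not part of OceanStore; the sketch only assumes that every update appends a new read-only version and that earlier versions stay reachable by number.

```python
from typing import List, Optional


class VersionedObject:
    """Append-only history: every update yields a new version (illustrative only)."""

    def __init__(self, initial: bytes) -> None:
        self._versions: List[bytes] = [initial]  # list index = version number

    def update(self, new_data: bytes) -> int:
        """Record a new version and return its number; older versions are kept."""
        self._versions.append(new_data)
        return len(self._versions) - 1

    def read(self, version: Optional[int] = None) -> bytes:
        """Read a specific version, or the latest one when no number is given."""
        if version is None:
            version = len(self._versions) - 1
        return self._versions[version]


obj = VersionedObject(b"draft")
v1 = obj.update(b"draft, revised")
```

Because old versions are never overwritten, a reference to version 0 stays valid after the update, which is one way to read "pointers are permanent".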
OceanStore Objects
 An active object is the latest version of its data
 An archival object is a permanent, read-only
version of the object
Encoded with an erasure code
Any m out of n fragments can reconstruct the
original data
 Can support either weak or strong
consistency models
 Groupware: calendar, email, contact lists,
distributed design tools
Allow concurrent updates
Provide ways to merge information and detect conflicts
 Digital libraries
Require massive quantities of storage
Replication for durability and availability
Deep archival storage to survive disaster
Seamless migration of data to where it is needed
 Sensor data aggregation and dissemination
 GUID: pseudo-random fixed-length bit string
 Naming facility
Self-certifying path names
GUID = hash(user key, file name)
 Multiple roots in OceanStore
 GUID of a server is a secure hash of its key
 GUID of a data fragment is a secure hash of
the data content
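The naming rules above can be sketched as follows. The choice of SHA-256, the byte encoding, and the function names `object_guid` and `fragment_guid` are assumptions made for illustration; the slides only state that a GUID is a secure hash of the listed inputs.

```python
import hashlib


def object_guid(owner_key: bytes, name: str) -> str:
    """GUID = hash(user key, file name); SHA-256 is an assumed stand-in hash."""
    h = hashlib.sha256()
    h.update(owner_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


def fragment_guid(fragment: bytes) -> str:
    """GUID of a data fragment = secure hash of its content."""
    return hashlib.sha256(fragment).hexdigest()


guid_a = object_guid(b"alice-public-key", "notes.txt")
guid_b = object_guid(b"bob-public-key", "notes.txt")
frag = fragment_guid(b"some fragment bytes")
```

Since the owner's key is part of the hashed input, two users can pick the same file name without colliding, and a fragment's GUID lets any holder check that the content was not altered.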
Access Control
 Reader restriction
Encrypt all data
Delete all replicas
Encrypt all replicas with a new key
A server can use old keys to access cached old data
Access Control
 Writer restriction
Writes are signed
 Reads are restricted at clients
 Writes are restricted at servers
Data Location and Routing
 Objects can reside on any of the OceanStore servers
 Use query routing to locate objects
Distributed Routing in OceanStore
 Every object is identified by one or more GUIDs
 Different replicas of the same object have the
same GUID
 OceanStore messages are labeled with
A destination GUID (built on top of IP)
A random number
A small predicate
Bloom Filters
 Based on the idea of hill-climbing
 If a query cannot be satisfied by a server,
local information is used to route the query to
a likely neighbor
Via a modified version of a Bloom filter
Bloom Filter
 A Bloom filter
Represents a set S = {S1, …, Sn}
Is stored as an m-bit array, filter[m]
Uses r independent hash functions h1, …, hr
 for i = 1…n
    for j = 1…r
        filter[hj(Si)] = 1
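The insertion loop above can be turned into a small runnable sketch. The class name `BloomFilter` and the way the r hash functions are derived (SHA-256 salted with the index j) are assumptions for illustration; any family of independent hash functions would do.

```python
import hashlib
from typing import Iterator, List


class BloomFilter:
    """m-bit array with r hash functions (hash construction is an assumption)."""

    def __init__(self, m: int, r: int) -> None:
        self.m = m
        self.r = r
        self.bits: List[int] = [0] * m

    def _positions(self, item: str) -> Iterator[int]:
        # Hypothetical hash family: h_j(item) = SHA-256("j:item") mod m
        for j in range(self.r):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        """Set filter[h_j(item)] = 1 for every j."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        """True if every probed bit is set; a True answer may be a false positive."""
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64, r=3)
for word in ("alpha", "beta"):
    bf.add(word)
```

An inserted item always tests positive, so the structure never produces false negatives.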
Insertion Example
 m = 6, r = 3
 To insert word x
h1(x) = 0
h2(x) = 3
h3(x) = 5
filter[] = {1, 0, 0, 1, 0, 1}
Insertion Example
 m = 6, r = 3
 To insert word y
h1(y) = 1
h2(y) = 3
h3(y) = 5
filter[] = {1, 1, 0, 1, 0, 1}
Testing Example
 filter[] = {1, 1, 0, 1, 0, 1}
 Does x belong to the set?
filter[h1(x)] = filter[0] = 1
filter[h2(x)] = filter[3] = 1
filter[h3(x)] = filter[5] = 1
 Does z belong to the set?
filter[h1(z)] = filter[2] = 0, so z is not in the set
filter[h2(z)] = filter[3] = 1
filter[h3(z)] = filter[5] = 1
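The insertion and testing examples can be replayed in a few lines. The hash values for x, y, and z are taken directly from the slides and stored in a lookup table; the table `HASHES` and the helper names are hypothetical and stand in for real hash functions.

```python
M = 6  # array size from the example (m = 6, r = 3)

# Hash positions as given on the slides: word -> (h1, h2, h3)
HASHES = {"x": (0, 3, 5), "y": (1, 3, 5), "z": (2, 3, 5)}


def insert(filter_bits, word):
    """Set the bit at each of the word's hash positions."""
    for pos in HASHES[word]:
        filter_bits[pos] = 1


def probably_contains(filter_bits, word):
    """True only if every hash position of the word is set."""
    return all(filter_bits[pos] == 1 for pos in HASHES[word])


bits = [0] * M
insert(bits, "x")
after_x = list(bits)   # state after inserting x
insert(bits, "y")
after_y = list(bits)   # state after inserting y
x_found = probably_contains(bits, "x")
z_found = probably_contains(bits, "z")
```

Testing z stops being positive at position 2, the one bit that neither x nor y set.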
False Positives
 If any filter[hj(x)] = 0, x is definitely not in S
 If every filter[hj(x)] = 1, x is probably in S
 False positive rate depends on
Number of hash functions
Array size
Number of unique elements in S
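How the three factors interact can be estimated with the standard textbook formula, which is not stated on the slides and assumes the r hash functions spread items uniformly and independently over the m bits. The function names below are hypothetical.

```python
import math


def false_positive_rate(m: int, r: int, n: int) -> float:
    """Probability that all r probed bits are set after inserting n items."""
    bit_still_zero = (1.0 - 1.0 / m) ** (r * n)
    return (1.0 - bit_still_zero) ** r


def false_positive_rate_approx(m: int, r: int, n: int) -> float:
    """Common approximation (1 - e^(-r*n/m))^r, accurate when m is large."""
    return (1.0 - math.exp(-r * n / m)) ** r


small_array = false_positive_rate(m=1_000, r=3, n=1_000)
large_array = false_positive_rate(m=10_000, r=3, n=1_000)
```

Under these assumptions, a larger array lowers the rate for a fixed number of elements, while more elements raise it.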
Attenuated Bloom Filters
 An attenuated Bloom filter of depth D is an
array of D normal Bloom filters
 The i-th Bloom filter is the union of all the Bloom
filters for all of the nodes at a distance i
 One filter per network edge
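A minimal sketch of how a node might use its attenuated filters to pick a next hop is given below. It assumes each edge stores a list of D plain bit arrays (index 0 for distance 1, and so on) and that the query has already been hashed to a set of bit positions. The function names and the sample data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def first_matching_depth(levels: Sequence[Sequence[int]],
                         positions: Sequence[int]) -> Optional[int]:
    """Smallest distance (1-based) whose filter has every queried bit set."""
    for depth, bits in enumerate(levels, start=1):
        if all(bits[p] for p in positions):
            return depth
    return None


def best_edge(edges: Dict[str, List[List[int]]],
              positions: Sequence[int]) -> Optional[Tuple[str, int]]:
    """Neighbor whose attenuated filter matches at the smallest distance."""
    best: Optional[Tuple[str, int]] = None
    for neighbor, levels in edges.items():
        depth = first_matching_depth(levels, positions)
        if depth is not None and (best is None or depth < best[1]):
            best = (neighbor, depth)
    return best


# Two outgoing edges, each with a depth-2 attenuated filter over 5 bits.
edges = {
    "A": [[1, 0, 0, 0, 0], [1, 1, 0, 1, 0]],
    "B": [[1, 1, 0, 1, 0], [1, 1, 1, 1, 0]],
}
choice = best_edge(edges, positions=(0, 1, 3))
miss = best_edge(edges, positions=(2, 4))
```

When no edge matches at any depth, the probabilistic search has nothing to follow and the query would fall back to the slower deterministic algorithm.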
Attenuated Bloom Filters
 [Figure: lookup of 11010]
The Global Algorithm: Wide-Scale
Distributed Data Location
 Plaxton’s randomized hierarchical distributed
data structure
 Resolve one digit of the node id at a time
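Digit-by-digit resolution can be sketched as below. This is a simplification: it assumes digits are resolved from the rightmost digit, and it picks the next hop from a global list of node IDs, whereas a real Plaxton mesh keeps per-node neighbor tables and prefers nearby nodes. The node IDs and function names are illustrative.

```python
from typing import List


def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits that a and b have in common."""
    n = 0
    while n < len(a) and n < len(b) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def route(nodes: List[str], start: str, dest: str) -> List[str]:
    """Hop toward dest, matching at least one more trailing digit per hop."""
    if dest not in nodes:
        raise LookupError("destination is not a known node")
    path = [start]
    current = start
    while current != dest:
        have = shared_suffix_len(current, dest)
        candidates = [x for x in nodes if shared_suffix_len(x, dest) > have]
        # Prefer the candidate that resolves the fewest extra digits.
        current = min(candidates, key=lambda x: (shared_suffix_len(x, dest), x))
        path.append(current)
    return path


nodes = ["0325", "B4F8", "9098", "7598", "4598"]
path = route(nodes, start="0325", dest="4598")
```

Each hop in `path` shares a longer suffix with the destination than the hop before it, so the number of hops is bounded by the number of digits in an ID.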
Achieving Locality
 Each new replica only needs to traverse
O(log(n)) hops to reach the root, where n is
the number of the servers
Achieving Fault Tolerance
 Avoid failures at roots
 Each root GUID is hashed with a small
number of different salt values
 Make it difficult to target a single GUID for
DoS attacks
 If failures are detected, just jump to any node
to reach the root
 OceanStore continually monitors and repairs
broken pointers
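The salting idea can be sketched as follows, assuming SHA-256 and a simple "GUID:salt" encoding, neither of which is specified on the slides. Each salted hash is a separate identifier that would map to a different root node. The function name `salted_root_ids` is hypothetical.

```python
import hashlib
from typing import List, Sequence


def salted_root_ids(guid: str, salts: Sequence[int] = (0, 1, 2)) -> List[str]:
    """Hash the GUID with each salt to obtain several independent root IDs."""
    return [
        hashlib.sha256(f"{guid}:{salt}".encode("utf-8")).hexdigest()
        for salt in salts
    ]


roots = salted_root_ids("example-object-guid")
```

Since the salted identifiers are unrelated to one another, an attacker would have to disable every corresponding root, not just one, to make the object unreachable.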
Advantages of Distributed Information
 Redundant paths to roots
 Scalable with a combination of probabilistic
and global algorithms
 Easy to locate and recover failed components
 Plaxton links form a natural substrate for
admission controls and multicasts
Achieving Maintenance-Free Operation
 Recursive node insertion and removal
 Replicated roots
 Use beacons to detect faults
 Time-to-live fields to update routes
 Second-chance algorithm to avoid false
diagnoses of failed components
Avoid the cost of recovering lost nodes
 Automatic reconstruction of data for failed
Update Model
 Conflict resolution update model
 Challenge:
Untrusted infrastructure
Access only to ciphertext
Update Format and Semantics
 An update: a list of predicates associated
with actions
 If any of the predicates evaluates to true,
the actions associated with the earliest true
predicate are atomically applied
 Everything is logged
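The update semantics above can be sketched as follows. The representation is an assumption: an object is modeled as a plain dictionary, an update as a list of (predicate, actions) pairs, and atomicity is approximated by applying the actions to a copy before swapping it in. Recording an "abort" entry when no predicate holds is also this sketch's choice.

```python
import copy


def apply_update(obj, update, log):
    """Apply the actions of the earliest true predicate; log the outcome either way."""
    for index, (predicate, actions) in enumerate(update):
        if predicate(obj):
            staged = copy.deepcopy(obj)  # a failing action leaves obj untouched
            for action in actions:
                action(staged)
            obj.clear()
            obj.update(staged)
            log.append(("commit", index))
            return True
    log.append(("abort", None))
    return False


def version_is_3(obj):
    return obj["version"] == 3


def append_block(obj):
    obj["blocks"].append(b"new block")


def bump_version(obj):
    obj["version"] += 1


document = {"version": 3, "blocks": [b"old block"]}
log = []
update = [(version_is_3, [append_block, bump_version])]
first = apply_update(document, update, log)   # predicate holds
second = apply_update(document, update, log)  # version has moved on
```

Replaying the same update a second time fails its version check, which is how a predicate of this kind can detect a conflicting concurrent write.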
Extending the Model to Work over Ciphertext
 Supported predicates
Compare version (unencrypted metadata)
Compare size (unencrypted metadata)
Compare block
Compare a hash of the encrypted block
Returns only yes/no
Cannot be initiated by the server
Replace/insert/delete/append block
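The compare-block predicate can be sketched as below: the server holds only encrypted blocks and answers whether the hash of one of them equals a value supplied by the client. SHA-256 and the function names are assumptions, and the byte strings merely stand in for ciphertext.

```python
import hashlib
from typing import List


def block_hash(ciphertext_block: bytes) -> str:
    """Hash of an encrypted block; the server never needs the plaintext."""
    return hashlib.sha256(ciphertext_block).hexdigest()


def compare_block(stored_blocks: List[bytes], index: int, expected_hash: str) -> bool:
    """Server-side predicate: does the stored ciphertext block match the hash?"""
    if not 0 <= index < len(stored_blocks):
        return False
    return block_hash(stored_blocks[index]) == expected_hash


server_blocks = [b"\x8f\x02ciphertext-0", b"\x11\xa7ciphertext-1"]
client_view = block_hash(b"\x11\xa7ciphertext-1")  # client hashes the block it last read
unchanged = compare_block(server_blocks, 1, client_view)
stale = compare_block(server_blocks, 0, client_view)
```

The server learns only whether the block changed, which is consistent with an infrastructure that has access to ciphertext alone.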
Serializing Updates in an Untrusted Infrastructure
 Use a small primary tier of replicas to
serialize updates
Minimize communication
 Meanwhile, a secondary tier of replicas
optimistically propagates updates among themselves
 Final ordering from primary tier is multicast
to secondary replicas
A Direct Path to Clients and Archival
 Updates flow directly from a client to the
primary tier, where they are serialized and
then multicast to the secondary servers
 Updates are tightly coupled with archival
Archival fragments are generated at
serialization time and distributed with updates
Efficiency of the Consistency Protocol
 For updates > 4Kbytes, network overhead <
 Approximate latency per update < 1 second
Deep Archival Storage
 Erasure encoded block fragments
 Use small and widely distributed fragments to
increase reliability
 Administrative domains are ranked by their
reliability and trustworthiness
 Avoid locations with correlated failures
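Why many small fragments help can be estimated with a short calculation. It assumes each fragment is available independently with the same probability, which is the situation the slide aims for by avoiding correlated failures, and that any m of n fragments suffice. The function name and the example parameters are illustrative.

```python
from math import comb


def reconstruct_probability(n: int, m: int, p: float) -> float:
    """Probability that at least m of n fragments are available."""
    return sum(
        comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        for k in range(m, n + 1)
    )


# Same 2x storage overhead in both cases, each piece available 90% of the time.
two_replicas = reconstruct_probability(n=2, m=1, p=0.9)
erasure_coded = reconstruct_probability(n=32, m=16, p=0.9)
```

Under these assumptions, spreading the same storage budget over 32 fragments gives a higher chance of recovery than keeping two full replicas.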
The OceanStore API
 Session: a sequence of reads and writes to
potentially different objects
 Session guarantees: define the level of consistency
 Updates
 Callback: for user defined events (commit)
 Façade: an interface to the conventional API
UNIX file system, transactional databases,
WWW gateways
 Observation modules monitor the activity of a
running system and track system behavior
 Optimization modules adjust the computation
Uses of Introspection
 Cluster recognition
Identify related files
 Replica management
Adjust replication factors
Migrate floating replicas
Related Work
 Space/time trade-offs in hash coding with
allowable errors. In Communications of the
ACM, 13(7), pp. 422-426, July 1970
 The Bayou architecture: Support for data
sharing among mobile users. In Proc. of
IEEE Workshop on Mobile Computing
Systems and Applications, Dec 1994
Related Work
 A tutorial on Reed-Solomon coding for fault-
tolerance in RAID-like systems. Software
Practice and Experience, 27(9), pp. 995-1012, September 1997
 Accessing nearby copies of replicated objects
in a distributed environment. In Proc. of ACM
SPAA, June 1997
 Search on encrypted data. IEEE SRSP, May