OCEANSTORE

OceanStore: An Architecture for
Global-Scale Persistent Storage
John Kubiatowicz, David Bindel, Yan Chen,
Steven Czerwinski, Patrick Eaton, Dennis
Geels, Ramakrishna Gummadi, Sean Rhea,
Hakim Weatherspoon, Westley Weimer,
Chris Wells, and Ben Zhao
Ελευθερία Φιλτζαντζίδη, 2002
OceanStore
Ubiquitous Computing: Car, Clothing, Books,
Houses.
Computing devices must have high performance.
Computing devices should consume low power.
Computing devices must be transparent to the user.
Persistent information is necessary for
transparency.
Where does persistent information reside?
OceanStore
Requirements for a persistent infrastructure.
 Connectivity through: cable-modems, DSL, cell phones and
wireless data services.
 Information must be kept secure.
 Information must be extremely durable.
 Archiving of information should be automatic and reliable.
 Information must be divorced from location.
OceanStore is a utility infrastructure for
persistent storage.
OceanStore
 As a rough estimate, OceanStore will provide service to
roughly 10^10 users, each with at least 10,000 files.
 OceanStore must therefore support over 10^14 files.
 Consumers will pay a monthly fee in exchange for access
to persistent storage services.
 Companies buy and sell capacity from each other.
 The core of the system is composed of a multitude of
highly connected “pools”.
The OceanStore system
Two Unique Goals
 Untrusted Infrastructure
 Servers may crash without warning or leak information to third
parties.
 Only clients can be trusted.
 All the information that enters the OceanStore is encrypted.
 Nomadic Data: Data that is allowed to flow freely.
 Promiscuous Caching: Data can be cached anywhere, anytime.
Applications
 Groupware and personal information management tools.
(calendars, email, contact lists and distributed design
tools)
 OceanStore can be used to create large digital libraries
and repositories for scientific data.
 OceanStore provides an ideal platform for new
streaming applications, such as sensor data aggregation
and dissemination.
System Architecture
System Overview
Naming
Access Control
Data Location and Routing
Update Model
Deep Archival Storage
Introspection
System Overview
 The fundamental unit is the persistent object
 Objects exist in both active and archival forms.
 Active object: the latest version of the object's data, together with a
handle for update.
 Archival Object:
 Permanent read-only version of the object.
 Archival Objects are encoded with an erasure code.
 The OceanStore API provides: sessions, session
guarantees, updates and callbacks.
 OceanStore provides an array of familiar interfaces, such
as the Unix file system interface and a transactional interface.
Naming
 Objects are identified by a GUID, a pseudo-random,
fixed-length bit string.
 An object GUID is the secure hash of the owner’s key
and some human-readable name (Self-certifying path).
 Certain objects act as directories, mapping human-readable names to GUIDs (SDSI).
 A user can choose several directories as “roots”. The
system as a whole has no “roots”.
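The naming scheme above can be sketched in a few lines. This is only an illustration: the choice of SHA-1, the length prefix on the key, and the helper names `object_guid` and `resolve` are assumptions made here, and directories are modeled as plain nested dictionaries rather than as OceanStore objects.

```python
import hashlib


def object_guid(owner_public_key: bytes, name: str) -> str:
    """Return a fixed-length GUID: a secure hash over the owner's key and a name."""
    h = hashlib.sha1()  # assumption: any secure hash with fixed-length output would do
    # Length-prefix the key so that different (key, name) pairs cannot
    # produce the same byte stream by concatenation.
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()


def resolve(root: dict, path: str) -> str:
    """Walk a user-chosen root directory, mapping human-readable names to a GUID."""
    node = root
    for part in path.strip("/").split("/"):
        node = node[part]  # each directory maps a name to a subdirectory or a GUID
    return node


guid = object_guid(b"alice-public-key", "notes.txt")
my_root = {"home": {"notes.txt": guid}}
print(resolve(my_root, "/home/notes.txt") == guid)
```

Because the GUID depends on the owner's key, two owners who pick the same human-readable name still get different GUIDs.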
Access Control
 Reader restriction
All data that is not completely public is encrypted.
The encryption key is distributed to those users with
read permission.
 Writer restriction
All writes can be verified against an access control list
(ACL).
An owner of an object can choose the ACL x for an
object foo by providing a signed certificate.
Data Location and Routing
 OceanStore messages are labeled with a destination
GUID, a random number, and a small predicate.
 OceanStore combines data location and routing
 The task of routing is handled by the aggregate resources of
many different nodes.
 Messages route directly to destinations.
 The underlying infrastructure has more up-to-date routing
information.
 Routing mechanism is a two-tiered approach.
 Probabilistic algorithm.
 Deterministic algorithm.
Probabilistic Algorithm
 Is fully distributed and uses a constant amount of
storage per server.
 Uses attenuated Bloom filters: an attenuated Bloom filter of
depth D is an array of D normal Bloom filters.
 The first filter contains the objects stored locally on the node.
 The ith Bloom filter is the union of the Bloom filters of all
the nodes at distance i, through any path, from the current node.
 Bloom filter: a method for representing a set A.
 It consists of a vector B of m bits and k hash functions h1, h2, ..., hk,
each with range {1..m}.
 For each element a of the set A, the bits at positions
h1(a), h2(a), ..., hk(a) of the vector B are set to 1.
 Given a query for b, we check the bits at positions h1(b), h2(b),
..., hk(b). If any of them is 0, b is not in A; otherwise b is
assumed to be in A, with some probability of a false positive.
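The Bloom filter definition above translates almost directly into code. The sketch below is a minimal version under stated assumptions: the class name `BloomFilter` is hypothetical, the k hash functions are derived by salting SHA-256 with the function index, and positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib


class BloomFilter:
    """A set representation using a vector of m bits and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item: str):
        # Assumption: simulate k independent hash functions by salting
        # a single hash with the function index i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        """Set the bits at positions h1(a), ..., hk(a) to 1."""
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item: str) -> bool:
        """False means definitely absent; True may be a false positive."""
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=1024, k=3)
bf.add("object-42")
print(bf.might_contain("object-42"))
```

A filter never reports a stored element as absent, which is why a miss can be trusted while a hit still has to be confirmed.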
The Probabilistic Query Process
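One way to picture the probabilistic query process is the sketch below. It rests on several assumptions that are not in the slides: filters are stored as integer bitmasks, "distance" means shortest-path hop count, attenuated filters are recomputed on demand instead of being stored at each node, and the names `bloom`, `matches`, `attenuated_filter`, and `route_query` are hypothetical.

```python
import hashlib
from collections import deque

M, K = 1024, 3  # hypothetical filter size in bits and number of hash functions


def bloom(items):
    """Build a Bloom filter (as an int bitmask) from an iterable of names."""
    bits = 0
    for item in items:
        for i in range(K):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % M)
    return bits


def matches(filt, item):
    probe = bloom([item])
    return (filt & probe) == probe


def attenuated_filter(graph, objects, node, depth):
    """Level i holds the union of the filters of all nodes i hops from `node`."""
    dist, queue, levels = {node: 0}, deque([node]), [0] * depth
    while queue:
        cur = queue.popleft()
        if dist[cur] >= depth:
            continue
        levels[dist[cur]] |= bloom(objects[cur])
        for nxt in graph[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return levels


def route_query(graph, objects, start, item, depth=3):
    """Forward the query toward the neighbor whose filter matches at the lowest level."""
    node, path = start, [start]
    for _ in range(len(graph) + 1):  # hop limit guards against false-positive loops
        if item in objects[node]:
            return path
        best = None
        for nb in graph[node]:
            for lvl, filt in enumerate(attenuated_filter(graph, objects, nb, depth)):
                if matches(filt, item):
                    if best is None or lvl < best[0]:
                        best = (lvl, nb)
                    break
        if best is None:
            return None  # no filter matches: fall back to the deterministic algorithm
        node = best[1]
        path.append(node)
    return None


graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
objects = {"A": {"a-file"}, "B": {"b-file"}, "C": {"doc"}}
print(route_query(graph, objects, "A", "doc"))
```

Returning `None` corresponds to the case where the probabilistic tier gives up and the query is handed to the global algorithm.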
The Global Algorithm
 OceanStore uses a variation on Plaxton et al.’s
hierarchical distributed data structure.
 The Basic Plaxton scheme
 Every server in the system is assigned a random node-ID.
 Each link is labeled with a level number.
 In OceanStore, each object is mapped to the single node
whose node-ID matches the object’s GUID in the most
bits. This node is called the object’s root.
 The location of a replica is “published” in the
infrastructure.
 This process requires O(log n) hops.
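The idea of resolving an identifier one digit at a time can be illustrated with a toy model. Everything below is an assumption made for illustration: IDs are equal-length digit strings, matching is done on trailing digits, ties are broken by the smallest node-ID, and the functions use global knowledge of all nodes, whereas a real Plaxton mesh uses per-node neighbor tables.

```python
def shared_suffix(a: str, b: str) -> int:
    """Number of trailing digits that two IDs have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def root_for(guid: str, nodes) -> str:
    """The node whose ID matches the GUID in the most trailing digits."""
    # max() keeps the first maximal element, so sorting breaks ties
    # in favor of the smallest node-ID (an arbitrary choice here).
    return max(sorted(nodes), key=lambda nid: shared_suffix(nid, guid))


def route(start: str, guid: str, nodes) -> list:
    """Each hop moves to a node that matches more trailing digits of the GUID."""
    root = root_for(guid, nodes)
    path, cur = [start], start
    while cur != root:
        level = shared_suffix(cur, guid)
        better = [n for n in sorted(nodes) if shared_suffix(n, guid) > level]
        # Take the smallest improvement to mimic resolving roughly one digit per hop.
        cur = min(better, key=lambda n: shared_suffix(n, guid)) if better else root
        path.append(cur)
    return path


nodes = {"1010", "2218", "9098", "7598", "4598", "3325"}
print(route("3325", "4598", nodes))
# prints ['3325', '2218', '9098', '7598', '4598']
```

Since every hop fixes at least one more digit, the path length is bounded by the number of digits in an ID, which grows logarithmically with the size of the ID space.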
The Global Algorithm
[Figure: inserting and searching for object #62942 in the routing mesh. Labels: Object Location, Search Client, Root Node.]
Update Model
 The update model is based on conflict resolution, as in the
Bayou system.
 Update: list of predicates associated with actions.
 Commit
 Abort
 Possible Predicates
 compare-version, compare-size: applied to the unencrypted metadata of an object.
 compare-block: possible when the encryption technology is a position-dependent block cipher.
 search: performed directly on the ciphertext.
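An update as a list of predicates paired with actions might look like the sketch below. It assumes that predicates are evaluated in order, that the actions of the first true predicate are applied atomically (commit), and that nothing changes otherwise (abort). The type `ObjectState` and the helpers `compare_version` and `append_block` are hypothetical, and encryption is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ObjectState:
    version: int = 0
    blocks: List[bytes] = field(default_factory=list)


Predicate = Callable[[ObjectState], bool]
Action = Callable[[ObjectState], None]


def compare_version(expected: int) -> Predicate:
    return lambda obj: obj.version == expected


def append_block(block: bytes) -> Action:
    return lambda obj: obj.blocks.append(block)


def apply_update(obj: ObjectState, update: List[Tuple[Predicate, List[Action]]]) -> bool:
    """Return True if the update commits, False if it aborts."""
    for predicate, actions in update:
        if predicate(obj):
            # Run the actions on a staged copy so that a failing action
            # leaves the real object untouched.
            staged = ObjectState(obj.version, list(obj.blocks))
            for action in actions:
                action(staged)
            obj.blocks = staged.blocks
            obj.version += 1
            return True  # commit
    return False  # abort: no predicate held, so no changes are applied


obj = ObjectState()
update = [(compare_version(0), [append_block(b"hello")])]
print(apply_update(obj, update))  # first attempt sees version 0
print(apply_update(obj, update))  # second attempt sees version 1
```

Replaying the same update fails the second time because the version predicate no longer holds, which is how conflicting writers are detected.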
Update Model
 Operations applied to ciphertext
 replace-block, append-block: possible with a position-dependent block cipher.
 insert block, delete block
 Two sets of blocks: index blocks, data blocks.
Update Model
 Someone must choose the final commit order of
updates.
 OceanStore uses two tiers of replicas
 A primary tier of replicas:
 Byzantine agreement protocol.
 Small number of replicas located in high bandwidth, high connectivity regions of the network.
 Stronger consistency guarantees.
 A secondary tier of replicas:
 Epidemic algorithm.
 They are organized into dissemination trees.
 Contain both tentative and committed data.
 Secondary replicas order tentative updates in timestamp order.
 Lesser degree of consistency.
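The difference between tentative and committed data on a secondary replica can be sketched as follows. The class `SecondaryReplica` and its method names are hypothetical, updates are reduced to string identifiers, and the tie-break on equal timestamps is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryReplica:
    committed: list = field(default_factory=list)  # final order chosen by the primary tier
    tentative: dict = field(default_factory=dict)  # update id -> client timestamp

    def receive_tentative(self, update_id: str, timestamp: int) -> None:
        """Record an update spread epidemically, before its final order is known."""
        if update_id not in self.committed:
            self.tentative[update_id] = timestamp

    def receive_commit(self, update_id: str) -> None:
        """Record a commit result pushed down the dissemination tree."""
        self.tentative.pop(update_id, None)
        if update_id not in self.committed:
            self.committed.append(update_id)

    def view(self) -> list:
        """Committed updates first, then tentative ones in timestamp order."""
        pending = sorted(self.tentative, key=lambda u: (self.tentative[u], u))
        return self.committed + pending


replica = SecondaryReplica()
replica.receive_tentative("u2", timestamp=5)
replica.receive_tentative("u1", timestamp=3)
print(replica.view())  # prints ['u1', 'u2']
replica.receive_commit("u2")
print(replica.view())  # prints ['u2', 'u1']
```

The second view shows the weaker consistency of this tier: the tentative order a reader saw earlier can differ from the order finally committed.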
Update Model
 After generating an update, a client sends it to the object’s primary tier, as well as to
several random replicas for that object.
 The primary tier performs the Byzantine agreement protocol, while the secondary replicas propagate
the update among themselves epidemically.
 The result is multicast down the dissemination tree.
Deep Archival Storage
 The archival mechanism employs erasure codes
(interleaved Reed-Solomon, Tornado codes).
 Erasure coding treats data as a series of n fragments and
transforms these fragments into a greater number of coded
fragments.
 The fragments are spread widely. Any n of the coded
fragments are sufficient to reconstruct the original data.
 Fragmentation increases reliability and survivability.
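The "any n fragments suffice" property can be demonstrated with the simplest possible erasure code. The sketch below is not Reed-Solomon or Tornado coding: it swaps in a single XOR parity fragment, which turns n fragments into n + 1 and survives the loss of only one. The function names `encode` and `decode` are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(fragments):
    """Turn n equal-length fragments into n + 1 coded fragments (last one is parity)."""
    parity = bytes(len(fragments[0]))
    for frag in fragments:
        parity = xor_bytes(parity, frag)
    return list(fragments) + [parity]


def decode(coded):
    """Rebuild the n data fragments from n + 1 slots, at most one of which is None."""
    missing = [i for i, frag in enumerate(coded) if frag is None]
    if len(missing) > 1:
        raise ValueError("single-parity code cannot recover more than one loss")
    if missing:
        size = len(next(frag for frag in coded if frag is not None))
        rebuilt = bytes(size)
        for frag in coded:
            if frag is not None:
                rebuilt = xor_bytes(rebuilt, frag)  # XOR of survivors equals the lost one
        coded = list(coded)
        coded[missing[0]] = rebuilt
    return coded[:-1]


data = [b"ocea", b"nsto", b"re!!"]
coded = encode(data)
coded[1] = None  # simulate a lost fragment
print(decode(coded) == data)
```

Codes such as Reed-Solomon generalize this idea so that many fragments can be lost, at the cost of more involved arithmetic.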
Introspection
 Introspection augments a system’s normal operation
(computation) with observation and optimization.
The Cycle of Introspection
An architecture for introspective systems in
OceanStore
Status
 The first implementation is written in Java.
 They use the Unix file system interface and a read-only
proxy for the WWW.
 They have explored the security guarantees that are
required for the OceanStore.
 Included Components
 A prototype for the probabilistic algorithm.
 Prototype archival systems that use Reed-Solomon and Tornado
codes for redundancy encoding.
 An introspective prefetching mechanism for a local file system.