OceanStore: An Architecture for Global-Scale Persistent Storage
John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, and Ben Zhao
Ελευθερία Φιλτζαντζίδη, 2002

OceanStore
Ubiquitous computing: cars, clothing, books, houses.
Computing devices must have high performance.
Computing devices should consume low power.
Computing devices must be transparent to the user.
Persistent information is necessary for transparency.
Where does persistent information reside?

OceanStore
Requirements for a persistent infrastructure:
  Connectivity is provided through cable modems, DSL, cell phones, and wireless data services.
  Information must be kept secure.
  Information must be extremely durable.
  Archiving of information should be automatic and reliable.
  Information must be divorced from location.
OceanStore is a utility infrastructure for persistent storage.

OceanStore
As a rough estimate, OceanStore will provide service to roughly 10^10 users, each with at least 10,000 files.
OceanStore must therefore support over 10^14 files.
Consumers pay a monthly fee in exchange for access to persistent storage services.
Companies buy and sell capacity from each other.
The core of the system is composed of a multitude of highly connected “pools”.

[Figure: The OceanStore system]

Two Unique Goals
Untrusted infrastructure:
  Servers may crash without warning or leak information to third parties.
  Only clients can be trusted.
  All information that enters the OceanStore is encrypted.
Nomadic data:
  Data is allowed to flow freely.
  Promiscuous caching: data can be cached anywhere, anytime.

Applications
Groupware and personal information management tools (calendars, email, contact lists, and distributed design tools).
OceanStore can be used to create large digital libraries and repositories for scientific data.
OceanStore provides an ideal platform for new streaming applications, such as sensor data aggregation and dissemination.

System Architecture
System Overview
Naming
Access Control
Data Location and Routing
Update Model
Deep Archival Storage
Introspection

System Overview
The fundamental unit is the persistent object.
Objects exist in both active and archival forms.
  Active object: the latest version of the object's data, together with a handle for update.
  Archival object: a permanent, read-only version of the object.
Archival objects are encoded with an erasure code.
The OceanStore API provides sessions, session guarantees, updates, and callbacks.
OceanStore provides an array of familiar interfaces, such as the Unix file system interface and a transactional interface.

Naming
Objects are identified by a GUID, a pseudo-random, fixed-length bit string.
An object's GUID is the secure hash of the owner’s key and some human-readable name (self-certifying path).
Certain objects act as directories, mapping human-readable names to GUIDs (SDSI).
A user can choose several directories as “roots”.
The system as a whole has no “roots”.

Access Control
Reader restriction:
  All data that is not completely public is encrypted.
  The encryption key is distributed to those users with read permission.
Writer restriction:
  All writes can be verified against an access control list (ACL).
  The owner of an object can choose the ACL x for an object foo by providing a signed certificate.

Data Location and Routing
OceanStore messages are labeled with a destination GUID, a random number, and a small predicate.
OceanStore combines data location and routing:
  The task of routing is handled by the aggregate resources of many different nodes.
  Messages route directly to their destinations.
  The underlying infrastructure has more up-to-date routing information.
The routing mechanism is a two-tiered approach:
  A probabilistic algorithm, tried first.
  A deterministic (global) algorithm, used when the probabilistic one fails.

Probabilistic Algorithm
It is fully distributed and uses a constant amount of storage per server.
It uses an attenuated Bloom filter, which is an array of D normal Bloom filters.
The first filter records the objects stored locally on the node.
The ith Bloom filter is the union of the Bloom filters of all nodes at distance i, through any path, from the current node.
Bloom filter: a method for representing a set A. It consists of a vector B of m bits and k hash functions h1, h2, ..., hk with range {1..m}.
For each element a of the set A, the bits at positions h1(a), h2(a), ..., hk(a) of the vector B are set to 1.
Given a query for b, we check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, b is certainly not in A; if all are 1, b is assumed to be in A, with some probability of a false positive.

[Figure: The Probabilistic Query Process]

The Global Algorithm
OceanStore uses a variation on Plaxton et al.’s hierarchical distributed data structure.
The basic Plaxton scheme:
  Every server in the system is assigned a random node-ID.
  Each link is labeled with a level number.
In OceanStore, each object is mapped to a single node whose node-ID matches the object’s GUID in the most bits. This node is called the object’s root.
The location of a replica is “published” in the infrastructure. This process requires O(log n) hops.

[Figure: The Global Algorithm. Panels: Inserting Object #62942; Searching Object #62942. Labels: Object Location, Search Client, Root Node.]

Update Model
The update model is based on conflict resolution, as in the Bayou system.
An update is a list of predicates associated with actions.
An update either commits or aborts.
Possible predicates:
  compare-version, compare-size: applied to the unencrypted metadata of an object.
  compare-block: possible because the encryption technology is a position-dependent block cipher.
  search: performed directly on the ciphertext.

Update Model
Operations applied to ciphertext:
  replace-block, append-block: rely on a position-dependent block cipher.
  insert-block, delete-block: rely on two sets of blocks, index blocks and data blocks.

Update Model
Someone must choose the final commit order of updates.
OceanStore uses two tiers of replicas.
Primary tier of replicas: Byzantine agreement protocol.
  A small number of replicas located in high-bandwidth, high-connectivity regions of the network.
  Stronger consistency guarantees.
Secondary tier of replicas: epidemic algorithm.
  Organized into dissemination trees.
  Contain both tentative and committed data.
  Order tentative updates in timestamp order.
  Lesser degree of consistency.

Update Model
After generating an update, a client sends it to the object’s primary tier, as well as to several random replicas for that object.
The primary tier performs the Byzantine agreement protocol.
The secondary replicas propagate the update among themselves epidemically.
The result is multicast down the dissemination tree.

Deep Archival Storage
The archival mechanism employs erasure codes (interleaved Reed-Solomon, Tornado codes).
Erasure coding treats data as a series of n fragments and transforms these fragments into a greater number of coded fragments.
The fragments are spread widely.
Any n of the coded fragments are sufficient to reconstruct the original data.
Fragmentation increases reliability and survivability.

Introspection
Introspection augments a system’s normal operation (computation) with observation and optimization.

[Figure: The Cycle of Introspection]

[Figure: An architecture for introspective systems in OceanStore]

Status
The first implementation is written in Java.
The authors use the Unix file system interface and a read-only proxy for the WWW.
They have explored the security guarantees that are required for OceanStore.

Included Components
A prototype for the probabilistic algorithm.
Prototype archival systems that use Reed-Solomon and Tornado codes for redundancy encoding.
An introspective prefetching mechanism for a local file system.
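
Example: attenuated Bloom filter lookup
The probabilistic algorithm described in the slides rests on Bloom filters arranged into an attenuated array of depth D. The Python sketch below shows one possible way to express that idea; it is not the OceanStore prototype. The class name BloomFilter, the function attenuated_lookup, the defaults m=1024 and k=3, and the use of salted SHA-256 to derive the k hash functions are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """A set summary: a vector of m bits and k hash functions."""

    def __init__(self, m=1024, k=3):
        self.m = m
        self.k = k
        self.bits = 0  # a Python int serves as the m-bit vector

    def _positions(self, item):
        # Assumption: derive k positions in [0, m) by salting SHA-256
        # with the index of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bits at all k positions for this item.
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True if all k bits are set; false positives are possible,
        # false negatives are not.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def union(self, other):
        # Union of two filters with identical m and k is a bitwise OR.
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged


def attenuated_lookup(levels, guid):
    """Return the smallest depth whose filter may contain guid, else None.

    levels[0] summarizes local objects; levels[i] summarizes objects on
    nodes at distance i from the current node.
    """
    for depth, bloom in enumerate(levels):
        if guid in bloom:
            return depth
    return None
```

A node would consult attenuated_lookup before forwarding a query: a result of 0 suggests the object is local, a larger value suggests how many hops away a replica may be, and None means the query should fall back to the global algorithm.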
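
Example: a toy erasure code
The archival slides state that any n of the coded fragments are enough to rebuild the original data. The sketch below illustrates that property with the simplest erasure code available, a single XOR parity fragment (n data fragments plus one parity fragment). This is a deliberately simplified stand-in, not Reed-Solomon or Tornado coding, and the function names encode and decode are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, n):
    """Split data into n equal-size fragments and append one XOR parity fragment."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")  # zero-pad to a multiple of n
    fragments = [padded[i * size:(i + 1) * size] for i in range(n)]
    parity = reduce(xor_bytes, fragments)
    return fragments + [parity]


def decode(fragments, length):
    """Rebuild the data from n + 1 slots in which at most one entry is None."""
    missing = [i for i, frag in enumerate(fragments) if frag is None]
    if len(missing) > 1:
        raise ValueError("a single parity fragment can repair only one loss")
    fragments = list(fragments)
    if missing:
        # XOR of all surviving fragments equals the lost fragment.
        present = [frag for frag in fragments if frag is not None]
        fragments[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity fragment and the zero padding.
    return b"".join(fragments[:-1])[:length]
```

With n = 4 this produces 5 fragments and survives the loss of any one of them. The codes named in the slides produce many more coded fragments than n and therefore tolerate many more losses.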