OceanStore
Global-Scale Persistent Storage
Ying Lu

Give Credits
• Many slides are from John Kubiatowicz, University of California at Berkeley
• I have modified them and added new slides

Motivation
• Personal Information Management is the Killer App
  – Information management, analysis, aggregation, dissemination, filtering for the individual
• Information Technology as a Utility
  – Continuous service delivery, on a planetary scale, on top of a highly dynamic information base

OceanStore Context: Ubiquitous Computing (I)
• Computing everywhere:
  – Desktop, Laptop, Palmtop, Cars, Cellphones
  – Shoes? Clothing? Walls?
• Connectivity everywhere:
  – Rapid growth of bandwidth in the interior of the net
  – Broadband to the home and office
  – Wireless technologies such as CDMA, satellite, laser

OceanStore Context: Ubiquitous Computing (II)
• Rise of the thin-client metaphor:
  – Services provided by interior of network
  – Incredibly thin clients on the leaves
    • MEMS devices: sensors + CPU + wireless net in 1 mm^3
• Mobile society: people move and devices are disposable

What do we need for personal information management?

Questions about information:
• Where is persistent information stored?
  – 20th-century tie between location and content is outdated
• How is it protected?
  – Can a disgruntled employee of an ISP sell your secrets?
  – Can’t trust anyone (how paranoid are you?)
• Can we make it indestructible?
  – Want our data to survive “the big one”!
  – Highly resistant to hackers (denial of service)
  – Wide-scale disaster recovery
• Is it hard to manage?
  – Worst failures are human-related
  – Want automatic (introspective) diagnosis and repair

First Observation: Want Utility Infrastructure
• Mark Weiser from Xerox: Transparent computing is the ultimate goal
  – Computers should disappear into the background
• In storage context:
  – Don’t want to worry about backup, obsolescence
  – Need lots of resources to make data secure and highly available, BUT don’t want to own them
  – Outsourcing of storage already very popular
• Pay monthly fee and your “data is out there”
  – Simple payment interface: one bill from one company

Second Observation: Need wide-scale deployment
• Many components with geographic separation
  – System not disabled by natural disasters
  – Can adapt to changes in demand and regional outages
• Wide-scale use and sharing also requires wide-scale deployment
  – Bandwidth increasing rapidly, but latency bounded by speed of light
• Handling many people with same system leads to economies of scale

OceanStore: Everyone’s data, One big Utility
“The data is just out there”
• Separate information from location
  – Locality is only an optimization (an important one!)
  – Wide-scale coding and replication for durability
• All information is globally identified
  – Unique identifiers are hashes over names & keys
  – Single uniform lookup interface
  – No centralized namespace required

Amusing back of the envelope calculation (courtesy Bill Bolotsky, Microsoft)
• How many files in the OceanStore?
  – Assume 10^10 people in world
  – Say 10,000 files/person (very conservative?)
  – So 10^14 files in OceanStore!
  – If 1 gig files (not likely), get 10^23 bytes, on the order of 1 mole of bytes!
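The arithmetic in this estimate can be checked directly. The sketch below is only an illustration: it assumes "1 gig" means 10^9 bytes, and the variable names are made up for this example. It compares the resulting byte count with Avogadro's number.

```python
# Back-of-the-envelope check of the OceanStore size estimate.
# Assumption: "1 gig" is taken as 10**9 bytes per file.
PEOPLE = 10**10            # people in the world (slide's assumption)
FILES_PER_PERSON = 10**4   # 10,000 files per person
BYTES_PER_FILE = 10**9     # 1 gigabyte per file (assumed decimal)
AVOGADRO = 6.02214076e23   # entities per mole (exact SI value)

total_files = PEOPLE * FILES_PER_PERSON
total_bytes = total_files * BYTES_PER_FILE
moles_of_bytes = total_bytes / AVOGADRO

print(total_files)                # 100000000000000
print(f"{total_bytes:.0e} bytes")  # 1e+23 bytes
print(f"{moles_of_bytes:.2f} mole of bytes")  # 0.17 mole of bytes
```

Under these assumptions the total is 10^23 bytes, about one sixth of a mole: the same order of magnitude as Avogadro's number, which is the point of the slide.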
Truly impressive number of elements… but small relative to physical constants

Utility-based Infrastructure
[Figure: utility infrastructure spanning providers labeled Canadian OceanStore, Sprint, AT&T, Pac Bell, IBM]
• Service provided by confederation of companies
  – Monthly fee paid to one service provider
  – Companies buy and sell capacity from each other

Outline
• Motivation
• Properties of the OceanStore
• Specific Technologies and approaches:
  – Naming and Data Location
  – Conflict resolution on encrypted data
  – Replication and Deep archival storage
  – Introspective computing for optimization and repair
  – Economic models
• Conclusion

Ubiquitous Devices Ubiquitous Storage
• Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc.
• Properties REQUIRED for OceanStore storage substrate:
  – Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks
  – Coherence: too much data for naïve users to keep coherent “by hand”
  – Automatic replica management and optimization: huge quantities of data cannot be managed manually
  – Simple and automatic recovery from disasters: probability of failure increases with size of system
  – Utility model: world-scale system requires cooperation across administrative boundaries

OceanStore Technologies I: Naming and Data Location
• Requirements:
  – System-level names should help to authenticate data
  – Route to nearby data without global communication
  – Don’t inhibit rapid relocation of data
• OceanStore approach: Two-level search with embedded routing
  – Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1)
  – Search process combines quick, probabilistic search with slower guaranteed search

Universal Location Facility
• Takes 160-bit unique identifier (GUID) and returns the nearest object that matches
[Figure: name resolution diagram. Labels: Universal Name; Floating Replica Name; OID; Version OID; Active Data; Global Object Resolution; Root Structure; Update OID; Archive versions (Version OID1, Version OID2, Version OID3); Erasure Coded; Archival copy or snapshot; Commit Checkpoint Logs OID]

Routing
Two-tiered approach
• Fast probabilistic routing algorithm
  – Entities that are accessed frequently are likely to reside close to where they are being used (ensured by introspection)
  – Self-optimizing
• Slower, guaranteed hierarchical routing method

Probabilistic Routing Algorithm
[Figure: query for X (11010) routed among nodes n1 to n4 using 5-bit Bloom filters (bit positions 0 to 4) and attenuated Bloom filter arrays whose depth levels are labeled 1st, 2nd, 3rd; objects shown with the bits they set: X (0,1,3), Y (0,1,4), z (1,3,4), M (0,2,4); callouts: self-optimizing, self-protecting]
• Bloom filter on each node; attenuated Bloom filter on each directed edge.

Hierarchical Routing Algorithm
• Based on Plaxton scheme
• Every server in the system is assigned a random node-ID
• Object’s root
  – Each object is mapped to a single node whose node-ID matches the object’s GUID in the most bits (starting from the least significant)
• Information about the GUID (such as location) is stored at its root

Construct Plaxton Mesh
[Figure: Plaxton mesh construction with links labeled by level 1 to 4; node IDs shown include 0324, 3714, 0265, 1215, 2344, 5724, 9834, 1624, 7144, 1324, with wildcard entries x431, x633, x742, x927]

Basic Plaxton Mesh
Incremental suffix-based routing
[Figure: mesh routing toward GUID 0x43FE with links labeled by level 1 to 4 and points a to e; node IDs shown: 0x79FE, 0x23FE, 0x993E, 0x73FE, 0x44FE, 0x43FE, 0x035E, 0xF990, 0x555E, 0x73FF, 0xABFE, 0x13FE, 0x423E, 0x9990, 0x04FE, 0x239E, 0x1290]

Use of Plaxton Mesh
Randomization and Locality

OceanStore Enhancements of the Plaxton Mesh
• Documents have multiple roots (salted hash of GUID)
• Each node has multiple neighbor links
• Searches proceed along multiple paths
  – Tradeoff
between reliability, performance and bandwidth?
• Dynamic node insertion and deletion algorithms
  – Continuous repair and incremental optimization of links
[Callouts: self-healing, self-configuration, self-optimizing]

OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure
• Requirements:
  – Scalable coherence mechanism which can operate directly on encrypted data without revealing information
  – Handle Byzantine failures
  – Rapid dissemination of committed information
• OceanStore Approach:
  – Operations-based interface using conflict resolution
    • Modeled after Xerox Bayou: update packets include predicate/update pairs which operate on encrypted data
  – User signs updates and principal party signs commits
  – Committed data multicast to clients

Update Model
• Concurrent updates w/o wide-area locking
  – Conflict resolution
• Updates Serialization
• A master replica?
• Role of primary tier of replicas
  – All updates are submitted to the primary tier of replicas, which chooses a final total order by following a Byzantine agreement protocol
• A secondary tier of replicas
  – The result of the updates is multicast down the dissemination tree to all the secondary replicas

Tentative Updates: Epidemic Dissemination

Committed Updates: Multicast Dissemination

Data Coding Model
• Two distinct forms of data: active and archival
• Active Data in Floating Replicas
  – Latest version of the object
• Archival Data in Erasure Coded Fragments
  – A permanent, read-only version of the object
  – During commit, previous version coded with erasure-code and spread over 100s or 1000s of nodes
  – Advantage: any 1/2 or 1/4 of fragments regenerates data

Floating Replica and Deep Archival Coding
[Figure: floating replicas, each holding a full copy, version pointers (Ver1: 0x34243, Ver2: 0x49873, Ver3: …) and conflict resolution logs, alongside erasure-coded fragments]

Proactive Self-Maintenance
• Continuous testing
and repair of information
  – Slow sweep through all information to make sure there are sufficient erasure-coded fragments
  – Continuously reevaluate risk and redistribute data
  – Slow sweep and repair of metadata/search trees
• Continuous online self-testing of HW and SW
  – Detects flaky, failing, or buggy components via:
    • fault injection: triggering hardware and software error handling paths to verify their integrity/existence
    • stress testing: pushing HW/SW components past normal operating parameters
    • scrubbing: periodic restoration of potentially “decaying” hardware or software state
  – Automates preventive maintenance

OceanStore Technologies IV: Introspective Optimization
• Requirements:
  – Reasonable job on global-scale optimization problem
    • Take advantage of locality whenever possible
    • Sensitivity to limited storage and bandwidth at endpoints
  – Repair of data structures, increasing of redundancy
  – Stability in chaotic environment: Active Feedback
• OceanStore Approach:
  – Introspective monitoring and analysis of relationships to cluster information by relatedness
  – Time-series analysis of user and data motion
  – Rearrangement and replication in response to monitoring
    • Clustered prefetching: fetch related objects
    • Proactive prefetching: get data there before needed

Example: Client Introspection
• Client observer and optimizer components
  – Greedy agents working on behalf of the client
    • Watch client activity and combine it with historical info
    • Perform clustering and time-series analysis
    • Forward results to infrastructure (privacy issues!)
  – Monitoring state of network to adapt behaviour
• Typical Actions:
  – Cluster related files together
  – Prefetch files that will be needed soon
  – Create/destroy floating replicas

OceanStore Conclusion
• The Time is now for a Universal Data Utility
  – Ubiquitous computing and connectivity is (almost) here!
  – Confederation of utility providers is right model
• OceanStore holds all data, everywhere
  – Local storage is a cache on global storage
  – Provides security in an untrusted infrastructure
  – Large scale system has good statistical properties
    • Use of introspection for performance and stability
    • Quality of individual servers enhances reliability
• Exploits economies of scale to:
  – Provide high-availability and extreme survivability
  – Lower maintenance cost:
    • self-diagnosis and repair
    • Insensitivity to technology changes: Just unplug one set of servers, plug in others
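
Sketch: incremental suffix-based routing
The hierarchical routing slides describe mapping each object to the node whose ID shares the longest low-order match with the object's GUID, and reaching that root by matching one more suffix position per hop. The Python sketch below illustrates that idea under loud assumptions: IDs are 4-digit hex strings taken from the Basic Plaxton Mesh figure, matching is done per hex digit rather than per bit, every node can see the full node list (a real Plaxton node only holds per-level neighbor tables), and ties are broken by the smallest ID. The function names are hypothetical, not OceanStore's API.

```python
from typing import List


def shared_suffix_len(a: str, b: str) -> int:
    """Count how many trailing hex digits a and b have in common."""
    n = 0
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        n += 1
    return n


def root_for(guid: str, node_ids: List[str]) -> str:
    """Pick the node with the longest suffix match (smallest ID on ties)."""
    return min(node_ids, key=lambda nid: (-shared_suffix_len(nid, guid), nid))


def route(start: str, guid: str, node_ids: List[str]) -> List[str]:
    """Hop to nodes matching ever longer suffixes of guid until none is better.

    Simplification: scans a global node list instead of local neighbor tables.
    """
    path = [start]
    current = start
    while True:
        level = shared_suffix_len(current, guid)
        better = [n for n in node_ids if shared_suffix_len(n, guid) > level]
        if not better:
            return path
        # Prefer the smallest improvement, then the smallest ID (assumed tie-break).
        current = min(better, key=lambda n: (shared_suffix_len(n, guid), n))
        path.append(current)


# Node IDs as they appear in the Basic Plaxton Mesh figure (0x prefix dropped).
NODES = ["79FE", "23FE", "993E", "73FE", "44FE", "43FE", "035E", "F990", "555E",
         "73FF", "ABFE", "13FE", "423E", "9990", "04FE", "239E", "1290"]

print(route("1290", "43FE", NODES))
# ['1290', '035E', '04FE', '13FE', '43FE']
```

Each hop in the printed path matches one more trailing digit of the GUID (E, FE, 3FE, 43FE). When no node carries the exact GUID, for example `root_for("53FE", NODES)`, the sketch falls back to the node with the longest suffix match, which plays the role of the object's root.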