OceanStore: An Infrastructure for Global-Scale Persistent Storage John Kubiatowicz, David Bindel, Yan Chen, Steven Czerwinski, Patrick Eaton, Dennis Geels, Ramakrishna Gummadi, Sean Rhea, Hakim Weatherspoon, Westley Weimer, Chris Wells, Ben Zhao A few slides have been borrowed from the authors’ presentations Vision • What is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” Source: Berkeley OceanStore Website Vision • What is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” • data • all kinds of information • desktop, laptop, palmtop • cars, cellular phones, other devices • futuristic: embedded in environment Vision • What is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” • persistence • devices can be rebooted, lost, replaced • reliable, durable data (“deep archival” will last forever) • Automatic maintenance Vision What is Oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” • connectivity • even to tiniest devices, possibly intermittent • variable bandwidth, latency • availability • uniform access, comparable to LAN-based networked storage • fault-tolerant, DoS-tolerant Vision • what is oceanstore? • “a utility infrastructure to span the globe and provide continuous access to persistent information” • scale • geographically distributed • 1010 users • 1014 files / objects Questions about information: • Where is persistent information stored? • 20th-century tie between location and content outdated • In world-scale system, locality is key • How is it protected? • Can disgruntled employee of ISP sell your secrets? • Can’t trust anyone (how paranoid are you?) • Can we make it indestructible? • Want our data to survive “the big one”! • Highly resistant to hackers (denial of service) • Wide-scale disaster recovery • Is it hard to manage? • Worst failures are human-related • Want automatic (introspective) diagnosis and repair First Observation: Want Utility Infrastructure • Mark Weiser from Xerox: Transparent computing is the ultimate goal. Computers should disappear into the background • In the context of storage: • Don’t want to worry about backup • Don’t want to worry about obsolescence • Need lots of resources to make data secure and highly available, BUT don’t want to own them • Outsourcing of storage already becoming popular • Pay monthly fee and your “data is out there” Utility-based Infrastructure Canadian OceanStore Sprint AT&T Pac Bell IBM IBM • Service provided by confederation of companies • Monthly fee paid to one service provider • Companies buy and sell capacity from each other Target applications Email Group calendar, contacts Distributed design tools Computer Supported Cooperative Work Digital libraries Distributed/shared repositories Assumptions • Untrusted infrastructure • • • • A small number of servers may crash or leak information most of the servers functioning correctly financially “responsible party” of servers ensure integrity but only clients trusted with cleartext • Nomadic data • • • • • data divorced from location flows freely within the storage infrastructure promiscuous caching: “anywhere, anytime” location important for performance dynamic system tuning through introspection System overview • persistent object • GUID: 160-bit SHA-1 hash • secure identification – globally unique and unforgeable • 280 unique objects before collisions (birthday paradox) • floating object replicas: independent of location • encrypted data • read • try fast probabilistic replica search (Bloom filter) • fallback to slower deterministic search (Tapestry) • write • update with predicates [as in Bayou – what is Bayou?] • creates new version What is Bayou The Bayou System (Xerox PARC) is a platform of replicated, highly-available, variable-consistency, databases on which collaborative applications can be built. It caters to portable devices having intermittent connections. System overview • application interface • sessions: sequence of read/writes • session guarantees [Bayou] • loose consistency levels, ACID • active and archival forms • active: latest version, with update handle • archive: erasure coded read-only version • dynamic optimization • object location • degree of replication Tentative Updates: Epidemic Dissemination Committed Updates: Multicast Dissemination naming • self-certifying path names (Mazières) • object GUID = hash of owner key and readable name • create hierarchies using “directory” objects • read restriction • through client encryption of data • write restriction, access control • associate ACL lists with object, respected by servers addressing • address an object by its GUID • message: GUID, random number, small predicate • route to closest GUID replica matching predicate • combines data location and routing: • no central name service to attack • save one round-trip for location discovery • routing • fast, probabilistic search algorithm • slow, deterministic search algorithm routing • fast, probabilistic search algorithm • Bloom filter • probabilistic set membership test using bit vector • n-bit vector generated from n hashes of each set element • filter is union (OR) of all bit vectors • attenuated Bloom filter • array of d Bloom filters • i th Bloom filter is union of all <i -hop nodes • slow, deterministic algorithm • Tapestry addressing and routing deterministic probabilistic Attenuated Bloom Filter updates • Updates based on versioning and conflict resolution • i.e. no locking • update: actions with predicates • commit – apply action of first true predicate • abort – no true predicates • conflict resolution on encrypted data • possible predicates: • compare-version, compare-size, compare-block, search • possible actions: • replace-block, insert-block, delete-block, append archival • produced when objects idle • use erasure codes (redundant fragmentation) • simplest example: parity bit • need any (n-1) out of n fragments • interleaved Reed-Solomon codes, Tornado codes • fragmentation improves reliability • “deep archival storage” • sweeper processes ensure replication sustained over time • fragmentation improves performance Erasure Codes Simple parity bits, or generalized Reed-Solomon codes can be used to implement it. Floating Replica and Deep Archival Coding Full Copy Ver1: 0x34243 Ver2: 0x49873 Ver3: … Conflict Resolution Logs Floating Replica Full Copy Ver1: 0x34243 Ver2: 0x49873 Ver3: … Conflict Resolution Full Copy Ver1: 0x34243 Ver2: 0x49873 Ver3: … Conflict Resolution Logs Erasure-coded Fragments dynamic optimization (introspection) • observation modules • collect and summarize information • incrementally update system database • optimization modules • periodically process the observation database • cluster recognition: group related objects • replica management: maintain replica number and location • periodic migration: work-home-work-home… • maintenance: routing, dissemination, availability, durability