OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz, et al ASPLOS 2000

OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz, et al ASPLOS 2000 OceanStore      1010 users, each with 10,000 files Global scale information storage. Mobile access to information in a uniform and highly available way. Servers are untrusted. Caches data anywhere, anytime. Monitors usage patterns. OceanStore [Rhea et al. 2003] Main Goals   Untrusted infrastructure Nomadic data Example applications     Groupware and PIM Email Digital libraries Scientific data repository Personal information management tools: calendars, emails, contact lists Example applications     Groupware and PIM Email Digital libraries Scientific data repository Challenges: scaling, consistency, migration, network failures Storage Organization OceanStore data object ~= file  Ordered sequence of read-only versions  Every version of every object kept forever  Can be used as backup   An object contains metadata, data, and references to previous versions Storage Organization  A stream of objects identified by AGUID Active globally-unique identifier  Cryptographically-secure hash of an application-specific name and the owner’s public key  Prevents namespace collisions  Storage Organization  Each version of data object stored in a Btree like data structure  Each block has a BGUID  Cryptographically-secure hash of the block content Each version has a VGUID  Two versions may share blocks  Storage Organization [Rhea et al. 2003] Access Control  Restricting readers:   Symmetric encryption key distributed to allowed readers. Restricting writers:    ACL. Signed writes. ACL for object chosen with signed certificate. Location and Routing Attenuated Bloom Filters Find 11010 Location and Routing Plaxton-like trees Updating data      All data is encrypted. A set of predicates is evaluated in order. The actions of the earliest true predicate are applied. Update is logged if it commits or aborts. Predicates:   compare-version, compare-block, compare-size, search Actions  replace-block, insert-block, delete-block, append Application-Specific Consistency An update is the operation of adding a new version to the head of a version stream  Updates are applied atomically  Represented as an array of potential actions  Each guarded by a predicate  Application-Specific Consistency  Example actions Replacing some bytes  Appending new data to an object  Truncating an object   Example predicates Check for the latest version number  Compare bytes  Application-Specific Consistency  To implement ACID semantic Check for readers  If none, update   Append to a mailbox   No checking No explicit locks or leases Application-Specific Consistency  Predicate for reads  Examples Can’t read something older than 30 seconds  Only can read data from a specific time frame  Replication and Consistency A data object is a sequence of read-only versions, consisting of read-only blocks, named by BGUIDs  No issues for replication  The mapping from AGUID to the latest VGUID may change  Use primary-copy replication  Serializing updates    A small primary tier of replicas run a Byzantine agreement protocol. A secondary tier of replicas optimistically propagate the update using an epidemic protocol. Ordering from primary tier is multicasted to secondary replicas. The Full Update Path Deep Archival Storage    Data is fragmented. Each fragment is an object. Erasure coding is used to increase reliability. Introspection computation optimization Uses: •Cluster recognition •Replica management •Other uses observation Software Architecture  Java atop the Staged Event Driven Architecture (SEDA) Each subsystem is implemented as a stage  With each own state and thread pool  Stages communicate through events  50,000 semicolons by five graduate students and many undergrad interns  Software Architecture Language Choice  Java: speed of development Strongly typed  Garbage collected  Reduced debugging time  Support for events  Easy to port multithreaded code in Java   Ported to Windows 2000 in one week Language Choice  Problems with Java: Unpredictability introduced by garbage collection  Every thread in the system is halted while the garbage collector runs  Any on-going process stalls for ~100 milliseconds  May add several seconds to requests travel cross machines  Experimental Setup  Two test beds  Local cluster of 42 machines at Berkeley Each with 2 1.0 GHz Pentium III  1.5GB PC133 SDRAM  2 36GB hard drives, RAID 0  Gigabit Ethernet adaptor  Linux 2.4.18 SMP  Experimental Setup  PlanetLab, ~100 nodes across ~40 sites 1.2 GHz Pentium III, 1GB RAM  ~1000 virtual nodes  Storage Overhead  For 32 choose 16 erasure encoding   2.7x for data > 8KB For 64 choose 16 erasure encoding  4.8x for data > 8KB The Latency Benchmark A single client submits updates of various sizes to a four-node inner ring  Metric: Time from before the request is signed to the signature over the result is checked  Update 40 MB of data over 1000 updates, with 100ms between updates  The Latency Benchmark Latency Breakdown Update Latency (ms) Key Size 512b 1024b Update 5% Size Time Median Time 95% Time Phase Time (ms) Check 0.3 Serialize 6.1 4kB 39 40 41 2MB 1037 1086 1348 Apply 1.5 4kB 98 99 100 Archive 4.5 2MB 1098 1150 1448 Sign 77.8 The Throughput Microbenchmark A number of clients submit updates of various sizes to disjoint objects, to a fournode inner ring  The clients  Create their objects  Synchronize themselves  Update the object as many time as possible for 100 seconds  The Throughput Microbenchmark Archive Retrieval Performance Populate the archive by submitting updates of various sizes to a four-node inner ring  Delete all copies of the data in its reconstructed form  A single client submits reads  Archive Retrieval Performance  Throughput: 1.19 MB/s (Planetlab)  2.59 MB/s (local cluster)   Latency  ~30-70 milliseconds The Stream Benchmark  Ran 500 virtual nodes on PlanetLab Inner Ring in SF Bay Area  Replicas clustered in 7 largest P-Lab sites   Streams updates to all replicas One writer - content creator – repeatedly appends to data object  Others read new versions as they arrive  Measure network resource consumption  The Stream Benchmark The Tag Benchmark Measures the latency of token passing  OceanStore 2.2 times slower than TCP/IP  The Andrew Benchmark File system benchmark  4.6x than NFS in read-intensive phases  7.3x slower in write-intensive phases  Bloom Filters   Compact data structures for a probabilistic representation of a set Appropriate to answer membership queries [Koloniari and Pitoura] Bloom Filters (cont’d) Bit vector v Element a 1 H1(a) = P1 H2(a) = P2 1 H3(a) = P3 1 H4(a) = P4 1 m bits Query for b: check the bits at positions H1(b), H2(b), ..., H4(b). back Pair-Wise Reconciliation Site A V0 : (x0) write x V1 : (x1) Site B V0: (x0) write x V2: (x2) Site C V0 : (x0) write x V3 : (x3) V4 : (x4) V5: (x5) [Kang et al. 2003] Hash History Reconciliation H0 H0 Site A V0 V1 H1 Site B V2 back Site C H0 H0 V3 H3 H2 V4 H0 H1 V5 H0 H2 H4 H1 H4 Hi = hash (Vi) H2 H3 H5 Erasure Codes n Message Encoding Algorithm cn Encoding Transmission n Received Decoding Algorithm n Message back [Mitzenmacher]

OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz, et al ASPLOS 2000

Related documents

Products

Support

OceanStore: An Architecture for Global-Scale Persistent Storage John Kubiatowicz, et al ASPLOS 2000

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib