OceanStore
Global-Scale Persistent Storage
Ying Lu
1
Give Credits
• Many slides are from John Kubiatowicz,
University of California at Berkeley
• I have modified them and added new
slides
2
Motivation
• Personal Information Mgmt is the Killer App
– Information management, analysis, aggregation,
dissemination, filtering for the individual
• Information Technology as a Utility
– Continuous service delivery, on a planetary scale, on top
of a highly dynamic information base
3
OceanStore Context:
Ubiquitous Computing (I)
• Computing everywhere:
– Desktop, Laptop, Palmtop, Cars, Cellphones
– Shoes? Clothing? Walls?
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, Satellite, laser
4
OceanStore Context:
Ubiquitous Computing (II)
• Rise of the thin-client metaphor:
– Services provided by interior of network
– Incredibly thin clients on the leaves
• MEMS devices: sensors + CPU + wireless net in 1 mm³
• Mobile society: people move and devices are
disposable
5
What do we need for personal
information management?
6
Questions about information:
• Where is persistent information stored?
– 20th-century tie between location and content outdated
• How is it protected?
– Can disgruntled employee of ISP sell your secrets?
– Can’t trust anyone (how paranoid are you?)
• Can we make it indestructible?
– Want our data to survive “the big one”!
– Highly resistant to hackers (denial of service)
– Wide-scale disaster recovery
• Is it hard to manage?
– Worst failures are human-related
– Want automatic (introspective) diagnosis and repair
7
First Observation:
Want Utility Infrastructure
• Mark Weiser from Xerox:
Transparent computing is the ultimate goal
– Computers should disappear into the background
• In storage context:
– Don’t want to worry about backup, obsolescence
– Need lots of resources to make data secure and
highly available, BUT don’t want to own them
– Outsourcing of storage already very popular
• Pay monthly fee and your “data is out there”
– Simple payment interface → one bill from one company
8
Second Observation:
Need wide-scale deployment
• Many components with geographic separation
– System not disabled by natural disasters
– Can adapt to changes in demand and regional outages
• Wide-scale use and sharing also requires wide-scale deployment
– Bandwidth increasing rapidly, but latency bounded by
speed of light
• Handling many people with same system leads
to economies of scale
9
OceanStore:
Everyone’s data, One big Utility
“The data is just out there”
• Separate information from location
– Locality is only an optimization (an important one!)
– Wide-scale coding and replication for durability
• All information is globally identified
– Unique identifiers are hashes over names & keys
– Single uniform lookup interface
– No centralized namespace required
10
Amusing back of the envelope
calculation
(courtesy Bill Bolosky, Microsoft)
• How many files in the OceanStore?
– Assume 10¹⁰ people in the world
– Say 10,000 files/person (very conservative?)
– So 10¹⁴ files in OceanStore!
– If files average 1 GB (not likely), that is roughly a mole of bytes!
Truly impressive number of elements…
… but small relative to physical constants
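A quick way to sanity-check these numbers in Python, using the assumed figures from this slide (not measured data):

people = 10**10                     # assumed world population
files_per_person = 10**4            # assumed 10,000 files each
total_files = people * files_per_person
print(total_files)                  # 10**14 files
total_bytes = total_files * 10**9   # if files averaged ~1 GB each
print(total_bytes / 6.022e23)       # ~0.17 of a mole of bytes -- "about a mole"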
11
Utility-based Infrastructure
[Diagram: a utility-based infrastructure spanning cooperating providers such as Canadian OceanStore, Sprint, AT&T, Pac Bell, and IBM]
• Service provided by confederation of companies
– Monthly fee paid to one service provider
– Companies buy and sell capacity from each other
12
Outline
• Motivation
• Properties of the OceanStore
• Specific Technologies and approaches:
– Naming and Data Location
– Conflict resolution on encrypted data
– Replication and Deep archival storage
– Introspective computing for optimization and repair
– Economic models
• Conclusion
13
Ubiquitous Devices → Ubiquitous Storage
• Consumers of data move, change from one
device to another, work in cafes, cars, airplanes,
the office, etc.
• Properties REQUIRED for OceanStore storage
substrate:
– Strong Security: data encrypted in the infrastructure;
resistance to monitoring and denial of service attacks
– Coherence: too much data for naïve users to keep
coherent “by hand”
– Automatic replica management and optimization: huge
quantities of data cannot be managed manually
– Simple and automatic recovery from disasters: probability
of failure increases with size of system
– Utility model: world-scale system requires cooperation
across administrative boundaries
14
OceanStore Technologies I:
Naming and Data Location
• Requirements:
– System-level names should help to authenticate data
– Route to nearby data without global communication
– Don’t inhibit rapid relocation of data
• OceanStore approach:
Two-level search with embedded routing
– Underlying namespace is flat and built from secure
cryptographic hashes (160-bit SHA-1)
– Search process combines quick, probabilistic search with
slower guaranteed search
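As a concrete illustration of this flat, self-certifying namespace, here is a minimal Python sketch; the exact inputs to the hash (owner key plus human-readable name) are an assumption for illustration, not the precise OceanStore format:

import hashlib

def make_guid(owner_public_key: bytes, name: str) -> str:
    # 160-bit SHA-1 over the owner's key and the human-readable name;
    # the resulting identifier needs no centralized namespace
    return hashlib.sha1(owner_public_key + name.encode()).hexdigest()

guid = make_guid(b"...owner public key bytes...", "/photos/2001/beach.jpg")
print(len(guid) * 4, guid)   # 160 bits, usable as a flat lookup key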
15
Universal Location Facility
• Takes a 160-bit globally unique identifier (GUID) and
returns the nearest object that matches
[Diagram: a universal name resolves, through the global object resolution root structure, to the object's floating replica holding the active data, update OID, and commit/checkpoint logs; earlier versions (Version OID1, OID2, OID3) are kept as erasure-coded archival copies or snapshots]
16
Routing
Two-tiered approach
• Fast probabilistic routing algorithm
– Entities that are accessed frequently are likely
to reside close to where they are being used
(ensured by introspection)
Self-optimizing
• Slower, guaranteed hierarchical routing
method
17
Probabilistic Routing Algorithm
self-optimizing, self-protecting
• Bloom filter on each node; attenuated Bloom filter on each
directed edge
– Reliability factors depend on the depth in the attenuated
Bloom filter array
[Diagram: nodes n1–n4 with their Bloom filters; a query for X (11010, i.e. bits 0, 1, and 3 set) is forwarded hop by hop (1st, 2nd, 3rd) along the edges whose attenuated filters may contain X]
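A minimal, self-contained sketch of the idea; the filter size, hash scheme, and helper names (bloom_bits, probe) are toy assumptions, not the OceanStore API:

import hashlib

M = 64  # bits per Bloom filter (toy size)

def bloom_bits(key, k=3):
    # k bit positions for `key`, derived from SHA-1 (toy parameters)
    h = hashlib.sha1(key.encode()).digest()
    return {h[i] % M for i in range(k)}

def make_filter(keys):
    bits = set()
    for key in keys:
        bits |= bloom_bits(key)
    return bits

def maybe_contains(filt, key):
    return bloom_bits(key) <= filt

# attenuated Bloom filter for one outgoing edge: level d summarizes
# the objects believed to be d hops away along that edge
edge_filter = [make_filter(["objA"]),            # 1 hop away
               make_filter(["objB", "objC"])]    # 2 hops away

def probe(attenuated, key):
    # return the shallowest level that may hold `key`, or None
    for depth, filt in enumerate(attenuated, start=1):
        if maybe_contains(filt, key):
            return depth
    return None

print(probe(edge_filter, "objB"))   # -> 2 here: objB appears two hops away
                                    # (Bloom filters can give false positives)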
18
Hierarchical Routing Algorithm
• Based on Plaxton scheme
• Every server in the system is assigned a
random node-ID
• Object’s root
– Each object is mapped to a single node whose node-ID
matches the object’s GUID in the most bits (starting from
the least significant)
• Information about the GUID (such as its location) is stored
at its root
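A small sketch of the root mapping and the resulting hop-by-hop routing, assuming 4-digit hexadecimal node-IDs as a toy version of the real 160-bit IDs; function names are illustrative only:

def shared_suffix_len(a: str, b: str) -> int:
    # number of matching low-order digits between two IDs
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def plaxton_root(guid: str, node_ids: list) -> str:
    # the object's root: the node whose ID matches the GUID in the most
    # low-order digits (ties broken by node-ID just to be deterministic)
    return max(node_ids, key=lambda nid: (shared_suffix_len(guid, nid), nid))

def next_hop(current: str, guid: str, neighbors: list):
    # forward to any neighbor that matches one more low-order digit
    need = shared_suffix_len(current, guid) + 1
    for nid in neighbors:
        if shared_suffix_len(nid, guid) >= need:
            return nid
    return None   # no better neighbor: this node acts as the (local) root

nodes = ["79FE", "23FE", "44FE", "43FE", "ABFE", "1290"]
print(plaxton_root("43FE", nodes))        # -> 43FE
print(next_hop("1290", "43FE", nodes))    # -> 79FE (shares a longer suffix)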
19
Construct Plaxton Mesh
[Diagram: constructing the Plaxton mesh; each node (e.g. 1324) fills neighbor levels 1–4 with nodes whose IDs share progressively longer suffixes with its own, such as 3714, 0324, 2344, 1624, 7144, …]
20
Basic Plaxton Mesh
Incremental suffix-based routing
[Diagram: a message addressed to GUID 0x43FE is forwarded hop by hop, resolving one more low-order digit at each step (…E → …FE → …3FE → 0x43FE), across nodes such as 0x1290, 0x79FE, 0x23FE, 0x44FE until it reaches the root node 0x43FE]
21
Use of Plaxton Mesh
Randomization and Locality
22
OceanStore Enhancements of
the Plaxton Mesh
• Documents have multiple roots (salted hash of the GUID;
see the sketch below)
• Each node has multiple neighbor links
• Searches proceed along multiple paths
– Tradeoff between reliability, performance and
bandwidth?
• Dynamic node insertion and deletion algorithms
– Continuous repair and incremental optimization of
links
self-healing
self-configuration
self-optimizing
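A sketch of how multiple roots might be derived; the exact salting scheme below is an assumption for illustration:

import hashlib

def salted_roots(guid_hex: str, n_roots: int = 4) -> list:
    # derive several independent root IDs by hashing the GUID with small salts;
    # each salted ID maps the same document to a different root node
    return [hashlib.sha1(bytes.fromhex(guid_hex) + bytes([salt])).hexdigest()
            for salt in range(n_roots)]

print(salted_roots("43fe"))   # four distinct 160-bit roots for one document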
23
OceanStore Technologies II:
Rapid Update in an
Untrusted Infrastructure
• Requirements:
– Scalable coherence mechanism which can operate directly
on encrypted data without revealing information
– Handle Byzantine failures
– Rapid dissemination of committed information
• OceanStore Approach:
– Operations-based interface using conflict resolution
• Modeled after Xerox Bayou → update packets include
predicate/update pairs which operate on encrypted data
– User signs updates and the principal party signs commits
– Committed data multicast to clients
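A hypothetical sketch of what such an update might look like and how a replica could evaluate it; the field names and predicate kinds here are invented for illustration, not the real OceanStore format:

# an update: an ordered list of (predicate, action) pairs plus the writer's signature
update = {
    "object":    "guid-...",
    "pairs": [
        {"predicate": ("version-equals", 7), "action": ("append", "ciphertext-A")},
        {"predicate": ("always",),           "action": ("abort-and-notify",)},
    ],
    "signature": "signed-by-writer",
}

def evaluate(replica_version: int, update: dict):
    # the first predicate that holds against the replica state selects the action;
    # note that nothing here requires decrypting the payload
    for pair in update["pairs"]:
        kind = pair["predicate"][0]
        if kind == "always" or (kind == "version-equals"
                                and pair["predicate"][1] == replica_version):
            return pair["action"]
    return None

print(evaluate(7, update))   # -> ('append', 'ciphertext-A')
print(evaluate(9, update))   # -> ('abort-and-notify',)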
24
Update Model
• Concurrent updates w/o wide-area locking
– Conflict resolution
• Update serialization
• A master replica?
• Role of primary tier of replicas
– All updates submitted to primary tier of replicas which
chooses a final total order by following Byzantine
agreement protocol
• A secondary tier of replicas
– The result of the updates is multicast down the
dissemination tree to all the secondary replicas
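A toy sketch of this two-tier flow; sorting by timestamp below is only a stand-in for the Byzantine agreement that the primary tier actually runs:

class Replica:
    # a secondary replica that simply applies the committed order it receives
    def __init__(self, name):
        self.name, self.log = name, []
    def deliver(self, ordered_updates):
        self.log.extend(ordered_updates)

def commit_epoch(tentative_updates, dissemination_tree):
    # primary tier: agree on ONE total order (stand-in: sort by timestamp),
    # then multicast the result down the dissemination tree
    order = sorted(tentative_updates, key=lambda u: u["timestamp"])
    for replica in dissemination_tree:
        replica.deliver(order)
    return order

secondaries = [Replica("r1"), Replica("r2")]
commit_epoch([{"timestamp": 2, "op": "b"}, {"timestamp": 1, "op": "a"}], secondaries)
print([u["op"] for u in secondaries[0].log])   # -> ['a', 'b']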
25
Tentative Updates:
Epidemic Dissemination
26
Committed Updates:
Multicast Dissemination
27
Data Coding Model
• Two distinct forms of data: active and archival
• Active Data in Floating Replicas
– Latest version of the object
• Archival Data in Erasure Coded Fragments
– A permanent, read-only version of the object
– During commit, previous version coded with erasure-code
and spread over 100s or 1000s of nodes
– Advantage: any 1/2 or 1/4 of the fragments regenerates the data
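To make the "any 1/2 of the fragments" property concrete, here is a toy rate-1/2 code over GF(257) with n = 4 fragments and k = 2 data symbols; real deployments use far larger n and k and byte-oriented codes, but the recovery property is the same:

P = 257   # prime field size; data symbols are bytes (0..255)

def encode(d0, d1, n=4):
    # fragment i is the evaluation of the polynomial d0 + d1*x at x = i
    return [(x, (d0 + d1 * x) % P) for x in range(1, n + 1)]

def decode(two_fragments):
    # any two (x, y) points determine the line y = d0 + d1*x over GF(257)
    (x1, y1), (x2, y2) = two_fragments
    inv = pow((x2 - x1) % P, P - 2, P)   # modular inverse via Fermat's little theorem
    d1 = ((y2 - y1) * inv) % P
    d0 = (y1 - d1 * x1) % P
    return d0, d1

frags = encode(42, 99)
assert decode([frags[3], frags[1]]) == (42, 99)   # any 2 of the 4 fragments suffice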
28
Floating Replica and
Deep Archival Coding
[Diagram: three floating replicas, each a full copy holding the version history (Ver1: 0x34243, Ver2: 0x49873, Ver3: …) and conflict resolution logs; the versions are also dispersed as erasure-coded fragments for deep archival storage]
29
Proactive Self-Maintenance
• Continuous testing and repair of information
– Slow sweep through all information to make sure there are
sufficient erasure-coded fragments (see the sketch below)
– Continuously reevaluate risk and redistribute data
– Slow sweep and repair of metadata/search trees
• Continuous online self-testing of HW and SW
– Detects flaky, failing, or buggy components via:
• fault injection: triggering hardware and software error
handling paths to verify their integrity/existence
• stress testing: pushing HW/SW components past normal
operating parameters
• scrubbing: periodic restoration of potentially “decaying”
hardware or software state
– Automates preventive maintenance
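A minimal sketch of the slow sweep mentioned above; the data layout and the regenerate helper are invented for illustration:

def regenerate(obj, count):
    # stand-in for re-encoding the object and disseminating fresh fragments
    obj["fragments"].extend({"alive": True} for _ in range(count))

def sweep_and_repair(objects, min_fragments):
    # slow sweep: keep every object above its redundancy threshold
    for obj in objects:
        live = [f for f in obj["fragments"] if f["alive"]]
        if len(live) < min_fragments:
            regenerate(obj, min_fragments - len(live))

store = [{"fragments": [{"alive": True}, {"alive": False}, {"alive": True}]}]
sweep_and_repair(store, min_fragments=4)
print(sum(f["alive"] for f in store[0]["fragments"]))   # -> 4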
30
OceanStore Technologies IV:
Introspective Optimization
• Requirements:
– Reasonable job on global-scale optimization problem
• Take advantage of locality whenever possible
• Sensitivity to limited storage and bandwidth at endpoints
– Repair of data structures, increasing of redundancy
– Stability in chaotic environment → Active Feedback
• OceanStore Approach:
– Introspective monitoring and analysis of relationships to cluster
information by relatedness
– Time series-analysis of user and data motion
– Rearrangement and replication in response to monitoring
• Clustered prefetching: fetch related objects
• Proactive prefetching: get data there before needed
31
Example: Client Introspection
• Client observer and optimizer components
– Greedy agents working on behalf of the client
• Watches client activity/combines with historical info
• Performs clustering and time-series analysis
• Forwards results to infrastructure (privacy issues!)
– Monitoring state of network to adapt behaviour
• Typical Actions:
– Cluster related files together
– Prefetch files that will be needed soon
– Create/destroy floating replicas
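A toy sketch of the kind of clustering/prefetch decision an observer might make from the access history; purely illustrative, since the real analysis involves richer time-series work:

from collections import defaultdict

follows = defaultdict(lambda: defaultdict(int))   # object -> successor -> count

def observe(access_stream):
    # record which object tends to be accessed right after which
    for prev, nxt in zip(access_stream, access_stream[1:]):
        follows[prev][nxt] += 1

def prefetch_candidates(obj, k=2):
    # objects most often seen right after `obj` are fetched together with it
    ranked = sorted(follows[obj].items(), key=lambda kv: -kv[1])
    return [o for o, _ in ranked[:k]]

observe(["A", "B", "A", "B", "C", "A", "B"])
print(prefetch_candidates("A"))   # -> ['B']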
32
OceanStore Conclusion
• The Time is now for a Universal Data Utility
– Ubiquitous computing and connectivity are (almost) here!
– Confederation of utility providers is right model
• OceanStore holds all data, everywhere
– Local storage is a cache on global storage
– Provides security in an untrusted infrastructure
– Large scale system has good statistical properties
• Use of introspection for performance and stability
• Quality of individual servers enhances reliability
• Exploits economies of scale to:
– Provide high-availability and extreme survivability
– Lower maintenance cost:
• self-diagnosis and repair
• Insensitivity to technology changes:
Just unplug one set of servers, plug in others
33