OceanStore
Toward Global-Scale, Self-Repairing,
Secure and Persistent Storage
John Kubiatowicz
University of California at Berkeley
OceanStore Context:
Ubiquitous Computing
• Computing everywhere:
– Desktop, Laptop, Palmtop
– Cars, Cellphones
– Shoes? Clothing? Walls?
• Connectivity everywhere:
– Rapid growth of bandwidth in the interior of the net
– Broadband to the home and office
– Wireless technologies such as CDMA, satellite, laser
• Where is persistent data????
Utility-based Infrastructure?
[Figure: OceanStore as a federation of utility providers, e.g. "Canadian OceanStore", Sprint, AT&T, Pac Bell, IBM]
• Data service provided by storage federation
• Cross-administrative domain
• Pay for Service
OceanStore:
Everyone’s Data, One Big Utility
“The data is just out there”
• How many files in the OceanStore?
– Assume 10^10 people in world
– Say 10,000 files/person (very conservative?)
– So 10^14 files in OceanStore!
– If 1 gig files (ok, a stretch), get 1 mole of bytes! (worked out below)
Truly impressive number of elements…
… but small relative to physical constants
Aside: new results: 1.5 Exabytes/year (1.5×10^18 bytes)
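A quick back-of-the-envelope check of the arithmetic above (all figures are the slide's order-of-magnitude estimates):

\[ 10^{10}\ \text{people} \times 10^{4}\ \tfrac{\text{files}}{\text{person}} = 10^{14}\ \text{files}, \qquad 10^{14}\ \text{files} \times 10^{9}\ \tfrac{\text{bytes}}{\text{file}} = 10^{23}\ \text{bytes} \approx \tfrac{1}{6}\,N_A, \]

where N_A ≈ 6.02 × 10^23 is Avogadro's number, so "one mole of bytes" holds to within an order of magnitude.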
OceanStore Assumptions
• Untrusted Infrastructure:
– The OceanStore is comprised of untrusted components
– Individual hardware has finite lifetimes
– All data encrypted within the infrastructure
• Responsible Party:
– Some organization (i.e. service provider) guarantees
that your data is consistent and durable
– Not trusted with content of data, merely its integrity
• Mostly Well-Connected:
– Data producers and consumers are connected to a high-bandwidth network most of the time
– Exploit multicast for quicker consistency when possible
• Promiscuous Caching:
– Data may be cached anywhere, anytime
The Peer-To-Peer View:
Irregular Mesh of “Pools”
Key Observation:
Want Automatic Maintenance
• Can’t possibly manage billions of servers by hand!
• System should automatically:
– Adapt to failure
– Exclude malicious elements
– Repair itself
– Incorporate new elements
• System should be secure and private
– Encryption, authentication
• System should preserve data over the long term
(accessible for 1000 years):
– Geographic distribution of information
– New servers added from time to time
– Old servers removed from time to time
– Everything just works
Outline:
Three Technologies and
a Principle
Principle: ThermoSpective Systems Design
– Redundancy and Repair everywhere
1. Structured, Self-Verifying Data
– Let the Infrastructure Know What is important
2. Decentralized Object Location and Routing
– A new abstraction for routing
3. Deep Archival Storage
– Long Term Durability
ThermoSpective
Systems
Goal: Stable, large-scale systems
• State of the art:
– Chips: 10^8 transistors, 8 layers of metal
– Internet: 10^9 hosts, terabytes of bisection bandwidth
– Societies: 10^8 to 10^9 people, 6 degrees of separation
• Complexity is a liability!
– More components ⇒ Higher failure rate
– Chip verification > 50% of design team
– Large societies unstable (especially when centralized)
– Small, simple, perfect components combine to generate complex emergent behavior!
• Can complexity be a useful thing?
– Redundancy and interaction can yield stable behavior
– Better figure out new ways to design things…
Question: Can we exploit Complexity
to our Advantage?
Moore’s Law gains ⇒ Potential for Stability
The Thermodynamic Analogy
• Large Systems have a variety of latent order
– Connections between elements
– Mathematical structure (erasure coding, etc)
– Distributions peaked about some desired behavior
• Permits “Stability through Statistics”
– Exploit the behavior of aggregates (redundancy)
• Subject to Entropy
– Servers fail, attacks happen, system changes
• Requires continuous repair
– Apply energy (i.e. through servers) to reduce entropy
The Biological Inspiration
• Biological Systems are built from (extremely) faulty components, yet:
– They operate with a variety of component failures ⇒ Redundancy of function and representation
– They have stable behavior ⇒ Negative feedback
– They are self-tuning ⇒ Optimization of common case
• Introspective (Autonomic) Computing:
– Components for performing ("Dance")
– Components for monitoring and model building ("Monitor")
– Components for continuous adaptation ("Adapt")
[Figure: Dance → Monitor → Adapt introspection cycle]
ThermoSpective
[Figure: Dance → Monitor → Adapt introspection cycle applied over many redundant components]
• Many Redundant Components (Fault Tolerance)
• Continuous Repair (Entropy Reduction)
Object-Based
Storage
Let the Infrastructure help You!
• End-to-end and everywhere else:
– Must distribute responsibility to guarantee:
QoS, Latency, Availability, Durability
• Let the infrastructure understand the
vocabulary or semantics of the application
– Rules of correct interaction?
OceanStore Data Model
• Versioned Objects
– Every update generates a new version
– Can always go back in time (Time Travel)
• Each Version is Read-Only
– Can have permanent name
– Much easier to repair
• An Object is a signed mapping between
permanent name and latest version
– Write access control/integrity involves managing
these mappings
[Figure: comet analogy: updates arrive at the head of the object (the latest version) while a trail of read-only versions streams out behind]
Secure Hashing
[Figure: DATA → SHA-1 → 160-bit GUID]
• Read-only data: GUID is hash over actual data
– Uniqueness and Unforgeability: the data is what it is!
– Verification: check hash over data
• Changeable data: GUID is combined hash over a
human-readable name + public key
– Uniqueness: GUID space selected by public key
– Unforgeability: public key is indelibly bound to GUID
• Thermodynamic insight: Hashing makes
“data particles” unique, simplifying interactions
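A minimal Java sketch of the two GUID flavors just described, using the JDK's built-in SHA-1; the class and method names are illustrative, not the actual OceanStore API:

import java.security.MessageDigest;

public class Guid {
    // Read-only data: the GUID is the SHA-1 hash of the block contents.
    static byte[] dataGuid(byte[] block) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(block);
    }

    // Changeable data: the GUID is a hash over a human-readable name
    // combined with the owner's public key.
    static byte[] activeGuid(String name, byte[] publicKey) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(name.getBytes("UTF-8"));
        sha1.update(publicKey);
        return sha1.digest();
    }

    // Verification: recompute the hash and compare it to the claimed GUID.
    static boolean verify(byte[] block, byte[] claimedGuid) throws Exception {
        return MessageDigest.isEqual(dataGuid(block), claimedGuid);
    }
}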
Self-Verifying Objects
[Figure: structure of a self-verifying object. AGUID = hash{name + keys} permanently names the object. Each version VGUID_i is the root of a read-only B-tree of indirect blocks over data blocks (d1 … d9); an update produces VGUID_i+1 by copy-on-write, sharing unchanged blocks and keeping a back pointer to the previous version. A signed heartbeat, {AGUID, VGUID, Timestamp}_signed, binds the permanent name to the latest version. Reads are served from heartbeats plus read-only data; updates create new versions.]
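A hedged Java sketch of checking such a heartbeat with standard JDK signatures. Only the {AGUID, VGUID, timestamp} plus signature structure comes from the slide; the field encoding here is an assumption, and the real system uses threshold signatures from the inner ring rather than a single RSA key.

import java.security.PublicKey;
import java.security.Signature;

class Heartbeat {
    byte[] aguid;      // permanent object name
    byte[] vguid;      // latest version
    long   timestamp;  // freshness
    byte[] sig;        // signature over the three fields above

    // Verify that the signature covers (AGUID, VGUID, timestamp).
    boolean verify(PublicKey signerKey) throws Exception {
        Signature s = Signature.getInstance("SHA1withRSA");
        s.initVerify(signerKey);
        s.update(aguid);
        s.update(vguid);
        s.update(longToBytes(timestamp));
        return s.verify(sig);
    }

    private static byte[] longToBytes(long v) {
        byte[] b = new byte[8];
        for (int i = 7; i >= 0; i--) { b[i] = (byte) v; v >>>= 8; }
        return b;
    }
}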
The Path of an
OceanStore Update
[Figure: clients submit updates to the inner-ring servers, which order them and push the results out to second-tier caches over multicast trees]
OceanStore Consistency via
Conflict Resolution
• Consistency is a form of optimistic concurrency
– An update packet contains a series of predicate-action pairs which operate on encrypted data (sketched after this list)
– Each predicate is tried in turn:
• If none match, the update is aborted
• Otherwise, the action of the first true predicate is applied
• Role of Responsible Party
– All updates submitted to Responsible Party which
chooses a final total order
– Byzantine agreement with threshold signatures
• This is powerful enough to synthesize:
– ACID database semantics
– release consistency (build and use MCS-style locks)
– Extremely loose (weak) consistency
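A minimal Java sketch of the predicate/action semantics above; Predicate, Action, and Update are hypothetical stand-ins for whatever the real update format contains, shown only to make the control flow concrete.

import java.util.List;

interface Predicate { boolean holds(byte[] objectState); }
interface Action    { byte[] apply(byte[] objectState); }

class PredicateActionPair {
    final Predicate predicate;
    final Action action;
    PredicateActionPair(Predicate p, Action a) { predicate = p; action = a; }
}

class Update {
    final List<PredicateActionPair> pairs;
    Update(List<PredicateActionPair> pairs) { this.pairs = pairs; }

    // Try each predicate in turn; apply the action of the first one that
    // holds, otherwise abort (return null, leaving the state unchanged).
    byte[] resolve(byte[] state) {
        for (PredicateActionPair pa : pairs) {
            if (pa.predicate.holds(state)) {
                return pa.action.apply(state);
            }
        }
        return null; // no predicate matched: update aborted
    }
}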
Self-Organizing Soft-State
Replication
• Simple algorithms for placing replicas on nodes
in the interior
– Intuition: locality properties
of Tapestry help select positions
for replicas
– Tapestry helps associate
parents and children
to build multicast tree
• Preliminary results
show that this is effective
Decentralized
Object Location
and Routing
Locality, Locality, Locality
One of the defining principles
• The ability to exploit local resources over remote
ones whenever possible
• “-Centric” approach
– Client-centric, server-centric, data source-centric
• Requirements:
– Find data quickly, wherever it might reside
• Locate nearby object without global communication
• Permit rapid object migration
– Verifiable: can’t be sidetracked
• Data name cryptographically related to data
Enabling Technology: DOLR
(Decentralized Object Location and Routing)
[Figure: the DOLR abstraction. Objects publish their GUIDs (GUID1, GUID2) at the nodes that store them; the DOLR layer routes messages addressed to a GUID to a nearby copy.]
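In code form, the DOLR abstraction amounts to a small interface, roughly as follows (a paraphrase of the interface described in the Tapestry literature; the exact method names and signatures here are assumptions):

// Hedged sketch of a DOLR-style interface.
interface DOLR {
    // Advertise that the local node stores a replica of this object.
    void publish(byte[] objectGuid);

    // Withdraw a previously published replica.
    void unpublish(byte[] objectGuid);

    // Route a message toward some (preferably nearby) replica of the object.
    void routeToObject(byte[] objectGuid, byte[] message);

    // Route a message toward the node whose ID is closest to this GUID.
    void routeToNode(byte[] nodeGuid, byte[] message);
}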
Basic Tapestry Mesh
Incremental Prefix-based Routing
[Figure: a Tapestry routing mesh. Nodes have hex IDs (0xEF97, 0xEF32, 0xE399, 0xE530, 0x099F, 0xE555, 0xFF37, 0xEF34, 0xEF37, 0xEF44, 0xEFBA, 0xEF40, 0xEF31, 0x0999, 0xE324, 0xE932, 0x0921); links are labeled with routing levels 1-4. Each hop resolves one more digit of the destination ID, so a route incrementally matches a longer prefix until it reaches the target node.]
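A toy Java sketch of the incremental prefix matching the mesh illustrates: each hop forwards to a neighbor that agrees with the destination ID in one more digit. The routing-table layout and class names are illustrative assumptions, not Tapestry's actual data structures.

class PrefixRouter {
    String selfId;      // e.g. "EF34"
    String[][] table;   // table[level][hexDigit] -> neighbor ID, or null

    PrefixRouter(String selfId, String[][] table) {
        this.selfId = selfId;
        this.table = table;
    }

    // Length of the shared hex prefix between two IDs.
    static int sharedPrefix(String a, String b) {
        int i = 0;
        while (i < a.length() && i < b.length() && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Pick a neighbor matching the destination in one more digit than we do;
    // null means no better hop is known (we are the closest node we know of).
    String nextHop(String destId) {
        int level = sharedPrefix(selfId, destId);
        if (level == destId.length()) return null;           // exact match
        int digit = Character.digit(destId.charAt(level), 16);
        return table[level][digit];
    }
}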
Use of Tapestry Mesh
Randomization and Locality
Stability under Faults
• Instability is the common case….!
– Small half-life for P2P apps (1 hour????)
– Congestion, flash crowds, misconfiguration, faults
• Must Use DOLR under instability!
– The right thing must just happen
• Tapestry is natural framework to exploit
redundant elements and connections
– Multiple Roots, Links, etc.
– Easy to reconstruct routing and location information
– Stable, repairable layer
• Thermodynamic analogies:
– Heat Capacity of DOLR network
– Entropy of Links (decay of underlying order)
Single Node Tapestry
[Figure: software layers of a single Tapestry node. Applications (OceanStore, application-level multicast, others) sit above an application interface / upcall API; below that are dynamic node management, the routing table and router, and the object pointer DB, layered over network link management and transport protocols.]
It’s Alive!
• PlanetLab global network
– 98 machines at 42 institutions, in North America, Europe, Australia (~60 machines utilized)
– 1.26 GHz Pentium III (1 GB RAM), 1.8 GHz Pentium 4 (2 GB RAM)
– North American machines (2/3) on Internet2
• Tapestry Java deployment
– 6-7 nodes on each physical machine
– IBM Java JDK 1.30
– Node virtualization inside JVM and SEDA
– Scheduling between virtual nodes increases latency
Object Location
[Figure: object-location relative delay penalty, RDP (min, median, 90th percentile; roughly 0-25), plotted against client-to-object RTT ping time in 1 ms buckets (0-200 ms)]
Tradeoff: Storage vs Locality
Management Behavior
[Figure: left, integration latency in ms (roughly 200-2000) vs. size of existing network (0-500 nodes); right, per-node control traffic bandwidth in KB (roughly 0.2-1.4) vs. size of existing network (0-400 nodes)]
• Integration Latency (Left)
– Humps = additional levels
• Cost/node Integration Bandwidth (right)
– Localized!
• Continuous, multi-node insertion/deletion works!
Deep Archival Storage
Two Types of OceanStore Data
• Active Data: “Floating Replicas”
– Per object virtual server
– Interaction with other replicas for consistency
– May appear and disappear like bubbles
• Archival Data: OceanStore’s Stable Store
– m-of-n coding: like a hologram
• Data coded into n fragments, any m of which are sufficient to reconstruct (e.g., m=16, n=64)
• Coding overhead is proportional to n/m (e.g., 4; worked out below)
• The other parameter, the rate, is 1/overhead
– Fragments are cryptographically self-verifying
• Most data in the OceanStore is archival!
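For the slide's example parameters, the overhead and rate work out as

\[ \text{overhead} = \frac{n}{m} = \frac{64}{16} = 4, \qquad \text{rate} = \frac{m}{n} = \frac{16}{64} = \frac{1}{4} = \frac{1}{\text{overhead}}. \]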
Archival Dissemination
of Fragments
Fraction of Blocks Lost
per Year (FBLPY)
• Exploit law of large numbers for durability!
• 6-month repair epoch, FBLPY:
– Replication: 0.03
– Fragmentation: 10^-35
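One standard independent-failure model that yields numbers of this flavor (an assumption; the slide does not spell out its exact model): if each replica or fragment is lost during a repair epoch independently with probability p, then per epoch

\[ P^{\text{repl}}_{\text{lost}} = p^{k} \;\; (k\ \text{replicas}), \qquad P^{\,m\text{-of-}n}_{\text{lost}} = \sum_{i=0}^{m-1} \binom{n}{i} (1-p)^{i}\, p^{\,n-i}, \]

and the binomial tail for m-of-n coding falls off far faster than p^k at comparable storage overhead, which is the law-of-large-numbers advantage cited above.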
The Dissemination Process:
Achieving Failure Independence
[Figure: the dissemination pipeline. A model builder combines probe and node-type information with human input; a set creator uses the resulting model to pick failure-independent server sets; the inner ring then disseminates fragments to those sets.]
Independence Analysis
• Information gathering:
– State of fragment servers (up/down/etc)
• Correlation analysis:
– Use a metric such as mutual information (sketched below)
– Cluster via that metric
– The result partitions servers into uncorrelated clusters
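A minimal Java sketch of computing the mutual-information metric over two servers' up/down availability traces; the clustering step itself is omitted and the class name is illustrative.

class Correlation {
    // Mutual information (in nats) between two boolean availability traces
    // (true = server up at that sample). A value near zero means the two
    // servers' failure behavior looks independent.
    static double mutualInformation(boolean[] x, boolean[] y) {
        int n = Math.min(x.length, y.length);
        double[][] joint = new double[2][2];
        for (int i = 0; i < n; i++) {
            joint[x[i] ? 1 : 0][y[i] ? 1 : 0] += 1.0 / n;
        }
        double[] px = { joint[0][0] + joint[0][1], joint[1][0] + joint[1][1] };
        double[] py = { joint[0][0] + joint[1][0], joint[0][1] + joint[1][1] };
        double mi = 0.0;
        for (int a = 0; a < 2; a++) {
            for (int b = 0; b < 2; b++) {
                if (joint[a][b] > 0) {
                    mi += joint[a][b] * Math.log(joint[a][b] / (px[a] * py[b]));
                }
            }
        }
        return mi;
    }
}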
Active Data Maintenance
[Figure: fragment servers (node IDs 4577, 9098, 1167, 0128, AE87, 3213, 6003, 5544, 3274) linked at Tapestry routing levels L1-L3; neighbors at level L1 form a ring of L1 heartbeats so that nearby servers monitor one another.]
• Tapestry enables “data-driven multicast”
– Mechanism for local servers to watch each other
– Efficient use of bandwidth (locality)
1000-Year Durability?
• Exploiting Infrastructure for Repair
– DOLR permits efficient heartbeat mechanism to
notice:
• Servers going away for a while
• Or, going away forever!
– Continuous sweep through data also possible
– Erasure Code provides Flexibility in Timing
• Data continuously transferred from physical
medium to physical medium
– No “tapes decaying in basement”
– Information becomes fully Virtualized
• Thermodynamic Analogy: Use of Energy (supplied
by servers) to Suppress Entropy
PondStore
Prototype
First Implementation [Java]:
• Event-driven state-machine model
– 150,000 lines of Java code and growing
• Included Components
– DOLR Network (Tapestry)
• Object location with Locality
• Self-Configuring, Self-Repairing
– Full Write path
• Conflict resolution and Byzantine agreement
– Self-Organizing Second Tier
• Replica Placement and Multicast Tree Construction
– Introspective gathering of tacit info and adaptation
• Clustering, prefetching, adaptation of network routing
– Archival facilities
• Interleaved Reed-Solomon codes for fragmentation
• Independence Monitoring
• Data-Driven Repair
• Downloads available from www.oceanstore.org
Event-Driven Architecture
of an OceanStore Node
• Data-flow style
– Arrows indicate flow of messages
[Figure: event-driven components of an OceanStore node exchanging messages with one another and with the outside world]
• Potential to exploit small multiprocessors at each physical node
First Prototype Works!
• Latest: up to 8 MB/sec (local-area network)
– Biggest constraint: Threshold Signatures
• Still a ways to go, but working
Update Latency
• Cryptography in critical path (not surprising!)
Working Applications
MINO: Wide-Area E-Mail Service
[Figure: MINO architecture. Traditional mail gateways on the local network and the Internet exchange mail with SMTP and IMAP proxies; the proxies use a mail object layer on top of the OceanStore client API, and mailbox replicas are stored as OceanStore objects.]
• Complete mail solution
– Email inbox
– IMAP folders
Riptide:
Caching the Web with OceanStore
Other Apps
• Long-running archive
– Project Segull
• File system support
– NFS with time travel (like VMS)
– Windows Installable file system (soon)
• Anonymous file storage:
– Nemosyne uses Tapestry by itself
• Palm-pilot synchronization
– Palm database as an OceanStore DB
Conclusions
• Exploitation of Complexity
– Large amounts of redundancy and connectivity
– Thermodynamics of systems:
“Stability through Statistics”
– Continuous Introspection
• Help the Infrastructure to Help you
– Decentralized Object Location and Routing (DOLR)
– Object-based Storage
– Self-Organizing redundancy
– Continuous Repair
• OceanStore properties:
– Provides security, privacy, and integrity
– Provides extreme durability
– Lower maintenance cost through redundancy,
continuous adaptation, self-diagnosis and repair
For more info:
http://oceanstore.org
• OceanStore vision paper for ASPLOS 2000
“OceanStore: An Architecture for Global-Scale
Persistent Storage”
• Tapestry algorithms paper (SPAA 2002):
“Distributed Object Location in a Dynamic Network”
• Bloom Filters for Probabilistic Routing
(INFOCOM 2002):
“Probabilistic Location and Routing”
• Upcoming CACM paper (not until February):
– “Extracting Guarantees from Chaos”
Backup Slides
Secure Naming
[Figure: a naming hierarchy of OceanStore directory objects (Foo / Bar / Baz / Myfile) reached from an out-of-band "root link"]
• Naming hierarchy:
– Users map from names to GUIDs via a hierarchy of OceanStore objects (à la SDSI)
– Requires a set of “root keys” to be acquired by the user
Self-Organized
Replication
Effectiveness of second tier
Second Tier Adaptation:
Flash Crowd
• Actual Web Cache running on OceanStore
– Replica 1 far away
– Replica 2 close to most requestors (created t ~ 20)
– Replica 3 close to rest of requestors (created t ~ 40)
Introspective Optimization
• Secondary tier self-organized into
overlay multicast tree:
– Presence of DOLR with locality to suggest placement
of replicas in the infrastructure
– Automatic choice between update vs invalidate
• Continuous monitoring of access patterns:
– Clustering algorithms to discover object relationships
• Clustered prefetching: demand-fetching related objects
• Proactive-prefetching: get data there before needed
– Time series-analysis of user and data motion
• Placement of Replicas to Increase Availability
Statistical Advantage of Fragments
[Figure: time to coalesce (latency, roughly 0-180) vs. number of fragments requested (16-31), TI5000]
• Latency and standard deviation reduced:
– Memory-less latency model
– Rate ½ code with 32 total fragments
Parallel Insertion Algorithms
(SPAA ’02)
• Massive parallel insert is important
– We now have algorithms that handle “arbitrary
simultaneous inserts”
– Construction of nearest-neighbor mesh links
• Log² n message complexity ⇒ fully operational routing mesh
– Objects kept available during this process
• Incremental movement of pointers
• Interesting Issue: Introduction service
– How does a new node find a gateway into the
Tapestry?
Can You Delete (Eradicate) Data?
• Eradication is antithetical to durability!
– If you can eradicate something, then so can someone else!
(denial of service)
– Must have “eradication certificate” or similar
• Some answers:
– Bays: limit the scope of data flows
– Ninja Monkeys: hunt and destroy with certificate
• Related: Revocation of keys
– Need hunt and re-encrypt operation
• Related: Version pruning
– Temporary files: don’t keep versions for long
– Streaming, real-time broadcasts: Keep? Maybe
– Locks: Keep? No, Yes, Maybe (auditing!)
– Every key stroke made: Keep? For a short while?