Pastiche: Making Backup Cheap and Easy

Introduction




Backup is cumbersome and expensive
~$4/GB/Month
Small-scale solutions dominated by
administrative efforts
Large-scale solutions require centralized
management
Pastiche


Observation 1: disks are no longer full
Can use excess capacity for efficient,
effective, and administration-free backup



Use untrusted machines to perform backup
services
Need replication for reliability
Need to balance locality and reliability
Pastiche


Observation 2: Much of the data on a
given machine is not unique
Office 2000: 217 MB footprint


Different installations are largely the same
Exploiting this redundancy can yield storage savings
Pastiche

Built on three pieces of research



Pastry: Peer-to-peer, self-administering,
scalable routing
Content-based indexing: easy discovery of
redundant data
Convergent encryption: use the same
encrypted representation without sharing keys
Challenges




How to discover backup buddies without a
centralized directory?
How can nodes reuse their own state to
back up others?
How can nodes restore files/machines
without requiring administrative
intervention?
How can nodes detect unfaithful buddies?
Basic Idea




Summarize storage content with abstracts
Use abstracts to locate buddies
A skeleton tree is used to represent and
restore an entire file system
Periodic queries of buddies for stored data
Enabling Technologies



Peer-to-peer routing
Content-based indexing
Convergent encryption
Peer-to-Peer Routing


Pastry: scalable, self-organizing, routing
and object location infrastructure
Each node has a nodeID


IDs are uniformly distributed in the ID space
A proximity metric to measure the distance
between two IDs
More on Pastry

Each node maintains three sets of state

Leaf set: closest nodes in terms of nodeIDs
Neighborhood set: closest nodes in terms of the proximity metric
Routing table: used for prefix routing
Prefix Routing




In each step, a node forwards the
message to a node whose nodeID shares
with the destination a prefix that is at least
one digit longer than the prefix shared by
the current nodeID
Destination: 1230
Current NodeID: 1023
Next Hop: 12--
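The forwarding rule on this slide can be sketched in a few lines. This is only an illustration, not Pastry's actual routing code: the function names and the flat `known_nodes` list are hypothetical, and real Pastry also falls back to its leaf set when no node with a longer shared prefix is known.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, dest: str, known_nodes: List[str]) -> Optional[str]:
    """Pick a known node that shares a strictly longer prefix with dest
    than the current node does; return None if there is no such node."""
    have = shared_prefix_len(current, dest)
    best = None
    best_len = have
    for node in known_nodes:
        n = shared_prefix_len(node, dest)
        if n > best_len:
            best, best_len = node, n
    return best


# Slide example: destination 1230, current node 1023, next hop matches 12--.
hop = next_hop("1023", "1230", ["1032", "1201", "3120"])
```

With the made-up routing state above, the only candidate that improves on the one-digit prefix shared by 1023 and 1230 is 1201, which matches the "12--" pattern on the slide.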
Pastiche’s Use of Pastry



Uses two separate Pastry overlay networks
during buddy discovery
Once a node is discovered, traffic is sent
directly via IP
Pastiche adds two mechanisms


Lighthouse sweep to discover distinct Pastry
nodes
Distance metric based on the FS contents
Content-Based Indexing


Goal: identify file regions for sharing
Use Rabin fingerprints



A fingerprint is generated for each
overlapping k-byte substring in a file
If the lower-order bits of a fingerprint match a
predetermined value, that offset is marked as
an anchor
Anchors divide files into chunks; each chunk is
associated with a secure hash value
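A rough sketch of anchor-based chunking follows. It swaps in a simple polynomial rolling hash instead of a true Rabin fingerprint, and the window size, bit mask, and magic value are assumed values chosen only to keep the example small.

```python
import hashlib

WINDOW = 16            # k: bytes per sliding window (assumed value)
MASK = (1 << 6) - 1    # test the low 6 bits, so chunks average ~64 bytes (assumed)
MAGIC = 0              # predetermined value that marks an anchor (assumed)
BASE = 257
MOD = (1 << 61) - 1


def chunk(data: bytes):
    """Split data at content-defined anchors.

    Returns a list of (sha1_hex, chunk_bytes) pairs.
    """
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            # Slide the window: drop the oldest byte's contribution.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + b) % MOD
        # Only full windows may produce an anchor.
        if i + 1 >= WINDOW and (h & MASK) == MAGIC:
            piece = data[start:i + 1]
            chunks.append((hashlib.sha1(piece).hexdigest(), piece))
            start = i + 1
    if start < len(data):
        piece = data[start:]
        chunks.append((hashlib.sha1(piece).hexdigest(), piece))
    return chunks


# Deterministic sample data; inserting one byte at the front shifts every
# offset, yet most chunk hashes are unchanged.
sample = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))
original_ids = {cid for cid, _ in chunk(sample)}
shifted_ids = {cid for cid, _ in chunk(b"X" + sample)}
```

Because anchors depend only on the bytes inside the window, an insertion changes the chunk it lands in while later boundaries line up again. That is what lets two machines with similar but not identical files share chunks.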
Sharing with Confidentiality

Sharing encrypted data without sharing
keys



Need to have a single encrypted
representation
For the ease of comparisons
Use convergent encryption
Convergent Encryption

So…say…how do you share a door without
sharing its corresponding keys?
Convergent Encryption

How about using different safes to store
those keys?
Convergent Encryption

And use different keys to access those
keys
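The door-and-safe analogy can be made concrete with a toy sketch. The chunk is encrypted under a key derived from its own content (the shared "door"), and each owner locks that key in its own "safe" using a private secret. The XOR keystream below is a stand-in built from SHA-256 purely so the example runs with the standard library; it is not a vetted cipher, and all function names are hypothetical.

```python
import hashlib


def _keystream(key: bytes, n: int) -> bytes:
    """Toy keystream: SHA-256 in counter mode. For illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:n])


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def convergent_encrypt(chunk: bytes, owner_secret: bytes):
    """Return (ciphertext, wrapped_key) for one chunk."""
    # The key comes from the content, so equal chunks give equal ciphertexts.
    content_key = hashlib.sha256(chunk).digest()
    ciphertext = _xor(chunk, _keystream(content_key, len(chunk)))
    # Each owner wraps the content key under its own secret (its "safe").
    chunk_id = hashlib.sha256(ciphertext).digest()
    wrapped_key = _xor(content_key,
                       _keystream(owner_secret + chunk_id, len(content_key)))
    return ciphertext, wrapped_key


def convergent_decrypt(ciphertext: bytes, wrapped_key: bytes,
                       owner_secret: bytes) -> bytes:
    chunk_id = hashlib.sha256(ciphertext).digest()
    content_key = _xor(wrapped_key,
                       _keystream(owner_secret + chunk_id, len(wrapped_key)))
    return _xor(ciphertext, _keystream(content_key, len(ciphertext)))


c_alice, k_alice = convergent_encrypt(b"shared chunk", b"alice-secret")
c_bob, k_bob = convergent_encrypt(b"shared chunk", b"bob-secret")
```

Both owners produce the same ciphertext, so a backup node can store it once, while the wrapped keys differ and neither owner needs the other's secret.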
Implications of the Use of
Convergent Encryption

If a backup node is not among the nodes
that hold the data


It cannot decrypt the data
If it is, the backup node knows the client also
stores that data

Information leak vs. storage efficiency
Design

Pastiche data is stored in chunks



Chunk boundaries determined by content-based indexing
Encrypted with convergent encryption
Chunks carry owner lists
Design

When a newly written file is closed, it is
scheduled for chunking



If a chunk already exists, the local host is
added to the owner list
If not, encrypt the chunk and write it out
Chunking and writing deferred to avoid
short-lived files
Design



Chunks are immutable
When a file is written, its set of chunks may
change
A chunk is not deleted until the last
reference to it is removed
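The owner-list bookkeeping described on these Design slides might look roughly like the sketch below. `ChunkStore` here is a hypothetical in-memory class, not the actual chunkstore file system. Encryption is left out, and SHA-1 of the chunk data is assumed as the chunk identifier.

```python
import hashlib
from typing import Dict, Set, Tuple


class ChunkStore:
    """In-memory sketch of immutable chunks that carry owner lists."""

    def __init__(self) -> None:
        # chunk_id -> (chunk data, set of owners referencing it)
        self._chunks: Dict[str, Tuple[bytes, Set[str]]] = {}

    def add(self, owner: str, data: bytes) -> str:
        chunk_id = hashlib.sha1(data).hexdigest()
        if chunk_id in self._chunks:
            # Chunk already exists: just record the additional owner.
            self._chunks[chunk_id][1].add(owner)
        else:
            self._chunks[chunk_id] = (data, {owner})
        return chunk_id

    def remove(self, owner: str, chunk_id: str) -> None:
        entry = self._chunks.get(chunk_id)
        if entry is None:
            return
        entry[1].discard(owner)
        if not entry[1]:
            # Last reference is gone, so the chunk can be deleted.
            del self._chunks[chunk_id]

    def __contains__(self, chunk_id: str) -> bool:
        return chunk_id in self._chunks

    def __len__(self) -> int:
        return len(self._chunks)


store = ChunkStore()
cid = store.add("local-host", b"office dll bytes")
same = store.add("buddy-client", b"office dll bytes")
```

Chunks are never modified in place. A rewritten file simply maps to a different set of chunk IDs, and old chunks disappear once no owner references them.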
Abstracts: Finding Redundancy




An ideal backup buddy is one that holds a
superset of the new machine’s data
To find it, send the full signature (hashes)
of the new node to candidate buddies
However, we need to transfer 1.3MB per
GB of stored data
Solution: Abstracts—transfer only a
random subset of signatures
Compare one disk to another
[Figure: the full signatures (lists of chunk hashes) of Node1 and Node2 shown side by side; Node1's abstract is a small sample of its signature (here 1 and 67) used for the comparison]
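A small sketch of how an abstract could be drawn and used to estimate overlap follows. The function names and the tiny integer "hashes" are illustrative assumptions; real signatures would hold secure hash values.

```python
import random
from typing import Iterable, List, Optional


def make_abstract(signature: Iterable[int], size: int,
                  seed: Optional[int] = None) -> List[int]:
    """Pick a random subset of a node's signature (its chunk hashes)."""
    unique = sorted(set(signature))
    rng = random.Random(seed)
    return rng.sample(unique, min(size, len(unique)))


def coverage(abstract: Iterable[int], candidate_signature: Iterable[int]) -> float:
    """Fraction of the abstract's hashes that the candidate already stores."""
    sample = list(abstract)
    if not sample:
        return 0.0
    held = set(candidate_signature)
    return sum(1 for h in sample if h in held) / len(sample)


# Illustrative values only.
node1_signature = [98, 73, 1, 46, 20, 67, 8, 11, 55]
node2_signature = [98, 73, 1, 46, 67, 20, 8, 11, 55, 26, 7]
node1_abstract = make_abstract(node1_signature, size=2, seed=0)
estimate = coverage(node1_abstract, node2_signature)
```

A candidate buddy only needs the short abstract to report an estimate, so the new node avoids shipping its full signature to every candidate.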
Overlays: Finding a Set of Buddies

A desirable buddy should

Have a substantial overlap
Be physically nearby (with at least one far away
to survive geographically correlated failures)
Applied Use of Pastry

Pastiche uses two Pastry overlays to
facilitate buddy discovery


One for network proximity
One for file system overlap

Coverage—the fraction of overlapping chunks
stored on a site
Security Problems

A malicious node can


Under-report coverage to avoid being chosen
as a buddy
Over-report coverage to attract clients just to
discard their chunks
Backup Protocol


A Pastiche node has full control over the
backup schedule
A snapshot consists of three things



Chunks to be added
Chunks to be removed
Metadata of those chunks
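The three parts of a snapshot can be pictured as a simple diff between the chunk set from the previous backup and the current one. The `Snapshot` type, the `build_snapshot` helper, and the choice to attach metadata only to newly added chunks are assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Set


class Snapshot(NamedTuple):
    add: Set[str]               # chunk IDs the buddy must start storing
    remove: Set[str]            # chunk IDs the buddy may stop storing
    metadata: Dict[str, dict]   # metadata for the added chunks


def build_snapshot(previous: Set[str],
                   current_meta: Dict[str, dict]) -> Snapshot:
    """Diff the previous chunk set against the current chunks and metadata."""
    current = set(current_meta)
    added = current - previous
    removed = previous - current
    return Snapshot(added, removed, {cid: current_meta[cid] for cid in added})


snap = build_snapshot(
    previous={"a", "b"},
    current_meta={"b": {"file": "notes.txt"}, "c": {"file": "report.doc"}},
)
```

Only the difference travels to the buddy, which keeps each backup proportional to what changed rather than to the size of the disk.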
Restoration


A Pastiche node retains its archive
skeleton, so performing partial restores is
easy
To recover the whole machine, a node has
to obtain its root node from one of the
backup machines first…
Detecting Failure and Malice

A node randomly requests data from its
buddies

Can bound probability of having failures and
malicious nodes undetected
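The bound can be illustrated with a short calculation. It assumes each query picks a stored chunk uniformly at random and independently, and that a buddy has silently dropped a fraction f of the chunks. Under those assumptions, the chance that k queries all miss the dropped data is (1 - f)^k. The function names are hypothetical.

```python
import math


def undetected_probability(missing_fraction: float, queries: int) -> float:
    """Chance that `queries` independent random probes all hit intact chunks."""
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    return (1.0 - missing_fraction) ** queries


def queries_needed(missing_fraction: float, max_undetected: float) -> int:
    """Smallest number of probes that pushes the miss chance to the target."""
    if not 0.0 < missing_fraction < 1.0:
        raise ValueError("missing_fraction must be strictly between 0 and 1")
    if not 0.0 < max_undetected < 1.0:
        raise ValueError("max_undetected must be strictly between 0 and 1")
    return math.ceil(math.log(max_undetected) / math.log(1.0 - missing_fraction))


p = undetected_probability(0.5, 3)
k = queries_needed(0.5, 0.01)
```

A buddy that drops only a few chunks takes more probes to catch, so the query rate trades detection speed against network traffic.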
Preventing Greed


Someone can store things everywhere
Need to institute distributed quota


Very difficult
Some proposed solutions

Each node monitors the overall storage costs
imposed by its backup clients

Problem: Sybil attacks (forging many identities that
each consume little storage)
Preventing Greed

Force each node to solve puzzles
proportional to storage consumption

Problems:
Needlessly expensive
Storage is traded against something other than
storage
Heterogeneous computing power

Preventing Greed

Electronic currency

Problems:
Need to add atomic currency transactions
Complicated

Implementation


Chunkstore file system
Backup daemon
Performance Overhead
[Figure: Chunkstore vs. Ext2; y-axis: Seconds (0 to 60); x-axis: Phase (mkdir, cp, scandir, cat, make, total)]
The Chance of Finding Buddies
[Figure: y-axis: Expected # of Buddies (0 to 18); x-axis: Installation Popularity (30%, 20%, 10%, 5%, 1%)]