Pastiche: Making Backup Cheap and Easy

Introduction
- Backup is cumbersome and expensive
  - ~$4/GB/Month
- Small-scale solutions are dominated by administrative effort
- Large-scale solutions require centralized management

Pastiche
- Observation 1: disk is no longer full
  - Can use excess capacity for efficient, effective, and administration-free backup
  - Use untrusted machines to perform backup services
  - Need replication for reliability
  - Need to balance locality and reliability

Pastiche
- Observation 2: much of the data on a given machine is not unique
  - Office 2000: 217 MB footprint
  - Different installations are largely the same
  - Exploiting this redundancy can achieve storage savings

Pastiche
- Built on three pieces of research
  - Pastry: peer-to-peer, self-administering, scalable routing
  - Content-based indexing: easy discovery of redundant data
  - Convergent encryption: use the same encrypted representation without sharing keys

Challenges
- How to discover backup buddies without a centralized directory?
- How can nodes reuse their own state to back up others?
- How can nodes restore files/machines without requiring administrative intervention?
- How can nodes detect unfaithful buddies?
Basic Idea
- Summarize storage content with abstracts
- Use abstracts to locate buddies
- A skeleton tree is used to represent and restore an entire file system
- Periodically query buddies for stored data

Enabling Technologies
- Peer-to-peer routing
- Content-based indexing
- Convergent encryption

Peer-to-Peer Routing
- Pastry: scalable, self-organizing routing and object location infrastructure
- Each node has a nodeID
- IDs are uniformly distributed in the ID space
- A proximity metric measures the distance between two nodes

More on Pastry
- Each node maintains three sets of state
  - Leaf set: closest nodes in terms of nodeIDs
  - Neighborhood set: closest nodes in terms of the proximity metric
  - Routing table: prefix routing

Prefix Routing
- In each step, a node forwards the message to a node whose nodeID shares with the destination a prefix that is at least one digit longer than the prefix shared by the current nodeID
  - Destination: 1230
  - Current NodeID: 1023
  - Next Hop: 12--

Pastiche’s Use of Pastry
- Uses two separate Pastry overlay networks during buddy discovery
- Once a node is discovered, traffic is sent directly via IP
- Pastiche adds two mechanisms
  - Lighthouse sweep to discover distinct Pastry nodes
  - Distance metric based on the FS contents

Content-Based Indexing
- Goal: identify file regions for sharing
- Use Rabin fingerprints
  - A fingerprint is generated for each overlapping k-byte substring in a file
  - If the low-order bits of a fingerprint match a predetermined value, that offset is marked as an anchor
  - Anchors divide files into chunks; each chunk is associated with a secure hash value

Sharing with Confidentiality
- Sharing encrypted data without sharing keys
- Need to have a single encrypted representation
  - For ease of comparison
- Use convergent encryption

Convergent Encryption
- So…say…how do you share a door without sharing its corresponding keys?

Convergent Encryption
- How about using different safes to store those keys?
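Note: the door-and-safe picture can be sketched in code. The following is a minimal illustration, not Pastiche's implementation. It assumes the chunk key is the SHA-256 hash of the plaintext and that a chunk is named by the hash of its ciphertext. The keystream cipher and the wrap_key/unwrap_key helpers are hypothetical stand-ins chosen so the example needs only the standard library; a real system would use a vetted cipher.

```python
import hashlib
from typing import Tuple


def _keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only, not a vetted cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def _xor(data: bytes, stream: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, stream))


def convergent_encrypt(chunk: bytes) -> Tuple[bytes, bytes, str]:
    """Return (key, ciphertext, chunk_id) for a plaintext chunk."""
    key = hashlib.sha256(chunk).digest()  # the key is derived from the content itself
    ciphertext = _xor(chunk, _keystream(key, len(chunk)))
    chunk_id = hashlib.sha256(ciphertext).hexdigest()  # assumed public name of the chunk
    return key, ciphertext, chunk_id


def convergent_decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Recover the plaintext; only someone who holds the chunk key can do this."""
    return _xor(ciphertext, _keystream(key, len(ciphertext)))


def wrap_key(owner_secret: bytes, chunk_id: str, key: bytes) -> bytes:
    """The 'safe': each owner locks the chunk key under its own private secret."""
    return _xor(key, _keystream(owner_secret + chunk_id.encode(), len(key)))


def unwrap_key(owner_secret: bytes, chunk_id: str, wrapped: bytes) -> bytes:
    """Open the safe again (XOR with the same keystream is its own inverse)."""
    return wrap_key(owner_secret, chunk_id, wrapped)


# Two owners of the same chunk produce the same ciphertext (the shared door) ...
chunk = b"library bytes that many installations have in common"
key_a, ct_a, id_a = convergent_encrypt(chunk)
key_b, ct_b, id_b = convergent_encrypt(chunk)
# ... but keep the chunk key in separate safes, locked with separate secrets.
safe_a = wrap_key(b"secret-of-node-a", id_a, key_a)
safe_b = wrap_key(b"secret-of-node-b", id_b, key_b)
```

Because the key depends only on the content, both nodes can recognize and share a single stored copy of the ciphertext without ever exchanging keys.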
Convergent Encryption
- And use different keys to access those keys

Implications of the Use of Convergent Encryption
- If a backup node is not among the participants sharing a chunk
  - It cannot decrypt the data
- Otherwise, the backup node knows that the other node also stores that data
- Information leak vs. storage efficiency

Design
- Pastiche data is stored in chunks
  - Chunk boundaries are determined by content-based indexing
  - Encrypted with convergent encryption
  - Chunks carry owner lists

Design
- When a newly written file is closed, it is scheduled for chunking
  - If a chunk already exists, the local host is added to the owner list
  - If not, encrypt the chunk and write it out
- Chunking and writing are deferred to avoid processing short-lived files

Design
- Chunks are immutable
- When a file is written, its set of chunks may change
- A chunk is not deleted until the last reference to it is removed

Abstracts: Finding Redundancy
- An ideal backup buddy is one that holds a superset of the new machine’s data
- To find it, send the full signature (hashes) of the new node to candidate buddies
- However, this requires transferring 1.3 MB per GB of stored data
- Solution: abstracts, which transfer only a random subset of signatures

Compare one disk to another
[Figure: Node1 signature and Node2 signature shown as lists of chunk hashes; the Node1 abstract is a sampled subset (1, 67)]

Overlays: Finding a Set of Buddies
- A desirable buddy should
  - Have a substantial overlap
  - Be physically nearby (with at least one far away to survive geographically correlated failures)

Applied Use of Pastry
- Pastiche uses two Pastry overlays to facilitate buddy discovery
  - One for network proximity
  - One for file system overlap
- Coverage: the fraction of overlapping chunks stored on a site

Security Problems
- A malicious node can
  - Under-report coverage to avoid being chosen as a buddy
  - Over-report coverage to attract clients just to discard their chunks

Backup Protocol
- A Pastiche node has full control over the backup schedule
- A snapshot
  consists of three things:
    - Chunks to be added
    - Chunks to be removed
    - Metadata of those chunks

Restoration
- A Pastiche node retains its archive skeleton, so performing partial restores is easy
- To recover the whole machine, a node has to obtain its root node from one of the backup machines first…

Detecting Failure and Malice
- A node randomly requests data from its buddies
- Can bound the probability that failures and malicious nodes go undetected

Preventing Greed
- Someone can store things everywhere
- Need to institute a distributed quota
  - Very difficult
- Some proposed solutions
  - Each node monitors the overall storage costs imposed by its backup clients
  - Problem: Sybil attacks (forging many entities that each consume little storage)

Preventing Greed
- Force each node to solve puzzles proportional to its storage consumption
- Problems:
  - Needlessly expensive
  - Storage is traded against something other than storage
  - Heterogeneous computing power

Preventing Greed
- Electronic currency
- Problems:
  - Need to add atomic currency transactions
  - Complicated

Implementation
- Chunkstore file system
- Backup daemon

Performance Overhead
[Figure: time in seconds for each phase (mkdir, cp, scandir, cat, make, total), Chunkstore vs. Ext2]

The Chance of Finding Buddies
[Figure: expected number of buddies vs. installation popularity (30%, 20%, 10%, 5%, 1%)]
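Note: the bound mentioned under Detecting Failure and Malice can be worked out with a short sketch. It assumes each probe picks a stored chunk independently and uniformly at random, and it reduces "the buddy returns the requested data" to a set-membership test. The function names are hypothetical and the model is a simplification, not the exact analysis used by Pastiche.

```python
import random
from typing import Sequence, Set


def miss_probability(dropped_fraction: float, probes: int) -> float:
    """Chance that every one of `probes` random requests hits a chunk the buddy still holds."""
    return (1.0 - dropped_fraction) ** probes


def probes_needed(dropped_fraction: float, max_miss: float) -> int:
    """Smallest number of probes that brings the miss probability down to max_miss or less."""
    if not 0.0 < dropped_fraction <= 1.0:
        raise ValueError("dropped_fraction must be in (0, 1]")
    if not 0.0 < max_miss < 1.0:
        raise ValueError("max_miss must be in (0, 1)")
    probes = 0
    while miss_probability(dropped_fraction, probes) > max_miss:
        probes += 1
    return probes


def spot_check(chunk_ids: Sequence[str], buddy_holds: Set[str],
               probes: int, rng: random.Random) -> bool:
    """Return True if at least one randomly probed chunk is missing at the buddy."""
    return any(rng.choice(chunk_ids) not in buddy_holds for _ in range(probes))


# A buddy that silently discarded 10% of the chunks:
# how many random probes keep its chance of escaping detection at or below 1%?
needed = probes_needed(dropped_fraction=0.10, max_miss=0.01)
```

Under this model the miss probability shrinks geometrically with the number of probes, so a modest number of random requests is enough to make large-scale discarding hard to hide.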