FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment

Introduction
- Farsite: a serverless distributed file system
- Logically functions as a centralized file server
- Designed for desktop environments
- Requires some effort for initial configuration, but little central administration to maintain

Farsite Characteristics
- Peer-to-peer among untrusted machines
- Must provide privacy, integrity, and durability
  • Cryptography
  • Randomized replication
  • Byzantine fault tolerance

Farsite Workloads
- High access locality
- Low update rate
- Sequential accesses with rare concurrency

Administration
- Machine certificates bind machines to their public keys
- User certificates bind users to their public keys
- Namespace certificates bind namespace roots to their managing machines

Design Assumptions
- ~10^5 machines
- All interconnected by a high-bandwidth, low-latency network
- A majority of machines are up most of the time
- Uncorrelated permanent machine failures
- Read-mostly sharing
- Few malicious users

Enabling Technology Trends
- Increase in unused disk capacity
  • In 2000, 58% of disk capacity was unused at Microsoft
  • Can replicate data for reliability
- Decrease in the computational cost of cryptography
  • Can encrypt at 53 MB/sec while disks transfer at 32 MB/sec
  • Can use strong cryptography for security

Namespace Roots
- Allow multiple roots for multiple machines

Trust and Certification
- Based on public-key-cryptographic certificates
- Encrypt(Key_public, text_plain) → text_cipher
- Decrypt(Key_private, text_cipher) → text_plain
- Encrypt(Key_private, text_plain) → text_cipher
- Decrypt(Key_public, text_cipher) → text_plain

Public Key Encryption Basics
- Idea: the public key is published; the private key is the secret
- Encrypt(Key_my_public, "Hi, Andy") → anyone can create it, but only I can read it
- Encrypt(Key_my_private, "I'm Andy") → everyone can read it, but only I can create it

Public Key Encryption Basics
- Encrypt(Key_your_public, Encrypt(Key_my_private, "I know your secret")) → only you can read it, and only I can send it

Basic System
- Every machine plays three roles
  • Client: a machine that interacts with a user
  • Directory group member: one of a set of machines that manage files via a Byzantine-fault-tolerant protocol; every group member holds a replica
  • File host

More on the Basic System
+ Reliability
+ Data integrity
- Performance
  • The Byzantine-fault-tolerant protocol tolerates at most 1/3 faulty replicas, so many replicas are needed
- Privacy
- Storage consumption

System Enhancements
- Local caching: a client can lease a copy of a file
- Encrypt written files with the public keys of all authorized clients
- Offload those files to file hosts
- Store only the content hashes of those files locally
  • Can detect damaged copies
  • Can tolerate n - 1 file host failures

Traditional Byzantine Approach [CL99]
- The client runs a Byzantine-fault-tolerant protocol with the Byzantine servers, which replicate both file data and metadata
- Requires 3f + 1 file copies to handle f failures

Farsite: BFT only for metadata
- The client runs the Byzantine-fault-tolerant protocol only with the directory group, which manages metadata; file data resides on file hosts
- Requires only f + 1 file copies to tolerate f failures

Semantic Differences from NTFS
- Hard limit on concurrent writes
- Soft limit on concurrent reads
- May sometimes supply stale snapshots
- No name-locking on an open file's path

File System Features
- Reliability
- Availability
- Security
- Durability
- Consistency
- Scalability
- Efficiency
- Manageability

Reliability and Availability
- Replication: when a machine is unavailable for an extended period, its functions migrate to other machines
- Caching

Privacy
- File content and metadata are encrypted
- Convergent encryption: Encrypt(Hash_one_way(block_plain), block_plain) → block_cipher
  • (Figure: data blocks → hash → encrypt)
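To make the convergent-encryption step above concrete, here is a minimal Python sketch (not FARSITE's code). It derives each block's key from a one-way hash of the block's own plaintext, so identical blocks encrypt to identical ciphertext and can be coalesced; the function names and the toy XOR keystream standing in for a real block cipher are assumptions of this illustration.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream from repeated SHA-256(key || counter); a stand-in for a real cipher."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt_block(block_plain: bytes) -> tuple[bytes, bytes]:
    """Encrypt a block under the one-way hash of its own plaintext.

    Returns (block_key, block_cipher). Identical plaintext blocks yield
    identical ciphertext, so duplicate blocks can be recognized and
    coalesced without access to the plaintext.
    """
    block_key = hashlib.sha256(block_plain).digest()   # Hash_one_way(block_plain)
    stream = keystream(block_key, len(block_plain))
    block_cipher = bytes(p ^ s for p, s in zip(block_plain, stream))
    return block_key, block_cipher


def convergent_decrypt_block(block_key: bytes, block_cipher: bytes) -> bytes:
    """XOR with the same keystream recovers the plaintext block."""
    stream = keystream(block_key, len(block_cipher))
    return bytes(c ^ s for c, s in zip(block_cipher, stream))


if __name__ == "__main__":
    block = b"example file contents" * 64
    key, cipher = convergent_encrypt_block(block)
    assert convergent_decrypt_block(key, cipher) == block
    # The same plaintext block encrypted elsewhere yields the same
    # ciphertext, which is what makes duplicate coalescing possible.
    assert convergent_encrypt_block(block)[1] == cipher
```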
More on Convergent Encryption
- Block hashes are used to identify identical block contents
- Block-level encryption allows block-level changes without re-encrypting the entire file

More on Convergent Encryption
- Encrypt(Key_file, file_hashes_plain) → file_hashes_cipher
  • (Figure: the block hashes are encrypted under the file key)

More on Convergent Encryption
- Encrypt(Key_client1_public, Key_file) → Key_file_cipher1
- Encrypt(Key_client2_public, Key_file) → Key_file_cipher2
- …
- Store both the encrypted file and the encrypted keys

Directories
- Also encrypted
- Use exclusive encryption
  • Prevents a malicious client from encrypting a syntactically illegal name

Integrity
- Use hash trees to compare files (see the sketch at the end of this section)
  • If the roots match, the two files are identical
  • If not, compare the hashes at the next level down until the discrepancy is located
- The cost of an in-place update is logarithmic in the file size
- Verifying the integrity of individual blocks takes linear time

Durability
- Updates are logged and compressed locally
- The log is pushed to the directory group periodically and when a lease is recalled
- Each log entry is verified

Consistency
- Control can be loaned to clients
  • Content leases
  • Name leases
  • Mode leases
  • Access leases

Data Consistency
- Content leases
  • Read/write
  • Read-only: assures no stale data
- Single-writer, multiple-reader semantics
- A lease is held until it expires or is recalled
- A lease can cover a file, a directory, or a whole tree

Namespace Consistency
- Name leases
  • Allow a client to create a file name
  • Allow a client to create a directory and its files and subdirectories

Windows File-Sharing Semantics
- Mode leases
  • Read, write, delete, exclude-read, exclude-write, exclude-delete

Windows Deletion Semantics
- Open the file, mark it for deletion, close it
- A file is not deleted until the last close
- Access leases
  • Public: the lease holder has the file open
  • Protected: no other client will be granted access without first contacting the lease holder
  • Private: no other client holds any access lease on the file

Scalability
- Hint-based pathname translation
- Caching
- Delayed directory-change notification

Space Efficiency
- Reclaim space from duplicate files
  • Workgroup-shared documents
  • Multiple copies of common applications
- Can save about 50% of the storage requirement
- Based on hash comparisons

Time Efficiency
- Insert a delay between a file's creation and its replication
  • Many files are expected to be deleted shortly after creation
  • Reduces network traffic

Local-Machine Administration
- Machine replacement is a special case of hardware failure
- Little need for backup

Performance Measurements
- Used only five machines, with only 1 hour of file-system trace (450,164 file operations)
- 2 to 4 times as long as NTFS for reads, writes, and closes
- 9 times as long for opens
- 20 times as long for metadata accesses
- 5.5 times slower I/O latencies
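The Integrity slide above uses hash trees to compare files; the following is a minimal Python sketch of that idea, assuming SHA-256, a simple binary tree, and two files with the same number of blocks. The function names and block contents are illustrative, not FARSITE's; the comparison descends only into subtrees whose hashes disagree, so a differing block is located with a logarithmic number of comparisons.

```python
import hashlib


def block_hashes(blocks):
    """Leaf level of the tree: one hash per file block."""
    return [hashlib.sha256(b).digest() for b in blocks]


def build_levels(leaves):
    """Build every level of a binary hash tree; levels[0] is the leaves, levels[-1] the root."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        parents = []
        for i in range(0, len(prev), 2):
            right = prev[i + 1] if i + 1 < len(prev) else b""  # pad an odd-sized level
            parents.append(hashlib.sha256(prev[i] + right).digest())
        levels.append(parents)
    return levels


def find_differing_block(blocks_a, blocks_b):
    """Compare two files with the same number of blocks.

    If the root hashes match, the files are identical (returns None).
    Otherwise, descend only into the child subtree whose hash differs.
    """
    la = build_levels(block_hashes(blocks_a))
    lb = build_levels(block_hashes(blocks_b))
    if la[-1] == lb[-1]:
        return None  # identical roots => identical files
    idx = 0
    for level in range(len(la) - 1, 0, -1):
        left = 2 * idx
        # If the left children match, the mismatch must be in the right child.
        idx = left if la[level - 1][left] != lb[level - 1][left] else left + 1
    return idx  # index of the leftmost differing block


if __name__ == "__main__":
    a = [b"block0", b"block1", b"block2", b"block3"]
    b = [b"block0", b"block1", b"CHANGED", b"block3"]
    assert find_differing_block(a, a) is None
    assert find_differing_block(a, b) == 2
```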