Large-Scale Sharing: GFS and PAST
Mahesh Balakrishnan

Distributed File Systems
- Traditional definition: data and/or metadata stored at remote locations, accessed by clients over the network.
- Various degrees of centralization, from NFS to xFS.

GFS and PAST
- Unconventional, specialized functionality
- Large-scale in both data and number of nodes

The Google File System
- Designed specifically for Google's backend needs
- Web spiders append to huge files
- Application data patterns: multiple producer / multiple consumer, many-way merging

GFS vs. Traditional File Systems: Design Space Coordinates
- Commodity components
- Very large files (multi-GB)
- Large sequential accesses
- Co-design of applications and file system
- Small files and random reads/writes are supported, but not efficiently

GFS Architecture
- Interface:
  - Usual operations: create, delete, open, close, read, write
  - Special operations: snapshot, record append
- Files are divided into fixed-size chunks
- Each chunk is replicated on multiple chunkservers
- A single master maintains all metadata
- Master, chunkservers, and clients run as user-level processes on commodity Linux workstations

Client File Request
- Client translates the byte offset within the file into a chunk index (offset divided by the fixed chunk size)
- Client sends <filename, chunk index> to the master
- Master returns the chunk handle and the locations of the chunkservers holding its replicas

Design Choices: Master
- A single master maintains all metadata ...
  - Simple design
  - Global decision making for chunk replication and placement
  - Bottleneck? Single point of failure?

Design Choices: Master
- A single master maintains all metadata ... in memory!
  - Fast master operations
  - Allows background scans of the entire metadata state
  - Memory limit? Fault tolerance?

Relaxed Consistency Model
- A file region is Consistent if all clients see the same data, regardless of which replica they read from.
- A region is Defined after a mutation if it is consistent and all clients see exactly what the mutation wrote.
- Ordering of concurrent mutations: for each chunk's replica set, the master grants one replica a primary lease; the primary decides the order of mutations and forwards it to the other replicas.

Anatomy of a Mutation
1. Client asks the master which chunkserver holds the lease for the chunk and where the other replicas are.
2. Master replies with the identity of the primary and the locations of the secondary replicas.
3. Client pushes the data to all replicas, in a chain.
4. Client sends the write request to the primary; the primary assigns the write a sequence number and applies it locally.
5. Primary forwards the write request to the secondary replicas, which apply it in sequence-number order.
6. Secondaries reply to the primary indicating completion.
7. Primary replies to the client.

Connection with Consistency Model
- If a secondary replica encounters an error while applying a write (step 5), the region is left Inconsistent.
- Client code breaks a single large write into multiple small writes; with concurrent writers the region ends up Consistent but Undefined.
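
A minimal sketch of the ordering rule behind steps 4 and 5 above, in Python. The class names (Replica, Primary) and the in-memory "applied" list are illustrative stand-ins for real chunkserver state, not the paper's implementation; the only point is that the lease-holding primary picks a serial order and every replica applies mutations in that order.

    class Replica:
        """A chunk replica that applies mutations in serial-number order."""
        def __init__(self):
            self.applied = []                   # (serial, data) pairs, in order

        def apply(self, serial, data):
            self.applied.append((serial, data)) # in reality: write data into the chunk


    class Primary(Replica):
        """The replica currently holding the master-granted lease for this chunk."""
        def __init__(self, secondaries):
            super().__init__()
            self.secondaries = secondaries      # stubs for the other replicas
            self.next_serial = 0

        def handle_write(self, data):
            # Step 3 already pushed the data to all replicas; it is passed
            # directly here to keep the sketch self-contained.
            serial = self.next_serial           # step 4: choose a place in the mutation order
            self.next_serial += 1
            self.apply(serial, data)            # step 4: apply locally
            failed = []
            for r in self.secondaries:          # step 5: same order at every secondary
                try:
                    r.apply(serial, data)
                except Exception:               # a failure here leaves the region
                    failed.append(r)            # inconsistent until the client retries
            return failed                       # step 7: report success/failure to the client


    # Usage: one primary, two secondaries; successive writes get a single global order.
    secondaries = [Replica(), Replica()]
    primary = Primary(secondaries)
    primary.handle_write(b"record A")
    primary.handle_write(b"record B")
    assert secondaries[0].applied == secondaries[1].applied == primary.applied

Because every replica applies mutations in the serial order chosen by the primary, concurrent writers see one consistent mutation order; a failure partway through the forwarding loop is what leaves a region inconsistent until the client retries.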
Special Functionality
- Atomic Record Append
  - The primary appends the record to its own replica, then tells the other replicas to write it at that same offset.
  - If a secondary replica fails to write the data (step 5), the client retries: successful replicas end up with duplicate records, failed ones with padding.
  - The region is Defined where the append succeeded and Inconsistent where it failed.
- Snapshot
  - Copy-on-write: chunks are copied lazily, with the new copy created on the same chunkserver as the original.

Master Internals
- Namespace management
- Replica placement
- Chunk creation, re-replication, rebalancing
- Garbage collection
- Stale replica detection

Dealing with Faults
- High availability
  - Fast master and chunkserver recovery
  - Chunk replication
  - Master state replication: read-only shadow masters
- Data integrity
  - Each chunk is broken into 64 KB blocks, each with a 32-bit checksum
  - Checksums are kept in memory and logged to disk
  - Checksumming is optimized for appends: the checksum of the last partial block is updated incrementally, so existing data need not be read and verified

Micro-benchmarks

Storage Data for "Real" Clusters

Performance

Workload Breakdown
- % of operations of a given size
- % of bytes transferred for a given operation size

GFS: Conclusion
- Very application-specific: more engineering than research

PAST
- Internet-based P2P global storage utility
  - Strong persistence
  - High availability
  - Scalability
  - Security
- Not a conventional file system:
  - Files have unique IDs
  - Clients can insert and retrieve files
  - Files are immutable

PAST Operations
- Nodes have random, unique nodeIds
- No searching, directory lookup, or key distribution
- Supported operations:
  - Insert(name, key, k, file) -> fileId: stores the file on the k nodes whose nodeIds are closest to the fileId
  - Lookup(fileId) -> file
  - Reclaim(fileId, key)

Pastry
- P2P routing substrate
- route(key, msg): routes the message to the node whose nodeId is numerically closest to the key, in fewer than log_{2^b}(N) steps
- Per-node routing state: (2^b - 1) * log_{2^b}(N) + 2l entries
  - b: trades per-node state against the number of routing hops
  - l: failure tolerance; delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail simultaneously

Pastry Node State (example node 10233102)
- Leaf set: the |L|/2 numerically larger and |L|/2 numerically smaller nodeIds
- Routing table entries
- Neighborhood set: the |M| closest nodes

PAST Operations / Security
- Insert:
  - A certificate is created containing the fileId, a hash of the file contents, and the replication factor, and is signed with the client's private key.
  - The file and certificate are routed through Pastry.
  - The first node among the k closest accepts the file and forwards it to the other k-1.
- Security: smartcards
  - Hold public/private key pairs
  - Generate and verify certificates
  - Ensure the integrity of nodeId and fileId assignments

Storage Management: Design Goals
- High global storage utilization
- Graceful degradation near maximum utilization
- PAST therefore tries to:
  - Balance free storage space among nodes
  - Maintain the invariant that each file is replicated on the k nodes closest to its fileId

Storage Load Imbalance
- Variance in the number of files assigned to a node
- Variance in the size distribution of inserted files
- Variance in the storage capacity of PAST nodes

Storage Management
- Large-capacity storage nodes take on multiple nodeIds
- Replica diversion:
  - If node A cannot store a file, it stores a pointer to the file, which is held instead by a leaf-set node B that is not among the k closest.
  - What if A or B fails? A duplicate pointer is kept at the (k+1)-closest node.
  - Policies for diverting and accepting replicas: thresholds tpri and tdiv on the ratio of file size to free space.
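
The tpri/tdiv acceptance policy can be stated in a few lines. Below is a minimal Python sketch, assuming the rule implied by the slide above (reject a file when its size divided by the node's free space exceeds the threshold) and using the threshold values from the evaluation slides as defaults; the function name and defaults are illustrative.

    def accepts_file(file_size, free_space, diverted, t_pri=0.1, t_div=0.05):
        """PAST-style acceptance test for an incoming replica (illustrative sketch).

        A node among the k numerically closest uses the looser threshold t_pri;
        a node asked to hold a diverted replica uses the stricter t_div, so it
        keeps more headroom for the primary replicas it is responsible for.
        """
        if free_space <= 0:
            return False
        threshold = t_div if diverted else t_pri
        return file_size / free_space <= threshold


    # With 100 MB free, an 8 MB file (ratio 0.08) is accepted as a primary
    # replica (0.08 <= t_pri) but rejected as a diverted one (0.08 > t_div).
    print(accepts_file(file_size=8_000_000, free_space=100_000_000, diverted=False))  # True
    print(accepts_file(file_size=8_000_000, free_space=100_000_000, diverted=True))   # False

Lowering either threshold pushes large files away from nearly full nodes, which improves global utilization but increases diversions and, near capacity, insertion failures; this is the trade-off the performance slides below measure.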
File Diversion
- If an insert fails, the client retries it with a different fileId.

Storage Management
- Maintaining the replication invariant under node failures and joins

Caching
- The k replicas in PAST exist for availability
- Extra cached copies are stored to reduce client latency and network traffic
- Unused disk space is used for the cache
- Greedy Dual-Size replacement policy (see the sketch at the end of this section)

Performance
- Workloads: 8 web proxy logs; combined file systems
- k = 5, b = 4, number of nodes = 2250
- Without replica and file diversion: 51.1% of insertions failed, 60.8% global utilization
- Node storage sizes drawn from 4 normal distributions

Effect of Storage Management

Effect of tpri
- tdiv fixed at 0.05, tpri varied
- Lower tpri: better utilization, but more insertion failures

Effect of tdiv
- tpri fixed at 0.1, tdiv varied
- Trend similar to tpri

File and Replica Diversions
- Ratio of replica diversions vs. utilization
- Ratio of file diversions vs. utilization

Distribution of Insertion Failures
- File system trace
- Web logs trace

Caching

Conclusion
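
The Caching slides above name the Greedy Dual-Size replacement policy. Below is a minimal, illustrative Python sketch of that policy under the standard formulation (weight H = inflation value L plus cost divided by size, evict the minimum-H file, raise L to the victim's H); the class and method names are my own, and cost defaults to 1 so the policy simply favors keeping small files. It is not PAST's actual cache implementation.

    import heapq

    class GreedyDualSizeCache:
        """Sketch of GreedyDual-Size eviction for an opportunistic file cache."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.used = 0
            self.inflation = 0.0   # L: raised to the H value of each evicted file
            self.files = {}        # file_id -> (size, H)
            self.heap = []         # (H, file_id) candidates; stale entries are skipped

        def access(self, file_id, size, cost=1.0):
            """Cache file_id (or refresh it on a hit), evicting minimum-H files as needed."""
            if size > self.capacity:
                return False
            if file_id in self.files:                     # hit: drop old entry, re-insert below
                self.used -= self.files.pop(file_id)[0]
            while self.used + size > self.capacity and self.heap:
                h, victim = heapq.heappop(self.heap)
                if victim in self.files and self.files[victim][1] == h:
                    self.inflation = h                    # age all remaining files implicitly
                    self.used -= self.files.pop(victim)[0]
            h = self.inflation + cost / size              # small files get larger H, so they stay
            self.files[file_id] = (size, h)
            heapq.heappush(self.heap, (h, file_id))
            self.used += size
            return True


    # Usage: inserting fileC forces an eviction; the largest file (fileB) goes first.
    cache = GreedyDualSizeCache(capacity=100)
    cache.access("fileA", size=10)
    cache.access("fileB", size=60)
    cache.access("fileC", size=50)
    assert "fileB" not in cache.files and "fileA" in cache.files and "fileC" in cache.files

With cost fixed at 1, the policy prefers to keep many small files cached near their clients, which matches the goal stated above of cutting client latency and network traffic using otherwise unused disk space.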