A LOW-BANDWIDTH NETWORK FILE SYSTEM

A. Muthitacharoen, MIT
B. Chen, MIT
D. Mazieres, New York U

Highlights
• A file system for slow or wide-area networks
• Exploits similarities between files or versions of the same file
  – Avoids sending data that can be found in the server’s file system or the client’s cache
• Also uses conventional compression and caching
• Requires 90% less bandwidth than traditional network file systems

Working on slow networks
• Can work with local copies
  – Must then worry about update conflicts
• Can use remote login
  – Only for text-based applications
• Should use instead a low-bandwidth file system
  – Better than remote login
  – Must then deal with issues like big autosaves blocking the editor for the duration of transfer

LBFS (I)
• Client keeps all recently accessed files in its cache
• LBFS exploits cross-file similarities to reduce data transfers between client and server
  – File server divides the files it stores into variable-size chunks
  – Indexes these chunks by their hash values

LBFS (II)
• When transferring a file between the client and the server
  – LBFS identifies the chunks the receiving side already has
  – Only transmits the other chunks
• Provides close-to-open consistency
  – Same as Coda (and newer versions of NFS)

Related work (I)
• AFS used callbacks to reduce network traffic
• Leases are callbacks with an expiration date
• Coda supports slow networks and disconnected operations through optimistic replication
• Bayou and OceanStore investigate conflict resolution for optimistic updates
• Lee et al.
  have extended Coda to support operation-based updates

Related work (II)
• Spring and Wetherall use large client and server caches to eliminate redundant network traffic:
  – Can send the address of data already in the receiver's cache rather than the data themselves
• Rsync exploits similarities between directory trees containing similar subtrees

LBFS Design
• Key ideas:
  – Close-to-open consistency
  – Have a large persistent file cache at client
    • IDE disks are now large enough for that
  – Exploits similarities between files (and file versions)
    • Only transmits data chunks containing new data

Identifying Similar Data Chunks
• LBFS uses the collision-resistant property of the SHA-1 hash function
  – Assumes no hash collisions
• Central challenges are
  – Keeping the index a reasonable size
  – Dealing with shifting offsets

The Case against Fixed-Size Blocks
[Figure: file F, and file F after an insertion]
• The two files do not have a single block in common

The Case against “Diffs”
• “Diffs” are used by several UNIX utilities
  – Computed by comparing contents of file with another file
  – Very efficient
• Must know which file(s) to compare to
• Difficult in a file system
  – Obscure naming of editor buffer files and other temp files

Dividing Files into Chunks
• LBFS
  – Only looks for non-overlapping chunks in files
  – Sets chunk boundaries based on file contents
• To divide a file into chunks, LBFS
  – Examines every (overlapping) 48-byte region of the file
  – Uses Rabin’s fingerprints to select boundary regions or breakpoints

Using Rabin’s Fingerprints
• Polynomial representation of data in 48-byte region modulo an irreducible polynomial
• Boundary regions have the 13 least significant bits of their fingerprint equal to an arbitrary predefined value
  – Assuming random data, expected chunk size is 2^13 = 8K
• Method is reasonably fast

How it works
[Figure: a file X partitioned into three chunks]
[Figure: the same file X after one insertion inside the middle chunk, which becomes a new chunk]
• Chunk boundaries are arbitrary and identified by the content of their boundary
  regions

Another way to look at it (I)
• Old File:
  Four score and seven years ago our fathers brought forth, a new country, conceived in liberty, and dedicated to the proposition that "all men are created equal."

Another way to look at it (II)
• New File:
  Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal"

Another way to look at it (III)
• Identify Chunks:
  Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal"

Another way to look at it (IV)
• Send back to server the modified chunk, in compressed form:
  upon this continent, a new nation, conceived in liberty,

Pathological cases
• Having too many chunks requires too much aggregate bandwidth
• Very large chunks would be too difficult to send in a single RPC
• Chunk sizes must be between 2K and 64K
  – May have to artificially insert chunk boundaries when files are full of repeated sequences

The chunk database (I)
• The chunk database
  – Indexes chunks by first 64 bits of SHA-1 hash
  – Maps keys to (file, offset, count) triples
• How to keep this database up to date?
  – Must update it whenever a file is updated
  – Can still have problems with local updates at server site
  – Crashes can corrupt database contents

The chunk database (II)
• Best solution is to tolerate inconsistencies:
  – LBFS recomputes hash of any data chunk before using it
  – Recomputed value is also used to detect collisions
    • Very improbable but still possible

Protocol
• NFS with some changes:
  – Uses leases to implement close-to-open consistency (callbacks with limited lifetime)
  – Practices aggressive pipelining of RPC calls
  – Compresses all RPC traffic

Leases
• Leases are callbacks with
  – A limited lifetime (a few seconds)
  – A guarantee that server will not accept updates during lease lifetime without first notifying client
• Advantages:
  – No problems with lost callbacks
  – Automatically expire when server crashes

An example (I)
[Timeline figure: Alice requests a lease from the server; for the duration of the lease, Alice controls the file; she must then renew it]

An example (II)
[Timeline figure: Alice got a lease and controls the file for its duration; Bob also requests a lease during that time]

An example
• When server receives Bob's request,
  – It will try to contact Alice and break the lease
    • Alice will then flush all the blocks she had updated and invalidate the contents of her cache
  – If Alice does not answer, server must wait until Alice's lease expires

File Consistency
• LBFS
  – Caches entire files
  – Implements close-to-open consistency
• Client
  – Gets a lease the first time a file is opened for read
  – Renews expired leases by requesting file attributes
  – Will then check if cached copy is still current

Reads and writes
• Use additional calls not in NFS
  – GETHASH for reads
  – MKTMPFILE and three others for writes
• Server ensures atomicity of updates by writing them first into a temporary file

Security
• More of an issue than in a well-controlled LAN
• Uses SFS security infrastructure
  – Servers have public keys and authenticate themselves to clients
• New Problem:
  – All LBFS users can check whether file system
    contains a specific chunk of data
  – Requires observing subtle timing differences

Implementation
• Some problems with the way NFS allocates i-node numbers

Evaluation (I)
• Compared upstream and downstream bandwidth of LBFS with those of
  – CIFS (Common Internet File System)
  – NFS
  – AFS
  – LBFS with leases and gzip but w/o chunking
• Downstream traffic benefits most from chunking

Evaluation (II)
[Figure: bar chart of bandwidth per workload]
• First four bars of each workload show upstream bandwidth, second four downstream bandwidth

Conclusions
• LBFS bandwidth usage is one order of magnitude less than that of conventional file systems
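
Appendix: a sketch of content-defined chunking

The chunking and indexing steps described in the slides "Dividing Files into Chunks", "Pathological cases" and "The chunk database (I)" can be illustrated with a short Python sketch. It is not the LBFS implementation. The 48-byte window, the 13-bit test, the 2K/64K limits and the 64-bit SHA-1 prefix come from the slides. Everything else is an assumption made for illustration: a simple polynomial rolling hash stands in for Rabin's fingerprints, MAGIC is an arbitrary choice, and the function names chunk_boundaries, chunk_index and chunks_to_send are hypothetical.

```python
import hashlib

WINDOW = 48               # size of the sliding region, from the slides
MASK = (1 << 13) - 1      # test the 13 low-order bits: expected chunk size 2**13
MAGIC = 0                 # "arbitrary predefined value" (hypothetical choice)
MIN_CHUNK = 2 * 1024      # lower bound on chunk size, from the slides
MAX_CHUNK = 64 * 1024     # upper bound on chunk size, from the slides

# Assumption: a plain polynomial rolling hash replaces Rabin's fingerprints.
BASE = 257
MOD = (1 << 61) - 1
TOP = pow(BASE, WINDOW - 1, MOD)   # weight of the byte that leaves the window


def chunk_boundaries(data):
    """Return the end offset of every chunk of data, in increasing order."""
    ends = []
    start = 0   # offset where the current chunk begins
    h = 0       # rolling hash of the last WINDOW bytes
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Remove the contribution of the byte sliding out of the window.
            h = (h - data[i - WINDOW] * TOP) % MOD
        h = (h * BASE + byte) % MOD
        size = i + 1 - start
        is_breakpoint = i + 1 >= WINDOW and (h & MASK) == MAGIC
        # Honor breakpoints only past MIN_CHUNK; force a boundary at MAX_CHUNK.
        if (is_breakpoint and size >= MIN_CHUNK) or size >= MAX_CHUNK:
            ends.append(i + 1)
            start = i + 1
    if start < len(data):
        ends.append(len(data))   # trailing chunk, possibly shorter than MIN_CHUNK
    return ends


def chunk_index(data):
    """Map the first 64 bits of each chunk's SHA-1 hash to (offset, count)."""
    index = {}
    start = 0
    for end in chunk_boundaries(data):
        key = hashlib.sha1(data[start:end]).digest()[:8]
        index[key] = (start, end - start)
        start = end
    return index


def chunks_to_send(new_data, receiver_index):
    """Return the (offset, count) spans of new_data that the receiver lacks."""
    return [span for key, span in chunk_index(new_data).items()
            if key not in receiver_index]
```

A sender would call chunks_to_send(new_file, chunk_index(old_file)) and transmit only the returned spans. Because the hash depends only on the last 48 bytes, an insertion changes the chunks around it while chunks before it stay identical, and later boundaries tend to line up again. The sketch keeps one (offset, count) pair per key and trusts the 64-bit key, whereas the slides note that LBFS recomputes the hash of a chunk before using it.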