The Design and Implementation of a Log-Structured File System

MENDEL ROSENBLUM and JOHN K. OUSTERHOUT
University of California at Berkeley

This paper presents a new technique for disk storage management called a log-structured file system. A log-structured file system writes all modifications to disk sequentially in a log-like structure, thereby speeding up both file writing and crash recovery. The log is the only structure on disk; it contains indexing information so that files can be read back from the log efficiently. In order to maintain large free areas on disk for fast writing, we divide the log into segments and use a segment cleaner to compress the live information from heavily fragmented segments. We present a series of simulations that demonstrate the efficiency of a simple cleaning policy based on cost and benefit. We have implemented a prototype log-structured file system called Sprite LFS; it outperforms current Unix file systems by an order of magnitude for small-file writes while matching or exceeding Unix performance for reads and large writes. Even when the overhead for cleaning is included, Sprite LFS can use 70% of the disk bandwidth for writing, whereas Unix file systems typically can use only 5-10%.

Categories and Subject Descriptors: D.4.2 [Operating Systems]: Storage Management—allocation/deallocation strategies; secondary storage; D.4.3 [Operating Systems]: File Systems Management—file organization, directory structures, access methods; D.4.5 [Operating Systems]: Reliability—checkpoint/restart; D.4.8 [Operating Systems]: Performance—measurements, simulation, operational analysis; H.2.2 [Database Management]: Physical Design—recovery and restart; H.3.2 [Information Systems]: Information Storage—file organization

General Terms: Algorithms, Design, Measurement, Performance

Additional Key Words and Phrases: Disk storage management, fast crash recovery, file system logging, file system organization, file system performance, high write performance, log-structured, Unix

The work described here was supported in part by the National Science Foundation under grant CCR-8900029, and in part by the National Aeronautics and Space Administration and the Defense Advanced Research Projects Agency under contract NAG2-591. Authors' addresses: J. Ousterhout, Computer Science Division, Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720; M. Rosenblum, Computer Science Department, Stanford University, Stanford, CA 94305.

1. INTRODUCTION

Over the last decade CPU speeds have increased dramatically while disk access times have improved only slowly. This trend is likely to continue in the future and it will cause more and more applications to become disk-bound. To lessen the impact of this problem, we have devised a new disk storage management technique called a log-structured file system, which uses disks an order of magnitude more efficiently than current file systems.
Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make the caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. A log-structured file system writes all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by eliminating almost all seeks. The sequential nature of the log also permits much faster crash recovery: current Unix file systems typically must scan the entire disk to restore consistency after a crash, but a log-structured file system need only examine the most recent portion of the log.

The notion of logging is not new, and a number of recent file systems have incorporated a log as an auxiliary structure to speed up writes and crash recovery [8, 9]. However, these systems use the log only for temporary storage; the permanent home for information is in a traditional random-access storage structure on disk. In contrast, a log-structured file system stores data permanently in the log: there is no other structure on disk. The log contains indexing information so that files can be read back with efficiency comparable to current file systems.

For a log-structured file system to operate efficiently, it must ensure that there are always large extents of free space available for writing new data. This is the most difficult challenge in the design of a log-structured file system. In this paper we present a solution based on large extents called segments, where a segment cleaner process continually regenerates empty segments by compressing the live data from heavily fragmented segments. We used a simulator to explore different cleaning policies and discovered a simple but effective algorithm based on cost and benefit: it segregates older, more slowly changing data from younger, rapidly changing data and treats them differently during cleaning.

We have constructed a prototype log-structured file system called Sprite LFS, which is now in production use as part of the Sprite network operating system [17]. Benchmark programs demonstrate that the raw writing speed of Sprite LFS is more than an order of magnitude greater than that of Unix for small files. Even for other workloads, such as those including reads and large-file accesses, Sprite LFS is at least as fast as Unix in all cases but one (files read sequentially after being written randomly). We also measured the long-term overhead for cleaning in the production system. Overall, Sprite LFS permits about 65-75% of a disk's raw bandwidth to be used for writing new data (the rest is used for cleaning). For comparison, Unix file systems can only utilize 5-10% of a disk's raw bandwidth for writing new data; the rest of the time is spent seeking.

The remainder of this paper is organized into six sections. Section 2 reviews the issues in designing file systems for computers of the 1990s. Section 3 discusses the design alternatives for a log-structured file system and derives the structure of Sprite LFS, with particular focus on the cleaning mechanism. Section 4 describes the crash recovery system for Sprite LFS.
Section 5 evaluates Sprite LFS using benchmark programs and long-term measurements of cleaning overhead. Section 6 compares Sprite LFS to other file systems, and Section 7 concludes.

2. DESIGN FOR FILE SYSTEMS OF THE 1990's

File system design is governed by two general forces: technology, which provides a set of basic building blocks, and workload, which determines a set of operations that must be carried out efficiently. This section summarizes technology changes that are underway and describes their impact on file system design. It also describes the workloads that influenced the design of Sprite LFS and shows how current file systems are ill-equipped to deal with the changes.

2.1 Technology

Three components of technology are particularly significant for file system design: processors, disks, and main memory. Processors are significant because their speed is increasing at a nearly exponential rate, and the improvements seem likely to continue through much of the 1990s. This puts pressure on all the other elements of the computer system to speed up as well, so that the system doesn't become unbalanced.

Disk technology is also improving rapidly, but the improvements have been primarily in the areas of cost and capacity rather than performance. There are two components of disk performance: transfer bandwidth and access time. Although both of these factors are improving, the rate of improvement is much slower than for CPU speed. Disk transfer bandwidth can be improved substantially with the use of disk arrays and parallel-head disks [19], but no major improvements seem likely for access time, since it is determined by mechanical motions that are hard to improve. If an application causes a sequence of small disk transfers separated by seeks, then the application is not likely to experience much speedup over the next ten years, even with faster processors.

The third component of technology is main memory, which is increasing in size at an exponential rate. Modern file systems cache recently used file data in main memory, and larger main memories make larger file caches possible. This has two effects on file system behavior. First, larger file caches alter the workload presented to the disk by absorbing a greater fraction of the read requests [2, 18]. Most write requests must eventually be reflected on disk for safety, so disk traffic (and disk performance) will become more and more dominated by writes.

The second impact of large file caches is that they can serve as write buffers where large numbers of modified blocks can be collected before writing any of them to disk. Buffering may make it possible to write the blocks more efficiently, for example by writing them all in a single sequential transfer with only one seek. Of course, write-buffering has the disadvantage of increasing the amount of data lost during a crash. For this paper we will assume that crashes are infrequent and that it is acceptable to lose a few seconds or minutes of work in each crash; for applications that require better crash recovery, nonvolatile RAM may be used for the write buffer.

2.2 Workloads
Several different file system workloads are common in computer applications. One of the most important workloads for file system design is found in office and engineering environments. Office and engineering applications tend to be dominated by accesses to small files; several studies have measured mean file sizes of only a few kilobytes [2, 10, 18, 23]. Small files usually result in small random disk I/Os, and the creation and deletion times for such files are often dominated by updates to the file system "metadata" (the data structures used to locate the attributes and blocks of the file).

Workloads dominated by sequential accesses to large files, such as those found in supercomputing environments, also pose interesting problems, but not for file system software. A number of techniques exist for ensuring that such files are laid out sequentially on disk, so I/O performance tends to be limited by the bandwidth of the I/O and memory subsystems rather than by the file allocation policies. In designing a log-structured file system we decided to focus on the efficiency of small-file accesses and leave it to hardware designers to improve bandwidth for large-file accesses. Fortunately, the techniques used in Sprite LFS work well for large files as well as small ones.

2.3 Problems with Existing File Systems

Current file systems suffer from two general problems that make it hard for them to cope with the technologies and workloads of the 1990s. First, they spread information around the disk in a way that causes too many small accesses. For example, the Berkeley Unix fast file system (Unix FFS) [12] is quite effective at laying out each file sequentially on disk, but it physically separates different files. Furthermore, the attributes ("inode") for a file are separate from the file's contents, as is the directory entry containing the file's name. It takes at least five separate disk I/Os, each preceded by a seek, to create a new file in Unix FFS: two different accesses to the file's attributes plus one access each for the file's data, the directory's data, and the directory's attributes. When writing small files in such a system, less than 5% of the disk's potential bandwidth is used for new data; the rest of the time is spent seeking.

The second problem with current file systems is that they tend to write synchronously: the application must wait for the write to complete, rather than continuing while the write is handled in the background. For example, even though Unix FFS writes file data blocks asynchronously, file system metadata structures such as directories and inodes are written synchronously. For workloads with many small files, the disk traffic is dominated by the synchronous metadata writes. Synchronous writes couple the application's performance to that of the disk and make it hard for the application to benefit from faster CPUs. They also defeat the potential use of the file cache as a write buffer. Unfortunately, network file systems like NFS [21] have introduced additional synchronous behavior where it didn't used to exist. This has simplified crash recovery, but it has reduced write performance.

Throughout this paper we use the Berkeley Unix fast file system (Unix FFS) as an example of current file system design and compare it to log-structured file systems. The Unix FFS design is used because it is well documented in the literature and is used in several popular Unix operating systems. The problems presented in this section are not unique to Unix FFS and can be found in most other file systems.
3. LOG-STRUCTURED FILE SYSTEMS

The fundamental idea of a log-structured file system is to improve write performance by buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in a single disk write operation. The information written to disk in the write operation includes file data blocks, attributes, index blocks, directories, and almost all the other information used to manage the file system. For workloads that contain many small files, a log-structured file system converts the many small synchronous random writes of traditional file systems into large asynchronous sequential transfers that can utilize nearly 100% of the raw disk bandwidth.

Although the basic idea of a log-structured file system is simple, there are two key issues that must be resolved to achieve the potential benefits of the logging approach. The first issue is how to retrieve information from the log; this is the subject of Section 3.1 below. The second issue is how to manage the free space on disk so that large extents of free space are always available for writing new data. This is a much more difficult issue; it is the topic of Sections 3.2-3.6. Table I contains a summary of the data structures used by Sprite LFS to solve the above problems; the data structures are discussed in detail in later sections of the paper.
Table I. Summary of the major data structures stored on disk by Sprite LFS.

Data structure         Purpose                                                                  Location   Section
Inode                  Locates blocks of file, holds protection bits, modify time, etc.         Log        3.1
Inode map              Locates position of inode in log, holds time of last access plus
                       version number.                                                          Log        3.1
Indirect block         Locates blocks of large files.                                           Log        3.1
Segment summary        Identifies contents of segment (file number and offset for each block).  Log        3.2
Segment usage table    Counts live bytes still left in segments, stores last write time for
                       data in segments.                                                        Log        3.6
Superblock             Holds static configuration information such as number of segments
                       and segment size.                                                        Fixed      None
Checkpoint region      Locates blocks of inode map and segment usage table, identifies last
                       checkpoint in log.                                                       Fixed      4.1
Directory change log   Records directory operations to maintain consistency of reference
                       counts in inodes.                                                        Log        4.2

For each data structure the table indicates the purpose served by the data structure in Sprite LFS. The table also indicates whether the data structure is stored in the log or at a fixed position on disk, and where in the paper the data structure is discussed in detail. Inodes, indirect blocks, and superblocks are similar to the Unix FFS data structures with the same names. Note that Sprite LFS contains neither a bitmap nor a free list.

3.1 File Location and Reading

Although the term "log-structured" might suggest that sequential scans are required to retrieve information from the log, this is not the case in Sprite LFS. Our goal was to match or exceed the read performance of Unix FFS. To accomplish this goal, Sprite LFS outputs index structures in the log to permit random-access retrievals. The basic structures used by Sprite LFS are identical to those used in Unix FFS: for each file there exists a data structure called an inode, which contains the file's attributes (type, owner, permissions, etc.) plus the disk addresses of the first ten blocks of the file; for files larger than ten blocks, the inode also contains the disk addresses of one or more indirect blocks, each of which contains the addresses of more data or indirect blocks. Once a file's inode has been found, the number of disk I/Os required to read the file is identical in Sprite LFS and Unix FFS.

In Unix FFS each inode is at a fixed location on disk; given the identifying number for a file, a simple calculation yields the disk address of the file's inode. In contrast, Sprite LFS doesn't place inodes at fixed positions; they are written to the log. Sprite LFS uses a data structure called an inode map to maintain the current location of each inode. Given the identifying number for a file, the inode map must be indexed to determine the disk address of the inode. The inode map is divided into blocks that are written to the log; a fixed checkpoint region on each disk identifies the locations of all the inode map blocks. Fortunately, inode maps are compact enough to keep the active portions cached in main memory: inode map lookups rarely require disk accesses.

Figure 1 shows the disk layouts that would occur in Sprite LFS and Unix FFS after creating two new files in different directories. Although the two layouts have the same logical structure, the log-structured file system produces a much more compact arrangement. As a result, the write performance of Sprite LFS is much better than that of Unix FFS, while its read performance is just as good.

Fig. 1. A comparison between Sprite LFS and Unix FFS. This example shows the modified disk blocks written by Sprite LFS and Unix FFS when creating two single-block files named dir1/file1 and dir2/file2. Each system must write new data blocks and inodes for file1 and file2, plus new data blocks and inodes for the containing directories. Unix FFS requires ten nonsequential writes for the new information (the inodes for the new files are each written twice to ease recovery from crashes), while Sprite LFS performs the operations in a single large write. The same number of disk accesses will be required to read the files in the two systems. Sprite LFS also writes out new inode map blocks to record the new inode locations.
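To make the indexing path of Section 3.1 concrete, the sketch below (an illustration only, not Sprite LFS source code; the dictionary-based disk model and all names are our assumptions) shows how a read resolves a logical file block in an LFS-style layout: the inode map yields the log address of the file's inode, and the inode yields the address of the data block, exactly as in the Unix FFS indexing scheme.

N_DIRECT = 10

class Inode:
    def __init__(self, direct, indirect=None):
        self.direct = direct        # disk addresses of the first ten data blocks
        self.indirect = indirect    # address of an indirect block for larger files, or None

def read_file_block(disk, inode_map, inum, block_no):
    """Return logical block `block_no` of file `inum` via inode map -> inode -> block."""
    inode = disk[inode_map[inum]]          # the inode map gives the inode's location in the log
    if block_no < N_DIRECT:
        return disk[inode.direct[block_no]]
    indirect = disk[inode.indirect]        # indirect block: a list of further block addresses
    return disk[indirect[block_no - N_DIRECT]]

# Toy usage: file 7 has one data block at log address 103 and its inode at address 104.
disk = {103: b"hello", 104: Inode(direct=[103] + [None] * 9)}
inode_map = {7: 104}
assert read_file_block(disk, inode_map, 7, 0) == b"hello"

Because the active portions of the inode map are cached in main memory, the inode_map lookup above normally costs no disk access; the remaining accesses are the same ones Unix FFS would make.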
3.2 Free Space Management: Segments

The most difficult design issue for log-structured file systems is the management of free space. The goal is to maintain large free extents for writing new data. Initially all the free space is in a single extent on disk, but by the time the log reaches the end of the disk the free space will have been fragmented into many small extents corresponding to the files that were deleted or overwritten.

From this point on, the file system has two choices: threading and copying. These are illustrated in Figure 2. The first alternative is to leave the live data in place and thread the log through the free extents. Unfortunately, threading will cause the free space to become severely fragmented, so that large contiguous writes won't be possible and a log-structured file system will be no faster than traditional file systems.

Fig. 2. Possible free space management solutions for log-structured file systems. In a log-structured file system, free space for the log can be generated either by copying the old blocks or by threading the log around the old blocks. The left side of the figure shows the threaded log approach, where the log skips over the active blocks and overwrites blocks of files that have been deleted or overwritten. Pointers between the blocks of the log are maintained so that the log can be followed during crash recovery. The right side of the figure shows the copying scheme, where log space is generated by reading the section of disk after the end of the log and rewriting the active blocks of that section along with the new data into the newly generated space.

The second alternative is to copy live data out of the log in order to leave large free extents for writing. For this paper we will assume that the live data is written back in a compacted form at the head of the log; it could also be moved to another log-structured file system to form a hierarchy of logs, or it could be moved to some totally different file system or archive. The disadvantage of copying is its cost, particularly for long-lived files: in the simplest case where the log works circularly across the disk and live data is copied back into the log, all of the long-lived files will have to be copied in every pass of the log across the disk.

Sprite LFS uses a combination of threading and copying. The disk is divided into large fixed-size extents called segments. Any given segment is always written sequentially from its beginning to its end, and all live data must be copied out of a segment before the segment can be rewritten. However, the log is threaded on a segment-by-segment basis; if the system can collect long-lived data together into segments, those segments can be skipped over so that the data doesn't have to be copied repeatedly. The segment size is chosen large enough that the transfer time to read or write a whole segment is much greater than the cost of a seek to the beginning of the segment. This allows whole-segment operations to run at nearly the full bandwidth of the disk, regardless of the order in which segments are accessed. Sprite LFS currently uses segment sizes of either 512 kilobytes or one megabyte.

3.3 Segment Cleaning Mechanism

The process of copying live data out of a segment is called segment cleaning. In Sprite LFS it is a simple three-step process: read a number of segments into memory, identify the live data, and write the live data back to a smaller number of clean segments. After this operation is complete, the segments that were read are marked as clean, and they can be used for new data or for additional cleaning.

As part of segment cleaning it must be possible to identify which blocks of each segment are live, so that they can be written out again. It must also be possible to identify the file to which each block belongs and the position of the block within the file; this information is needed in order to update the file's inode to point to the new location of the block.
Sprite LFS solves both of these problems by writing a segment summary block as part of each segment. The summary block identifies each piece of information that is written in the segment; for example, for each file data block the summary block contains the file number and block number for the block. Segments can contain multiple segment summary blocks when more than one log write is needed to fill the segment. (Partial-segment writes occur when the number of dirty blocks buffered in the file cache is insufficient to fill a segment.) Segment summary blocks impose little overhead during writing, and they are useful during crash recovery (see Section 4) as well as during cleaning.

Sprite LFS also uses the segment summary information to distinguish live blocks from those that have been overwritten or deleted. Once a block's identity is known, its liveness can be determined by checking the file's inode or indirect block to see if the appropriate block pointer still refers to this block. If it does, then the block is live; if it doesn't, then the block is dead. Sprite LFS optimizes this check slightly by keeping a version number in the inode map entry for each file; the version number is incremented whenever the file is deleted or truncated to length zero. The version number combined with the inode number forms a unique identifier (uid) for the contents of the file. The segment summary block records this uid for each block in the segment; if the uid of a block does not match the uid currently stored in the inode map when the segment is cleaned, the block can be discarded immediately without examining the file's inode.

This approach to cleaning means that there is no free-block list or bitmap in Sprite LFS. In addition to saving memory and disk space, the elimination of these data structures also simplifies crash recovery. If these data structures existed, additional code would be needed to log changes to them and to restore their consistency after crashes.
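Putting the mechanism of this section together, the following sketch (ours, not the Sprite LFS implementation; the record layout and helper names are assumptions) illustrates the three-step cleaning pass together with the uid check described above: blocks whose version no longer matches the inode map are dropped immediately, and the remaining candidates are verified against the inode's block pointers before being rewritten.

from dataclasses import dataclass

@dataclass(frozen=True)
class SummaryEntry:
    addr: int        # disk address of the block inside the segment being cleaned
    inum: int        # file (inode) number the block belongs to
    offset: int      # logical block offset within that file
    version: int     # file version ("uid") recorded when the block was written

def live_entries(summary, inode_map, read_inode):
    """Yield the summary entries of one segment whose blocks are still live.

    inode_map maps inum -> (current version, inode location); read_inode(loc)
    returns a mapping from logical offset -> current disk address for that file.
    """
    for e in summary:
        version, inode_loc = inode_map[e.inum]
        if version != e.version:
            continue            # file deleted or truncated: block is dead, no inode read needed
        if read_inode(inode_loc).get(e.offset) == e.addr:
            yield e             # the inode still points here, so the block is live

def clean(segments, inode_map, read_inode, write_live_block):
    """Three-step cleaning pass: read segments, identify live data, write it back."""
    cleaned = []
    for seg in segments:                     # seg.summary lists the blocks the segment holds
        for e in live_entries(seg.summary, inode_map, read_inode):
            write_live_block(e)              # appends the block to the log and updates the inode
        cleaned.append(seg)                  # afterwards the segment is marked clean for reuse
    return cleaned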
3.4 Segment Cleaning Policies

Given the basic mechanism described above, four policy issues must be addressed:

(1) When should the segment cleaner execute? Some possible choices are for it to run continuously in background at low priority, or only at night, or only when disk space is nearly exhausted.

(2) How many segments should it clean at a time? Segment cleaning offers an opportunity to reorganize data on disk; the more segments cleaned at once, the more opportunities to rearrange.

(3) Which segments should be cleaned? An obvious possibility is the ones that are most fragmented, but this turns out not to be the best choice.

(4) How should the live blocks be grouped when they are written out? One possibility is to try to enhance the locality of future reads, for example by grouping files in the same directory together into a single output segment. Another possibility is to sort the blocks by the time they were last modified and group blocks of similar age together into new segments; we call this approach age sort.

In our work so far we have not methodically addressed the first two of the above policies. Sprite LFS starts cleaning segments when the number of clean segments drops below a threshold value (typically a few tens of segments). It cleans a few tens of segments at a time until the number of clean segments surpasses another threshold value (typically 50-100 clean segments). The overall performance of Sprite LFS does not seem to be very sensitive to the exact choice of the threshold values. In contrast, the third and fourth policy decisions are critically important: in our experience they are the primary factors that determine the performance of a log-structured file system. The remainder of Section 3 discusses our analysis of which segments to clean and how to group the live data.

We use a term called write cost to compare cleaning policies. The write cost is the average amount of time the disk is busy per byte of new data written, including all the cleaning overheads. The write cost is expressed as a multiple of the time that would be required if there were no cleaning overhead and the data could be written at its full bandwidth with no seek time or rotational latency. A write cost of 1.0 is perfect: it would mean that new data could be written at the full disk bandwidth and there is no cleaning overhead. A write cost of 10 means that only one-tenth of the disk's maximum bandwidth is actually used for writing new data; the rest of the disk time is spent in seeks, rotational latency, or cleaning.

For a log-structured file system with large segments, seek and rotational latency are negligible both for writing and for cleaning, so the write cost is the total number of bytes moved to and from the disk divided by the number of those bytes that represent new data. This cost is determined by the utilization (the fraction of data still live) in the segments that are cleaned. In the steady state, the cleaner must generate one clean segment for every segment of new data written. To do this, it reads N segments in their entirety and writes out N*u segments of live data (where u is the utilization of the segments and 0 <= u < 1). This creates N*(1-u) segments of contiguous free space for new data. Thus

    write cost = (total bytes read and written) / (new data written)
               = (read segs + write live + write new data) / (new data written)
               = (N + N*u + N*(1-u)) / (N*(1-u))
               = 2 / (1-u)                                                      (1)

In the above formula we made the conservative assumption that a segment must be read in its entirety to recover the live blocks; in practice it may be faster to read just the live blocks, particularly if the utilization is very low (we haven't tried this in Sprite LFS). If a segment to be cleaned has no live blocks (u = 0) then it need not be read at all and the write cost is 1.0.

Figure 3 graphs the write cost as a function of u. For reference, Unix FFS on small-file workloads utilizes at most 5-10% of the disk bandwidth [16], for a write cost of 10-20 (see Section 5.1 for specific measurements of Sprite LFS and Unix FFS). With logging, delayed writes, and disk request sorting this could probably be improved to about 25% of the bandwidth [24], for a write cost of 4. Figure 3 suggests that the segments cleaned must have a utilization of less than .8 in order for a log-structured file system to outperform the current Unix FFS; the utilization must be less than .5 to outperform an improved Unix FFS.

Fig. 3. Write cost as a function of u for small files. In a log-structured file system, the write cost depends strongly on the utilization of the segments that are cleaned. The more live data in segments cleaned, the more disk bandwidth that is needed for cleaning and not available for writing new data. The figure also shows two reference points: "FFS today," which is our estimate of Unix FFS today, and "FFS improved," which is our estimate of the best performance possible in an improved Unix FFS. Write cost for Unix FFS is not sensitive to the amount of disk space in use.

It is important to note that the utilization discussed above is not the overall fraction of the disk containing live data; it is just the fraction of live blocks in the segments that are cleaned. Variations in file usage will cause some segments to be less utilized than others, and the cleaner can choose the least utilized segments to clean; these will have lower utilizations than the overall average for the disk.

Even so, the performance of a log-structured file system can be improved by reducing the overall utilization of the disk space. With less of the disk in use, the segments that are cleaned will have fewer live blocks, resulting in a lower write cost. Log-structured file systems provide a cost-performance trade-off: if disk space is underutilized, higher performance can be achieved but at a high cost per usable byte; if disk capacity utilization is increased, storage costs are reduced but so is performance. Such a trade-off between performance and space utilization is not unique to log-structured file systems. For example, Unix FFS only allows 90% of the disk space to be occupied by files; the remaining 10% is kept free to allow the space allocation algorithm to operate efficiently.

The key to achieving high performance at low cost in a log-structured file system is to force the disk into a bimodal segment distribution where most of the segments are nearly full, a few are empty or nearly empty, and the cleaner can almost always work with the empty segments. This allows a high overall disk capacity utilization yet provides a low write cost. The following section describes how we achieve such a bimodal distribution in Sprite LFS.
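Before turning to the simulations, note that equation (1) is easy to evaluate directly; the small helper below (our own, illustrative) reproduces the Figure 3 reference points: cleaning segments at a utilization of 0.8 gives a write cost of 10, roughly Unix FFS today, while 0.5 gives 4, roughly an improved FFS.

def write_cost(u):
    """Write cost from equation (1): bytes moved per byte of new data written.

    u is the utilization of the segments being cleaned, 0 <= u < 1.
    """
    if u == 0.0:
        return 1.0               # an empty segment need not be read at all
    # Read N segments, write back N*u of live data, write N*(1-u) of new data,
    # all divided by the N*(1-u) bytes of new data.
    return 2.0 / (1.0 - u)

print(write_cost(0.8))   # 10.0 -- comparable to Unix FFS today
print(write_cost(0.5))   # 4.0  -- comparable to an improved Unix FFS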
3.5 Simulation Results

We built a simple file system simulator so that we could analyze different cleaning policies under controlled conditions. The simulator's model does not reflect actual file system usage patterns (its model is much harsher than reality), but it helped us to understand the effects of random access patterns and locality, both of which can be exploited to reduce the cost of cleaning. The simulator models a file system as a fixed number of 4-kbyte files, with the number chosen to produce a particular overall disk capacity utilization. At each step, the simulator overwrites one of the files with new data, using one of two pseudo-random access patterns:

Uniform: Each file has equal likelihood of being selected in each step.

Hot-and-cold: Files are divided into two groups. One group contains 10% of the files; it is called hot because its files are selected 90% of the time. The other group is called cold; it contains 90% of the files but they are selected only 10% of the time. Within groups each file is equally likely to be selected. This access pattern models a simple form of locality.

In this approach the overall disk capacity utilization is constant and no read traffic is modeled. The simulator runs until all clean segments are exhausted, then simulates the actions of a cleaner until a threshold number of clean segments is available again. In each run the simulator was allowed to run until the write cost stabilized and all cold-start variance had been removed.

Figure 4 superimposes the results from two sets of simulations onto the curves of Figure 3. In the "LFS uniform" simulations the uniform access pattern was used. The cleaner used a simple greedy policy where it always chose the least-utilized segments to clean. When writing out live data the cleaner did not attempt to reorganize the data: live blocks were written out in the same order that they appeared in the segments being cleaned (for a uniform access pattern there is no reason to expect any improvement from reorganization).

Fig. 4. Initial simulation results. The curves labeled "FFS today" and "FFS improved" are reproduced from Figure 3 for comparison. The curve labeled "No variance" shows the write cost that would occur if all segments always had exactly the same utilization. The "LFS uniform" curve represents a log-structured file system with uniform access pattern and a greedy cleaning policy: the cleaner chooses the least-utilized segments. The "LFS hot-and-cold" curve represents a log-structured file system with locality of file access. It uses a greedy cleaning policy and the cleaner also sorts the live data by age before writing it out again. The x-axis is overall disk capacity utilization, which is not necessarily the same as the utilization of the segments being cleaned.

Even with uniform random access patterns, the variance in segment utilizations allows a substantially lower write cost than would be predicted from the overall disk capacity utilization and formula (1). For example, at 75% overall disk capacity utilization, the segments cleaned have an average utilization of only 55%. At overall disk capacity utilizations under 20% the write cost drops below 2.0; this means that some of the cleaned segments have no live blocks at all and hence don't need to be read in.

The "LFS hot-and-cold" curve shows the write cost when there is locality in the access patterns, as described above. The cleaning policy for this curve was the same as for "LFS uniform" except that the live blocks were sorted by age before writing them out again. This means that long-lived (cold) data tends to be segregated in different segments from short-lived (hot) data; we thought that this approach would lead to the desired bimodal distribution of segment utilizations.

Figure 4 shows the surprising result that locality and "better" grouping result in worse performance than a system with no locality! We tried varying the degree of locality (e.g., 95% of accesses to 5% of the data) and found that performance got worse and worse as the locality increased. Figure 5 shows the reason for this nonintuitive result. Under the greedy policy, a segment doesn't get cleaned until it becomes the least utilized of all segments. Thus every segment's utilization eventually drops to the cleaning point, including the cold segments. Unfortunately, the utilization drops very slowly in cold segments, so these segments tend to linger just above the cleaning point for a very long time. The result is that many more segments are clustered around the cleaning point for the hot-and-cold simulations than for the uniform simulations, and the cold segments tie up large numbers of free blocks for long periods of time.

Fig. 5. Segment utilization distributions with greedy cleaner. These figures show distributions of segment utilizations of the disk during the simulation. The distribution is computed by measuring the utilizations of all segments on the disk at the points during the simulation when segment cleaning was initiated. The distribution shows the utilizations of the segments available to the cleaning algorithm. Each of the distributions corresponds to an overall disk capacity utilization of 75%. The "Uniform" curve corresponds to "LFS uniform" in Figure 4 and "Hot-and-cold" corresponds to "LFS hot-and-cold" in Figure 4. Locality causes the distribution to be more skewed towards the utilization at which cleaning occurs; as a result, segments are cleaned at a higher average utilization.

After studying these figures we realized that hot and cold segments must be treated differently by the cleaner. Free space in a cold segment is more valuable than free space in a hot segment because once a cold segment is cleaned it will take a long time before it reaccumulates the unusable free space. Said another way, once the system reclaims the free blocks from a segment with cold data it will get to "keep" them a long time before the cold data becomes fragmented and "takes them back" again. In contrast, it is less beneficial to clean a hot segment because the data will likely die quickly and the free space will rapidly reaccumulate; the system might as well delay the cleaning a while and let more of the blocks die in the current segment. The value of a segment's free space is thus based on the stability of the data in the segment. Unfortunately, the stability cannot be predicted without knowing future access patterns. Using the assumption that the older the data in a segment, the longer it is likely to remain unchanged, the stability can be estimated by the age of the data.

To test this theory we simulated a new policy for selecting segments to clean. The policy rates each segment according to the benefit of cleaning the segment and the cost of cleaning the segment, and chooses the segments with the highest ratio of benefit to cost. The benefit has two components: the amount of free space that will be reclaimed and the amount of time the space is likely to stay free. The amount of free space is just 1-u, where u is the utilization of the segment. We used the most recent modified time of any block in the segment (i.e., the age of the youngest block) as an estimate of how long the space is likely to stay free. The benefit of cleaning is the space-time product formed by multiplying these two components. The cost of cleaning the segment is 1+u (one unit of cost to read the segment, u to write back the live data). Combining all these factors, we get

    benefit / cost = (free space generated * age of data) / cost = ((1-u) * age) / (1+u)

We call this policy the cost-benefit policy; it allows cold segments to be cleaned at a much higher utilization than hot segments.

Fig. 6. Segment utilization distribution with cost-benefit policy. This figure shows the distribution of segment utilizations from the simulation of a hot-and-cold access pattern with 75% overall disk capacity utilization. The "LFS Cost-Benefit" curve shows the segment distribution occurring when the cost-benefit policy is used to select segments to clean and live blocks are grouped by age before being rewritten. Because of this bimodal segment distribution, most of the segments cleaned had utilizations around 15%. For comparison, the distribution produced by the greedy selection policy is shown by the "LFS Greedy" curve reproduced from Figure 5.

We reran the simulations under the hot-and-cold access pattern with the cost-benefit policy and age-sorting of the live data. As can be seen from Figure 6, the cost-benefit policy produced the bimodal distribution of segments that we had hoped for. The cold segments are cleaned at a utilization of about 75%, but the hot segments aren't cleaned until their utilization drops to about 15%. Since 90% of the writes are to hot files, most of the segments cleaned are hot.

Fig. 7. Write cost, including cost-benefit policy. This graph compares the write cost of the greedy policy with that of the cost-benefit policy for the hot-and-cold access pattern. The cost-benefit policy is substantially better than the greedy policy, particularly for disk capacity utilizations above 60%.

Figure 7 shows that the cost-benefit policy reduces the write cost by as much as 50% over the greedy policy, and a log-structured file system out-performs even the best possible Unix FFS at high disk capacity utilizations. We simulated a number of other degrees and kinds of locality and found that the cost-benefit policy gets even better as locality increases.

The simulation experiments convinced us to implement the cost-benefit approach in Sprite LFS. As will be seen in Section 5.2, the behavior of actual file systems in Sprite LFS is even better than predicted in Figure 7.

3.6 Segment Usage Table

In order to support the cost-benefit cleaning policy, Sprite LFS maintains a data structure called the segment usage table. For each segment, the table records the number of live bytes in the segment and the most recent modified time of any block in the segment. These two values are used by the segment cleaner when choosing segments to clean. The values are initially set when the segment is written, and the count of live bytes is decremented when files are deleted or blocks are overwritten. If the count falls to zero, then the segment can be reused without cleaning. The blocks of the segment usage table are written to the log, and the addresses of the blocks are stored in the checkpoint regions (see Section 4 for details).

In order to sort live blocks by age, the segment summary information records the age of the youngest block written to the segment. At present Sprite LFS does not keep modified times for each block in a file; it keeps a single modified time for the entire file. This estimate will be incorrect for files that are not modified in their entirety. We plan to modify the segment summary information to include modified times for each block.
4. CRASH RECOVERY

When a system crash occurs, the last few operations performed on the disk may have left it in an inconsistent state (for example, a new file may have been written without writing the directory containing the file); during reboot the operating system must review these operations in order to correct any inconsistencies. In traditional Unix file systems without logs, the system cannot determine where the last changes were made, so it must scan all of the metadata structures on disk to restore consistency. The cost of these scans is already high (tens of minutes in typical configurations), and it is getting higher as storage systems expand.

In a log-structured file system the locations of the last disk operations are easy to determine: they are at the end of the log. Thus it should be possible to recover very quickly after crashes. This benefit of logs is well known and has been used to advantage both in database systems [6] and in other file systems [8, 9]. Like many other logging systems, Sprite LFS uses a two-pronged approach to recovery: checkpoints, which define consistent states of the file system, and roll-forward, which is used to recover information written since the last checkpoint.

4.1 Checkpoints

A checkpoint is a position in the log at which all of the file system structures are consistent and complete. Sprite LFS uses a two-phase process to create a checkpoint. First, it writes out all modified information to the log, including file data blocks, indirect blocks, inodes, and blocks of the inode map and segment usage table. Second, it writes a checkpoint region to a special fixed position on disk. The checkpoint region contains the addresses of all the blocks in the inode map and segment usage table, plus the current time and a pointer to the last segment written.

During reboot, Sprite LFS reads the checkpoint region and uses that information to initialize its main-memory data structures. In order to handle a crash during a checkpoint operation, there are actually two checkpoint regions and checkpoint operations alternate between them. The checkpoint time is in the last block of the checkpoint region, so if the checkpoint fails the time will not be updated. During reboot, the system reads both checkpoint regions and uses the one with the most recent time.
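A minimal sketch of the reboot-time choice between the two checkpoint regions follows (our illustration; the field names are assumed). Because the checkpoint time is written last, a region whose final write was interrupted by a crash still carries its older time and simply loses the comparison.

from dataclasses import dataclass, field

@dataclass
class CheckpointRegion:
    timestamp: int                       # time of the last completed checkpoint in this region
    last_segment: int                    # pointer to the last segment written at that time
    inode_map_blocks: list = field(default_factory=list)     # log addresses of inode map blocks
    usage_table_blocks: list = field(default_factory=list)   # log addresses of segment usage table blocks

def pick_checkpoint(region_a, region_b):
    """Read both fixed checkpoint regions and use the one with the most recent time."""
    return max(region_a, region_b, key=lambda r: r.timestamp)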
seconds, which checkpointing data while the interval checkpoint periodic time long a short of thirty to as when A recovery; interval amount as well down. writing increases on recovery system shut during alternative a given set a limit the overhead forward An after would is a checkpoint short. intervals system the time uses checkpoints log; the to roll recovery much perform the needed currently at periodic or reduces time improves Sprite checkpoints unmounted has been reducing to written the at maximum is is to checkpoint throughput. Roll-Forward In principle latest it would checkpoint checkpoint. since This the tion last the would presence the checkpoint, This file file automatically then on disk table from the checkpoint after will adjusted fied The to reflect by the The directory entries directory entries is deleted. been the and operation code count the tion for the log entry log for of the directory the for entry named operation appears log; in in the to when not yet log recovered copy of’ the of the segment usage written the will live also of these since data have can left to be be identi- to occur while and These LFS guarantees the location and records that corresponding inode. ACM TransactIons on Computer Systems, Vol. 10, No 1, February 1992. LFS record outputs a includes of the the new collectively directory directory an directory directory), each has versa. the are of file containing or vice within the an inode block Sprite The i-number), number to zero when the written, the position between of the drops inodes, unlink), the consistency count change. entry. before the from inode. version segments a count been and directory or new in the count (name the blocks. restore a crash and Sprite data (both for rename directory the a new the indicates read of the into that segments summary it copy to reflect reference has written block map blocks contains directories each link, the directory inode entry were log). inode a new written informa- segment without of the how inode; between (create, (i-number contents with directory in the it is possible log new older is Each new utilizations overwrites in to that consistency record entry inodes. referring corresponding and inodes the be adjusted of roll-forward inode data utilizations must deletions in to the To restore special they Unfortunately, written The of new issue the the in the a file assumes ignores adjusts that a summary to new data that roll-forward. When for any the after as much segments information refers file’s utilizations file presence final map code it log reading log but to recover updates discovered checkpoint. be zero; roll-forward. LFS roll-forward and the data. the also the uses inode are code the is called file Sprite the is incomplete roll-forward read LFS inode, the simply in order through written blocks by data recovery In operation incorporates If data inode, The the Sprite so that crashes any be lost. scans recently after in instantaneous This of a new system. file’s LFS roll-forward the restart discarding would checkpoint. 
to recover to and result Sprite last During blocks safe checkpoint as possible, after be region the reference called operablock or The Design and Implementation During roll-forward, tency between inode and directory directory and/or cause entries on inodes ries, block inode were inode this the case functions, both the usage directory entry directory log will made appends table blocks only the easy to and reference the changed to the the the counts and that writes can’t be written; addition to an can directo- log is never provide consisbut updates operation In 43 operations inode removed. it ensure appears Roll-forward which be to entry . roll-forward The for used directories them. file is written, from File System a log program include of a new log if operation. segment to creation the not recovery and region is the inodes: to or removed The map, checkpoint operation and to complete to be added completed directory entries to be updated. inodes, a new the directory of a Log-Structured in its atomic other rename operation. The interaction troduced each between additional checkpoint log is must consistent required are began was the being of 1990 it described been used thirty in this paper not yet duction disks use When we system tional file segment and layout that either In code project this FFS FFS or the directory in in- particular, operation the log. This modifications by disks in were while Unix by restore since they but rollpro- all the a log-structured turns addition, than Logging somewhat more include file than out to a tradi- be no complexity the complicated be are The discard elimination consistency. to fall features LFS, and additional the in the system. that LFS has FFS; Sprite it the which of to implement Sprite mid-1990 Since reboot. concerned LFS All seconds) they by partitions, production complicated likely LFS, disk (30 when Sprite and system. in the is no more to [81 are 1989 computing. compensated LFS Sprite different however, is late operating interval we [12]: in implemented more required in Sprite Unix Unix been checkpoint FFS [91 or Cedar five installed reality, but LFS day-to-day checkpoint last Unix policies scans Episode have the cleaner, roll-forward for be substantially than In blocks directory network manage been the system. complicated to a short began might checkpoints LFS. LFS Sprite users has after where directory of Sprite of the forward information and Sprite written. as part about state and log into to prevent implementation has by a inode WITH THE SPRITE operational used the operation issues synchronization 5. EXPERIENCE We directory represent with additional checkpoints the synchronization both more for of the checkpointing the and code [131 fsck file systems like complicated logging the bitmap than and layout code. In the everyday Unix being For use Sprite FFS-like used are example, file not on fast the 20’% faster than Most of the speedup in Sprite LFS. LFS system enough modified SunOS Even using does in not the much The different reason to be disk-bound Andrew the is attributable with feel Sprite. configuration to the synchronous ACM Transactions with benchmark removal writes is the [16], presented of the of Unix to the that users the current Sprite workloads. LFS in than machines Section is only 5.1. synchronous writes FFS, bench- the on Computer Systems, Vol. 10, No. 1, February 1992. 44 M. Rosenblum and J. K. Ousterhout , mark has changes 5.1 a CPU in the utilization disk of over storage 80%, limiting the speedup possible from management. 
Micro-Benchmarks We used a collection performance is based on Unix realistic workloads, systems. integer IV disk onds average with a system with was measurements of cleaning 8 shows SunOS for faster and while the disk’s performance faster phases of the the files back; the log-structured It substantially many ual disk [14] and read operations should for a file sequentially seeks after Sprite a the the runs 5.2 test. so the below for for for log; each large block (a newer has two been so its systems it SunOS written for cases. turns them into writes because performs to the in this it individ- groups Sprite for randomly; performance small in all of SunOS except get SunOS. many SunOS equivalent the performance sequential version the that in with than whereas performance in the LFS, 1/0, for of as CPUS be expected because faster 85% 1.2~0 of 4-6 competitive create busy means on workloads writes is also have can same in the the disk as is also in the about factor bandwidth random it a single it another LFS during This fast densely the only data. provides write busy deletes as read files kept though new are the 17% and times Sprite files SunOS efficiency it also reads, ten packs disk even by creates, the no improvement is similar in the improve a higher therefore performance case during benchmark. contrast, for that faster into An used each Section almost system used designed to the blocks require Almost has kept In phase, will 9 shows writes only was is is because file CPU. LFS LFS this create was Sprite sequential groups the Figure files. is LFS the bandwidth large that delete 8b). see of a benchmark and Sprite accesses, performance; LFS benchmark a formatted storage. In quiescent and millisec- was Sprite size. the LFS Figure Although file results of Sprite (see otherwise Sprite during disk usable segment during HBA, 17.5 of (8.7 overhead. Sprite potential was occurred best-care saturating time a one-megabyte SCS13 the while of the a Sun-4\260 a Sun SunOS files. Furthermore, phase of reading by system represent weaknesses bandwidth, SunOS, best-case file do not was megabytes small create created log. of the for order the number and 300 but and systems transfer LFS used cleaning represent Figure was and measurements a large both multiuser no strengths both the whose so they of memory, around size running the for 4.0.3, synthetic maximum For size LFS illustrate having block are used to measure it to SunOS 32 megabytes time). block Sprite they MBytes/see seek eight-kilobyte For benchmarks programs compare machine system four-kilobyte The but (1.3 file and The SPECmarks) Wren benchmark LFS FFS. two file of small of Sprite writes LFS). case The of reading case is substantially the reads lower than SunOS. Figure different file system (sequential tory, 9 illustrates form etc.); the of locality achieves reading it then ACM Transactions fact logical of files, pays that on disk extra than locality a tendency on writes, a log-structured traditional by file file assuming to use certain multiple if necessary, on Computer Systems, Vol. 10, No, 1, February system systems. access files to organize 1992. produces A traditional within patterns a direc- information a The Design and Implementation of a Log-Structured ❑ SpriteLFS Key: ❑ File System . 45 SunOS Files/see (predicted) Files/see (measured) 180 [--------------------------------------------------675 [-------------------------------------------------- . . . . .. -. . ...I.................................1111 I ... ... .. ... .. ... .. .. ... .. ... .. .. 
Figure 9 also illustrates the fact that a log-structured file system produces a different form of locality on disk than traditional file systems. A traditional file system achieves logical locality by assuming certain access patterns (sequential reading of files, a tendency to use multiple files within a directory, and so on); it then pays extra on writes, if necessary, to organize information optimally on disk for the assumed read patterns. In contrast, a log-structured file system achieves temporal locality: information that is created or modified at the same time will be grouped closely on disk. If temporal locality matches logical locality, as it does for a file that is written sequentially and then read sequentially, then a log-structured file system should have about the same read performance as a traditional system. If temporal locality differs from logical locality, the two systems will perform very differently. Sprite LFS handles nonsequential writes more efficiently because it writes them to disk sequentially; SunOS pays more for the random writes at the time they occur in order to achieve logical locality, and it is then faster when the randomly written file is reread sequentially. Random reads have about the same performance in the two systems, even though the blocks are laid out very differently.

5.2 Cleaning Overheads

The micro-benchmark results of the previous section give an optimistic view of the performance of Sprite LFS because they do not include any cleaning overheads (the write cost during the benchmark runs was 1.0). In order to assess the cost of cleaning and the effectiveness of the cost-benefit cleaning policy, we recorded statistics about our production log-structured file systems over a period of several months.
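For reference, the write cost used here is the metric defined earlier in the paper: the total disk traffic, including cleaner reads and writes, per byte of new data written, so a write cost of 1.0 means the cleaner generated no traffic at all. Under the simple steady-state model used in the paper's simulations, cleaning a segment whose live-data utilization is u requires reading the whole segment and rewriting the fraction u that is still live in order to free the fraction 1 - u for new data:

\[ \text{write cost} \;=\; \frac{\text{segments read} + \text{live data rewritten} + \text{new data written}}{\text{new data written}} \;=\; \frac{1 + u + (1-u)}{1-u} \;=\; \frac{2}{1-u}. \]

The production measurements below come in well under this bound largely because many of the cleaned segments are completely empty and need neither reading nor rewriting.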
Five file systems were measured:

/user6       Home directories for Sprite developers. Workload consists of program development, text processing, electronic communication, and simulations.
/pcs         Home directories and project area for research on parallel processing and VLSI circuit design.
/src/kernel  Sources and binaries for the Sprite kernel.
/swap2       Sprite client workstation swap files. Workload consists of virtual memory backing store for 40 diskless Sprite workstations. Files tend to be large, sparse, and accessed nonsequentially.
/tmp         Temporary file storage area for 40 Sprite workstations.

Table II shows statistics gathered during four months of operation. In order to eliminate start-up effects, we waited several months after putting the file systems into production use before beginning the measurements. The behavior of the production file systems has been substantially better than predicted by the simulations in Section 3. Even though the overall disk capacity utilizations ranged from 11% to 75%, more than half of the segments cleaned were totally empty, and even the nonempty segments that were cleaned had utilizations far lower than the overall disk utilizations. The overall write costs ranged from 1.2 to 1.6, in comparison to write costs of 2.5-3 in the corresponding simulations.

Table II. Segment Cleaning Statistics and Write Costs for Production File Systems. For each Sprite LFS file system the table lists the disk size, the average file size, the average daily write traffic rate, the average disk capacity utilization, the total number of segments cleaned over a four-month period, the fraction of the segments that were empty when cleaned, the average utilization of the nonempty segments that were cleaned, and the overall write cost for the period of the measurements. These write cost figures imply that the cleaning overhead limits the long-term write performance to about 70% of the maximum sequential write bandwidth.

Fig. 10. Segment utilization in the /user6 file system. This figure shows the distribution of segment utilizations in a recent snapshot of the /user6 disk. The distribution shows large numbers of fully utilized segments and totally empty segments.

We believe there are two reasons why cleaning costs are lower in Sprite LFS than in the simulations. First, all the files in the simulations were just a single block long. In practice there are a substantial number of longer files, and they tend to be written and deleted as a whole; this produces greater locality within individual segments, and in the best case, where a file is much longer than a segment, deleting the file produces one or more totally empty segments. The second difference between the simulations and reality is that the simulated reference patterns were evenly distributed within the hot and cold file groups. In reality there are large numbers of files that are almost never written; a log-structured file system isolates this very cold data in segments that never receive modifications and thus never need to be cleaned, whereas in the simulations every segment eventually received modifications and had to be cleaned.
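The write costs in Table II can be roughly reproduced from the two cleaning statistics the table reports: the fraction of cleaned segments that were empty and the average utilization of the nonempty ones. The C helper below is a back-of-the-envelope model of our own (it assumes, as the cleaner does, that totally empty segments are recognized from the segment usage table and cost no I/O); it is not code from Sprite LFS, and the exact accounting in the production measurements may differ.

    #include <stdio.h>

    /* Estimate the long-term write cost from segment-cleaning statistics.
     * empty_frac: fraction of cleaned segments that were totally empty.
     * u:          average utilization of the nonempty segments cleaned.
     * Model: a nonempty segment costs one segment of reads plus u of
     * rewritten live data and frees (1 - u) for new data; an empty
     * segment costs nothing and frees a whole segment. */
    static double write_cost(double empty_frac, double u)
    {
        double nonempty       = 1.0 - empty_frac;
        double cleaner_reads  = nonempty;                       /* segments read by the cleaner */
        double cleaner_writes = nonempty * u;                    /* live data rewritten          */
        double new_data       = empty_frac + nonempty * (1.0 - u); /* space freed for new data  */
        return (cleaner_reads + cleaner_writes + new_data) / new_data;
    }

    int main(void)
    {
        /* Example: with two-thirds of cleaned segments empty and the rest
         * at u = 0.13, the estimate is about 1.4, the same range as the
         * measured write costs in Table II. */
        printf("estimated write cost: %.2f\n", write_cost(0.66, 0.13));
        return 0;
    }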
We have not yet analyzed carefully why the production file systems perform so much better than the simulations, and the measurements cover only a few months of activity; if anything, though, we believe the simulations were over-pessimistic rather than the measurements over-optimistic. We also have not yet tried to tune the cleaning policies in Sprite LFS: for example, the cleaner does not yet do its work at night or in other idle periods, and we expect to gain additional improvements when this is done.

5.3 Crash Recovery

Although the crash recovery code has not been installed on the production system, it works well enough to time recovery of various crash scenarios. The time to recover depends on the checkpoint interval and on the rate and type of operations being performed. Table III shows the recovery time for different file sizes and amounts of file data recovered. The different crash configurations were generated by running a program that created one, ten, or fifty megabytes of fixed-size files before the system was crashed; a special version of Sprite LFS was used that had an infinite checkpoint interval and never wrote directory changes to disk. During recovery, roll-forward had to add the created files to the inode map, create the directory entries, and update the segment usage table.

Table III shows that recovery time varies with the number and size of files written between the last checkpoint and the crash. Recovery times can be bounded by limiting the amount of data written between checkpoints. From the average file sizes and daily write traffic in Table II, a checkpoint interval as large as an hour would result in average recovery times of around one second. Using the maximum observed write rate of 150 megabytes/hour, maximum recovery time would grow by one second for every 70 seconds of checkpoint interval length.

Table III. Recovery Time for Various Crash Configurations. Sprite LFS recovery time in seconds:

                        File data recovered
    File size         1 MB     10 MB     50 MB
    1 KB                 1        21       132
    10 KB               <1         3        17
    100 KB              <1         1         8

The table shows the speed of recovery of one, ten, and fifty megabytes of fixed-size files. The system measured was the same one used in Section 5.1. Recovery time is dominated by the number of files to be recovered.
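The checkpoint-interval rule of thumb quoted above is essentially a rate calculation. As a rough check (our own arithmetic, using only the figures quoted in the text and Table III), the amount of data that can ever need roll-forward is bounded by the write rate times the checkpoint interval:

\[ \text{data needing roll-forward} \;\le\; \frac{150\ \text{Mbytes/hour}}{3600\ \text{s/hour}} \times 70\ \text{s} \;\approx\; 2.9\ \text{Mbytes}, \]

and Table III indicates that a few megabytes of files of typical sizes (10 Kbytes and up) can be rolled forward in roughly a second.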
5.4 Other Overheads in Sprite LFS

Table IV shows the relative importance of the various kinds of data written to disk, both in terms of how much of the live data they occupy on disk and in terms of how much of the log bandwidth they consume. More than 99% of the live data on disk consists of file data blocks and indirect blocks. However, about 13% of the information written to the log consists of inodes, inode map blocks, and segment map blocks, all of which tend to be overwritten quickly; the inode map alone accounts for more than 7% of all the data written to the log. We suspect that this is a consequence of the short checkpoint interval currently in use, which forces metadata to disk more often than necessary. We expect the log bandwidth overhead for metadata to drop substantially when roll-forward recovery is installed and the checkpoint interval is increased.

Table IV. Disk Space and Log Bandwidth Usage of /user6, by block type:

    Block type           Live data    Log bandwidth
    Data blocks*             98.0%            85.2%
    Indirect blocks*          1.0%             1.6%
    Inode blocks*             0.2%             2.7%

For each block type, the table lists the percentage of the disk space in use on disk (Live data) and the percentage of the log bandwidth consumed writing this block type (Log bandwidth). The block types marked with '*' have equivalent data structures in Unix FFS.

6. RELATED WORK

The log-structured file system concept and the Sprite LFS design borrow ideas from many different storage management systems. File systems with log-like structures have appeared in several proposals for building file systems on write-once media [5, 20]. Besides writing all changes to the media in an append-only fashion, these systems maintain indexing information much like the Sprite LFS inode map and inodes for quickly locating and reading files. They differ from Sprite LFS in that the write-once nature of the media made it unnecessary for the file systems to reclaim log space.

The segment cleaning approach used in Sprite LFS acts much like the scavenging garbage collectors developed for programming languages [1]. The cost-benefit segment selection and the age sorting of blocks in segments during cleaning separate files into generations, much as generational garbage collection schemes do [11]. A significant difference between these garbage collection schemes and Sprite LFS is that efficient random access is possible in the generational garbage collectors, whereas sequential accesses are necessary to achieve high performance in a file system. Also, Sprite LFS can exploit the fact that blocks can belong to at most one file at a time to use much simpler algorithms for identifying garbage than those used in the programming-language systems.
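The "at most one file" property reduces garbage identification during cleaning to a single lookup: a block in the log is garbage exactly when the owning file's inode no longer points at that log address. The fragment below illustrates the check; the structure and function names are our own toy stand-ins for the segment summary block and inode lookup described earlier in the paper, not the actual Sprite LFS code.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define MAXINODES 16
    #define MAXBLOCKS 8

    /* Toy in-memory stand-in for "the inode's block pointers"; in Sprite
     * LFS this lookup would go through the inode map and the inode. */
    static uint64_t block_addr[MAXINODES][MAXBLOCKS];

    struct summary_entry {        /* one entry of a segment summary block */
        uint32_t inum;            /* i-number of the owning file          */
        uint32_t block_no;        /* block offset within that file        */
    };

    /* A log block is live only if the owning file's inode still points at
     * this log address; otherwise it was overwritten or deleted and the
     * cleaner may discard it. */
    static bool block_is_live(const struct summary_entry *e, uint64_t log_addr)
    {
        return block_addr[e->inum][e->block_no] == log_addr;
    }

    int main(void)
    {
        struct summary_entry e = { 3, 0 };
        block_addr[3][0] = 1000;              /* inode points at log address 1000 */
        printf("%d %d\n", block_is_live(&e, 1000), block_is_live(&e, 500));
        return 0;
    }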
The logging scheme used in Sprite LFS is similar to techniques pioneered in database systems. Almost all database systems use write-ahead logging for crash recovery and high performance [6], but they differ from Sprite LFS in how they use the log. Both Sprite LFS and the database systems view the log as the most up-to-date "truth" about the state of the data on disk. The main difference is that database systems do not use the log as the final repository for the data: a separate data area is reserved for that purpose. The separate data area means that these systems do not need the segment cleaning mechanism of Sprite LFS to reclaim log space; the space occupied by the log in a database system can be reclaimed as soon as the logged changes have been written to their final home locations. Since all read requests are processed from the data area, the log can be greatly compacted without hurting read performance: typically only the changed bytes are written to database logs, rather than entire blocks as in Sprite LFS.

The checkpoint and roll-forward mechanism in Sprite LFS is similar to the redo logs of database systems and of object repositories [15], but its implementation is simplified because the log is the final home of the data: rather than redoing logged operations against a separate copy of the data, roll-forward need only ensure that the indexes point at the newest copy of each block in the log. Collecting data in the file cache and writing it to disk in large writes is similar to the concept of group commit in database systems and to techniques used in main-memory database systems [4, 7, 22].

7. CONCLUSION

The basic principle behind a log-structured file system is a simple one: collect large amounts of new data in a file cache in main memory, then write the data to disk in a single large I/O that can use all of the disk's bandwidth. Implementing this idea is complicated by the need to maintain large free areas on disk, but both our simulation analysis and our experience with Sprite LFS suggest that low cleaning overheads can be achieved with a simple policy based on cost and benefit. Although we developed a log-structured file system to support workloads with many small files, the approach also works very well for large-file accesses; in particular, there is essentially no cleaning overhead at all for very large files that are created and deleted in their entirety.

The bottom line is that a log-structured file system can use disks an order of magnitude more efficiently than existing file systems. This should make it possible to take advantage of several more generations of faster processors
Tex,, In Nov. 1987), pp. 155-162. 9. KAZAR, M. L , LEVERET~, B. W., ANDERSON, O. T , APOSTO1.IDES,V., BOTTOS, B. A CHUTANI, file system S., EVERHART, C. F., MASON, W. A.j Tu, S. T., AND ZAYAS, E. R. Decorum architectural overview. In Proceedings of the USENIX 1990 Summer Conference (Anaheim, Calif., Jun. 1990), pp. 151-164. 10. LAZOWSKA, E. D., ZAHORJAN, J., CHERITON, D. R., AND ZWAENEPOEL, W. File access perforTrans. Comput. Syst. 4, 3 (Aug. 1986), 238-268. mance of diskless workstations. garbage collector based on the lifetimes of 11. LIEBERMAN, H AND HEWITT, C. A real-time ACM. 26, 6 (June 1983), 419-429. objects. Commun. Comput. Syst. 2, 3 (Aug. 1984), A fast file system for Unix. Trans. 12. MCKUSICK, M. K. 181-197. UNIX file 13. MCKUSICK, M K., JoY, W. N., LEFFLER, S. J., AND FABRY, R. S. Fsck–The Manual—4.3 BSD Virtual VAX-11 Version, system check program. Unix System Manager’s USENIX, Berkeley, Calif., Apr. 1986. performance from a UNIX file system. In Proceed14. McVOY, L. AND KLEIMAN, S. Extent-like ings of the USENIX 1991 Winter Conference (Dallas, Tex., Jan, 1991), 15. OKI, B. M., LISKOV, B. H., AND SCHEIFLER, R. W. Reliable object storage to support atomic of the 10th Symposium on Operating Systems Principles (Orcas actions. In Proceedings Island, Wash., Dec. 1985) ACM, pp. 147-159. Why aren’t operating systems getting faster as fast as hardware? In 16. OUSTERHOUT, J. K. Proceedings of the USENIX 1990 Summer Conference (Anaheim, Calif., Jun. 1990), pp. 247-256. ACM Transactions on Computer Systems, Vol. 10, No 1, February 1992. 52 . M. Rosenblum and J. K. Ousterhout 17. OUSTERHOUT, J. K , CHERENSON, A, R., DOUGLIS, F., NELSON, M N , AND WELCH, B. B. The Sprite Network Operating System. IEEE Comput. 21, 2 (Feb. 1988), 23-36. 18. OUSTERHOUT,J. K , DA COSTA, H., HARRISON, D., KUNZE, J A., KUPFER, M., AND THOMPSON, J. G A trace-driven analysis of the Unix 4.2 BSD file system. In Proceedings of the 10th Symposium on Operating Systems prlnctples (Orcas Island, Wash., 1985), ACM, pp. 15-24. A case for redundant arrays of inexpensive 19. PATTERSON,D. A.j GIBSON, G., AND KATZ, R. H disks (RAID). ACM SIGMOD 88, pp. 109-116, (Jun. 1988). 20. REED, D. AND SVOBODOVA, L. SWALLOW: A distributed data storage system for a local Networks for Computer Communieatcons. North-Holland, 1981, pp. network. In Local 355-373. of 21. SANDBERG,R. Design and implementation of the Sun network filesystem. In Proceedings the USENZX 1985 Summer Conference (Portland, Ore , Jun. 1985) pp. 119-130. 22 SALEM, K. AND GARCIA-M• LINA, H, Crash recovery mechamsms for main storage database systems. CS-TR-034-86, Princeton Univ., Princeton, NJ 1986. of the 23. SATYANARAYANAN, M A study of file sizes and functional lifetimes. In Proceedings 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 1981), ACM, pp. 96-108. 24, SELTZER, M. I., CHEN, P, M., AND OUSTERHOUT,J. K Disk scheduling revisited. In Proceedings of the Winter 1990 USENIX Tech meal Conference (Jan. 1990), pp. Received June 1991; revised ACM TransactIons on Computer September Systems, 1991; accepted September Vol. 10, No. 1, February 1991 1992.