Conquest: Preparing for Life After Disks
October 2, 2003
An-I Andy Wang

Conquest Overview
- File systems are optimized for disks
  - Performance problem
  - Complexity
- Now we have tons of inexpensive RAM
  - What can we do with that RAM?

Conquest Approach
- Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
- Simplification
  - At least 20% smaller code base than ext2, reiserfs, and SGI XFS
- Performance (under popular benchmarks)
  - 24% to 1900% faster than LRU disk caching
  - Best performance boost since Berkeley FFS

Performance Problem of Disks
[Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year (1990 to 2000); curves: CPU (50%/yr), memory (50%/yr), disk (15%/yr); annotations: 1990 (1 sec : 6 days), 2000 (1 sec : 3 months)]

Genesis • Conquest Design • Performance Evaluation • Conclusion

Inside Pandora’s Box
[Figure labels: disk arm; disk platters]
- Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time

Disk Optimization Methods
- Disk arm scheduling
- Group information on disk
- Disk readahead
- Buffered writes
- Disk caching
- Data mirroring
- Hardware parallelism

Complexity
[Figure labels: Bytes; predictive readahead; synchronization; cache replacement; elevator algorithm; data consistency; asynchronous write; data clustering]

Storage Media Alternatives
[Figure labels: $/MB (log scale); magnetic RAM?]
[Figure labels, continued: accesses/sec (log scale); tape; disk; battery-backed DRAM; (write once); flash memory; persistent RAM]
[Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Quantum 2000; Micron Semiconductor Products 2002]

The Genesis of Conquest
- Idea: persistent-RAM-only file system
  - Improved performance
  - Remove disk-related complexity

The Genesis of Conquest (2)
- Problem: wrong growth curves
  - Disk prices dropping faster than RAM prices
  - Disks will stay around
[Figure: $/MB (log scale, 10^-2 to 10^2) versus year (1995 to 2005) for persistent RAM and 1", 2.5", and 3.5" HDDs; annotation: booming of digital photography]
[Grochowski 2002]

The Genesis of Conquest (3)
- New idea: hybrid system for transition
  - Takes advantage of RAM speed
  - Still simplifies code
[Figure: the same $/MB versus year plot with paper/film added; annotations: booming of digital photography; 4 to 10 GB of persistent RAM]
[Grochowski 2002]

Conquest Design Questions
- How to make effective use of RAM?
- Where and how to reduce complexity?
- Common usage patterns
- Physical characteristics of RAM storage
- Data paths
- Data structures and associated management
- Shutdown/boot sequence
- How to assure the integrity of file system components that reside in BB-DRAM?
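The hybrid idea behind these design questions can be pictured as a simple placement rule. The sketch below is a toy model under stated assumptions, not Conquest's implementation: it uses the 1 MB small-file cutoff that the design slides state, it assumes a file of exactly 1 MB counts as large, and `HybridStore`, `SMALL_FILE_LIMIT`, and the dictionary-backed "media" are hypothetical names chosen only for illustration.

```python
from dataclasses import dataclass, field

# Cutoff taken from the design slides ("small files (smaller than 1 MB)").
# Treating exactly 1 MB as large is an assumption of this sketch.
SMALL_FILE_LIMIT = 1 << 20


@dataclass
class HybridStore:
    """Toy model of a RAM/disk hybrid; every name here is hypothetical."""
    ram: dict = field(default_factory=dict)       # stands in for persistent RAM
    disk: dict = field(default_factory=dict)      # stands in for the disk
    metadata: dict = field(default_factory=dict)  # metadata always stays in RAM

    def write(self, name: str, data: bytes) -> str:
        """Store a whole file and return the medium that holds its data."""
        medium = "ram" if len(data) < SMALL_FILE_LIMIT else "disk"
        # A rewrite may move a file across the cutoff, so drop any old copy.
        self.ram.pop(name, None)
        self.disk.pop(name, None)
        target = self.ram if medium == "ram" else self.disk
        target[name] = data
        self.metadata[name] = {"size": len(data), "medium": medium}
        return medium

    def read(self, name: str) -> bytes:
        """Consult the in-RAM metadata, then follow one of the two data paths."""
        medium = self.metadata[name]["medium"]
        return (self.ram if medium == "ram" else self.disk)[name]


store = HybridStore()
print(store.write("note.txt", b"x" * 4096))               # prints "ram"
print(store.write("movie.bin", b"x" * (2 * 1024 * 1024)))  # prints "disk"
```

The point of the sketch is only that metadata lookups never touch the disk dictionary, while bulk data for large files never occupies the RAM dictionary.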

User Access Patterns
- Small files
  - Take little space (10%)
  - Represent most accesses (90%)
- Large files
  - Take most space
  - Mostly sequential accesses
- Not characteristic of database applications
[Ousterhout 1985; Baker et al., 1991; Iram 1993; Douceur and Bolosky 1999; Roselli et al., 2000; Evans and Kuenning 2002]

Characteristics of Storage Media
- RAM
  - Fast random accesses
  - Cost-effective in performance
- Disk
  - Fast sequential accesses
  - Cost-effective in storage

The Design of Conquest
- Deliver all file system services from memory, with the exception of high-capacity storage
- Persistent RAM
  - Data content of small files (smaller than 1 MB)
  - Metadata (file descriptions for large and small files, directories, and data structures)
- Disk
  - Data content of large files
- Two separate data paths to memory and disk

Conquest Alternatives
- Disk caching
  - Assumption of scarce memory
  - Use disk as the final storage destination
  - Complex mechanisms to maintain consistency
- RAM drives and RAM file systems
  - Not meant to be persistent
  - Use disk-related mechanisms
  - Limitations on storage capacity
[McKusick et al., 1990; Ganger et al., 2000; Roselli et al., 2000; Seltzer et al., 2000]

Simplification of Data Paths

Content of Persistent RAM
- Data content of small files (< 1 MB)
  - No seek time or rotational delays
  - Fast byte-level accesses
  - Virtual contiguous allocation
- Metadata (e.g., directories, file system states)
  - Fast synchronous update
  - No dual representations
  - For both large and small files

Memory Data Path of Conquest
Conventional File
Systems versus Conquest Memory Data Path
[Figure: two data paths. Conventional file systems: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk. Conquest memory data path: storage requests, persistence support, battery-backed RAM (small file and metadata storage).]

Large-File-Only Disk Storage
- Only store the data content of large files
- Allocate in big chunks
  - Lower access overhead
- Reduced management overhead
  - No fragmentation management
  - No tricks for small files
    - Storing data in metadata
  - No elaborate data structures
    - Wrapping a balanced tree onto disk cylinders
[Namesys 2002]

Sequential-Access Large Files
- Sequential disk accesses
  - Near-raw bandwidth
- Well-defined readahead semantics
- Read-mostly
  - Little synchronization overhead (between memory and disk)

Disk Data Path of Conquest
[Figure: two data paths. Conventional file systems: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk. Conquest disk data path: storage requests, I/O buffer management, I/O buffer, battery-backed RAM (small file and metadata storage), disk management, disk (large-file-only file system).]

Random-Access Large Files
- Random access?
  - Common definition: nonsequential access
  - A typical movie has 150 scene changes
  - MP3 stores the title at the end of the files
- Near sequential access?
  - Simplifies large-file metadata representation significantly
[Baker et al., 1991; Vogels 1999; Roselli et al., 2000]

Simplification of Data Structures

Logical File Representation
[Figure labels: File; Name(s); i-node; File attributes; Data]

Physical File Representation
[Figure labels: File; Name(s); i-node; File attributes; Data locations; Data blocks]

Ext2 Data Representation
[Figure: i-node (stored on disk) holding data block locations (labeled 10) followed by index block locations]

Disadvantages with Ext2 Design
- Optimization for small files makes things complex
- Designed for disk storage
- Random-access data structure for large files that are accessed mostly sequentially
- Data access time dependent on the byte position in a file
- Maximum file size is limited

Conquest Representation
- Persistent RAM
  - Single-level dynamically allocated index
[Figure: i-node (stored in RAM) with an index array location; the index array holds data block locations]
- Fast data access for files stored in RAM

Conquest Representation (2)
- Disk
[Figure labels: i-node (stored in RAM); segment list location; pairs of begin block location and end block location; (stored on disk)]
- Worst case: sequential memory search for random disk locations
- Maximum file size limited by physical storage

Conquest Directories
- Per-directory hash tables stored in memory
- Collisions resolved by rehashing
- Hard links: multiple names point to same data
- Problem: Dynamic resizing of directories
  - Need to
    handle the current file position
  - Important for rm -fr

The Difficulty With Shrinking
[Figure, five animation frames of rm -fr on a directory: the directory i-node (stored in RAM) holds a hash table location; the first frame shows buckets 1000, 1001, and 0110 plus a <deleted> slot, each bucket holding file names and file i-node locations; next, bucket 1000 becomes <deleted>; then the table shrinks to buckets 0110 and 1001; finally to bucket 0110 and an <empty> slot]
- Quick fixes
  - Never shrink hash tables (for rm -fr)
  - No promises for ls while adding files

Extensible Hash Tables
- Use top, not bottom, bits of hash code
[Figure: directory i-node (stored in RAM) with a hash table location; buckets 0110 and 1001 hold file names and file i-node locations]
[Fagin et al., 1979]

Extensible Hash Tables
- Preserve ordering of entries when resizing
[Figure: the same hash table after resizing; slots in order: <empty>, 0110, 1001, <empty>]

Additional Engineering Details
- Dynamic file positioning
- Need to handle collisions
- Memory overhead and complexity tradeoffs

Simplification of Metadata Management

Metadata Allocation Requirements
- Keep track of usage status of metadata entries
- Avoid duplicate allocation with unique IDs
- Fast retrieval of metadata with a given ID
  ID: 30 | free
  ID: 81 | in use
  ID: 58 | free
  ID: 16 | free
  ID: 89 | in use
  ID: 88 | free

Existing Memory Allocation Services
- Keep track of unallocated memory
- No duplicate allocation of physical addresses
- Hmm…
  ADDR 0xe000000 | free
  ADDR 0xe000038 | in use
  ADDR 0xe000070 | free
  ADDR 0xe0000A8 | free
  ADDR 0xe0000E0 | free
  ADDR 0xe000118 | in use

Conquest Metadata Management
- Metadata = memory allocated by memory manager
- Metadata ID = physical address of metadata
- Unique IDs and fast retrieval
  ID: 30 | free      ADDR 0xe000000 | free
  ID: 81 | in use    ADDR 0xe000038 | in use
  ID: 58 | free      ADDR 0xe000070 | free
  ID: 16 | free      ADDR 0xe0000A8 | free
  ID: 89 | in use    ADDR 0xe0000E0 | free
  ID: 88 | free      ADDR 0xe000118 | in use
  [Figure label: Usage status]

Simplification of Shutdown/Boot Sequence

Persistence Support
- Restore file system states after a reboot
  - Data
  - Metadata
- Memory manager
  - Keep track of metadata allocation
  - Reinitialized at boot time
  - No knowledge of persistently allocated data

Linux Memory Manager
- Page allocator maintains individual pages
[Figure layers: page allocator]

Linux Memory Manager (2)
- Zone allocator allocates memory in power-of-two sizes
[Figure layers: zone allocator; page allocator]

Linux Memory Manager (3)
- Slab allocator groups allocations by sizes to reduce internal memory fragmentation
[Figure layers: slab allocator; zone allocator; page allocator]

Memory Allocation Example
- Allocate a 455-byte data structure
  - Slab allocator: one page of data structures
  - Zone allocator: one page from DMA zone
  - Page allocator: page address 0x0000d000

Linux Memory Manager (4)
- Difficult to restore the persistent states
  - Three layers of pointer-rich mappings
  - Mixing of persistent and temporary allocations
[Figure layers: slab allocator; zone allocator; page allocator]

Conquest Persistence
- Create memory zones with own instantiations of memory managers
[Figure layers: slab allocator; zone allocator; page allocator]

Conquest Persistence
- Reuse existing memory manager code
- Encapsulate all pointers within each zone
  - Pointers can survive reboots
  - No serialization and deserialization
- Swapping and paging
  - Disabled for Conquest memory zones
  - Enabled for non-Conquest zones

Integrity of Content in RAM
- User-level program crashes
  - Same file system interface as others
  - Access control
  - Memory protection
- Operating system crashes
  - 1.5% of crashes lead to memory corruption
  - Lose about one data block a decade
[Ng
et al., 1996]

Other Reliability Mechanisms
- Instantaneous metadata commit
- Daily backups
- Pointer-switch commit semantics
[Figure label: pointer]

Implementation Status
- Kernel module under Linux 2.4.2
- Operational and POSIX compliant
- Modified memory manager to support Conquest persistence
- Need to overcome BIOS limitations for distribution

Performance Evaluation
- Architectural simplification
  - Feature count
- Performance improvement
  - Memory-only workloads
  - Memory-and-disk workloads

Conventional Data Path
[Figure: conventional file systems: storage requests, I/O buffer management, I/O buffer, persistence support, disk management, disk]
Features listed: Buffer allocation management; Buffer garbage collection; Data caching; Metadata caching; Predictive readahead; Write behind; Cache replacement; Metadata allocation; Metadata placement; Metadata translation; Disk layout; Fragmentation management

Memory Path of Conquest
[Figure: Conquest memory data path: storage requests, persistence support, battery-backed RAM (small file and metadata storage)]
Features listed: Memory manager encapsulation; Buffer allocation management; Buffer garbage collection; Data caching; Metadata caching; Predictive readahead; Write behind; Cache replacement; Metadata allocation; Metadata placement; Metadata translation; Disk layout; Fragmentation management

Disk Path of Conquest
[Figure: Conquest disk data path: storage requests, I/O buffer management, I/O buffer, battery-backed RAM (small file and metadata storage), disk management, disk (large-file-only file system)]
Features listed: Buffer allocation management; Buffer garbage collection; Data caching; Metadata caching; Predictive readahead; Write behind; Cache replacement; Metadata allocation; Metadata placement; Metadata translation; Disk layout; Fragmentation
management

PostMark Benchmark (1)
- ISP workload (emails, web-based transactions)
- Conquest is comparable to ramfs
- At least 24% faster than the LRU disk cache
[Figure: transactions/sec (0 to 9000) versus number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest; 40 to 250 MB working set with 2 GB physical RAM]
[Card et al., 1994; Sweeney et al., 1996; Katcher 1997; Namesys 2002]

PostMark Benchmark (2)
- When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
[Figure: transactions/sec (0 to 5000) versus percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; regions marked <= RAM and > RAM; 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM]

PostMark Benchmark (3)
- When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
[Figure: transactions/sec (0 to 120) versus percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 80 MB to 3.5 GB working set with 2 GB physical RAM]

Sprite LFS Microbenchmarks
- Small-file benchmark
  - Operates on 10,000 1-KB files in three phases
[Figure: operations/sec (0 to 180000) for the create, read, and delete phases; systems: SGI XFS, reiserfs, ext2fs, ramfs, Conquest]
[Rosenblum and Ousterhout 1991]

Sprite LFS Microbenchmarks (2)
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
[Figure: MB/sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases; systems: SGI XFS, reiserfs, ext2fs, ramfs, Conquest]

Sprite LFS Microbenchmarks (3)
- Modified large-file microbenchmark: ten 1.01-MB files (Conquest on-disk files)
[Figure: MB/sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases; systems: SGI XFS, reiserfs, ext2fs, ramfs, Conquest]

Sprite LFS Microbenchmarks (4)
- Large-file microbenchmark: forty 100-MB files (Conquest on-disk files)
[Figure: MB/sec (0 to 30) for the seq write, seq read, rand write, rand read, and seq read phases; systems: SGI XFS, reiserfs, ext2fs, Conquest]

History’s Mystery
- Puzzling Microbenchmark Numbers…
- Geoff Kuenning: “If Conquest is slower than ext2fs, I will toss you off of the balcony…”

With me hanging off a balcony…
- Original large-file microbenchmark: one 1-MB file (Conquest in-core file)
[Figure: MB/sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases; systems: SGI XFS, reiserfs, ext2fs, ramfs, Conquest]

Odd Microbenchmark Numbers
- Why are random reads slower than sequential reads?
[Figure: the same MB/sec chart]

Odd Microbenchmark Numbers
- Why are RAM-based file systems slower than disk-based file systems?
[Figure: the same MB/sec chart]

A Series of Hypotheses
- Warm-up effect?
- Bad initial states?
- Maybe
- Why do RAM-based systems warm up slower?
- No
- Pentium III streaming I/O option?
- No
[Keshava and Penkovski 1999; Torvalds 2001; Abraham 2002]

Effects of L2 Cache Footprints
[Figure: two panels, large L2 cache footprint and small L2 cache footprint; each panel marks the footprint and the file end for these steps: write a file sequentially, read the same file sequentially, read file, flush]

LFS Sprite Microbenchmarks
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
[Figure: MB/sec (0 to 700) for the seq write, seq read, rand write, rand read, and seq read phases; systems: SGI XFS, reiserfs, ext2fs, ramfs, Conquest]

Related Work
- Main-Memory Databases
  - Memory-based data structures and query mechanisms
- File-system applications of persistent RAM
  - Write buffers
  - Flash-memory-based file systems
  - Disk emulators
  - Rio file cache
  - MRAM enabled storage
[Baker et al., 1992; Garcia-Molina and Salem 1992; Wu and Zwaenepoel 1994; Chen et al., 1996; Riedel 1998; Quantum 2000; Miller et al., 2001]

Related Work (2)
- PDA operating systems
  - Designed with severe memory constraints
- Slice
  - Distributed storage system
  - Dedicated servers for metadata, small files, and large files
[Anderson et al., 2000; Palm 2000; IBM 2002; Microsoft 2002]

Lessons Learned
- Faster than LRU caching, unexpected
  - Heavyweight disk handling
  - Severe penalty for accessing memory content
- Matching user access patterns to storage media offers considerable simplification and better performance
  - Not an automatic result
  - Need careful design

More Lessons Learned
- Effects of L2 caching become highly visible in memory workloads (modern workloads)
- Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
- Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking

Additional Lessons Learned
- Don’t discuss your performance numbers next to a balcony…unless…

Going Beyond Conquest
- Matching usage patterns with heterogeneous machines in the distributed domain
  - Specialized tasks for machines within a cluster
  - Preferably self-organizing and self-evolving
- State-rich computing
  - Caching of runtime data structures
  - Similar to specialized temporary file system

Going Beyond Conquest (2)
- Separate storage of metadata from data
  - Opportunity for hierarchical replication across devices with different calibers
- Benchmarking memory performance of file systems
  - Developing new memory benchmarks
- Why are modern operating systems so complicated?
  - More places to expand Conquest approach

Contributions
- Demonstrated the feasibility of disk-memory hybrid file systems
- Showed performance does not preclude simplicity
- Pinpointed cache-related problems with modern benchmarks
- Opened doors to many exciting areas of research

Conclusion
- Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
- Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.

Questions . . .
Conquest: http://www.cs.fsu.edu/~awang/conquest
Andy Wang: awang@cs.fsu.edu