Conquest: Better Performance Through A Disk/Persistent-RAM Hybrid File System
USENIX 2002
An-I Andy Wang • Peter Reiher • Gerald Popek
University of California, Los Angeles
Geoffrey Kuenning
Harvey Mudd College

Conquest Overview
  File systems are optimized for disks
    Performance problem
    Complexity
  Now we have tons of inexpensive RAM
    What can we do with that RAM?

Conquest Approach
  Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
  Simplification
    > 20% fewer semicolons than ext2, reiserfs, and SGI XFS
  Performance (under popular benchmarks)
    24% to 1900% faster than LRU disk caching

Motivation
  Most file systems are built for disks
  Problems with the disk assumption:
    Performance
    Complexity

Motivation – Conquest Design – Conquest Components – Performance Evaluation – Conclusion

Hardware Evolution
  [Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year (1990 to 2000) for CPU (50%/yr), memory (50%/yr), and disk (15%/yr). Gap annotations: 10^5 and 10^6; 1 sec : 6 days in 1990; 1 sec : 3 months in 2000.]

Inside the Pandora’s Box
  [Figure labels: disk arm, disk platters]
  Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time

Disk Optimization Methods
  Disk arm scheduling
  Group information on disk
  Disk readahead
  Buffered writes
  Disk caching
  Data mirroring
  Hardware parallelism

Complexity
  [Figure labels: Bytes; predictive readahead; synchronization; cache replacement; elevator algorithm; data consistency; asynchronous write; data clustering]

Storage Media Alternatives
  [Figure: cost in $/MB (log) versus accesses/sec (log). Labels: tape; disk; battery-backed DRAM; (write once) flash memory; persistent RAM; Magnetic RAM?]
  [Caceres et al., 1993; Hillyer et al., 1996; Qualstar 1998; Tanisys 1999; Micron Semiconductor Products 2000; Quantum 2000]

Price Trend of Persistent RAM
  [Figure: price in $/MB (log, 10^-2 to 10^2) versus year (1995 to 2005). Labels: paper/film; persistent RAM; 1” HDD; 2.5” HDD; 3.5” HDD. Annotations: Booming of digital photography; 4 to 10 GB of persistent RAM.]
  [Grochowski 2000]

Old Order; New World
  Disk staying around
  RAM as a viable storage alternative
    Cost, capacity, power, heat
    PDAs, digital cameras, MP3 players
  More architectural changes due to RAM
    A big assumption change from disk
    Rethink data structures, interface, applications

Getting a Fresh Start
  What does it take to design and build a system that assumes ample persistent RAM as the primary storage medium?
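
Backup note on the Hardware Evolution and Pandora’s Box slides. The short Python sketch below compounds the quoted growth rates (50%/yr for CPU and memory, 15%/yr for disk) over ten years and evaluates the access-time sum. The gap widens by roughly 14x, which turns 6 days into about 85 days, in line with the slide's "3 months". The drive parameters (8 ms seek, 7200 rpm, 20 MB/s) and the half-revolution average for rotational delay are illustrative assumptions, not figures from the talk.

```python
def gap_growth(fast_rate: float, slow_rate: float, years: int) -> float:
    """Factor by which the speed gap widens when two technologies
    improve at different compound annual rates."""
    return ((1.0 + fast_rate) / (1.0 + slow_rate)) ** years


def access_time_ms(seek_ms: float, rpm: float, transfer_bytes: int,
                   bandwidth_bytes_per_s: float) -> float:
    """Access time = seek time + rotational delay + transfer time.

    Rotational delay is taken as half a revolution on average,
    which is an assumption of this sketch.
    """
    rotational_ms = 0.5 * 60_000.0 / rpm
    transfer_ms = 1000.0 * transfer_bytes / bandwidth_bytes_per_s
    return seek_ms + rotational_ms + transfer_ms


if __name__ == "__main__":
    growth = gap_growth(0.50, 0.15, 10)
    print(f"gap widens by about {growth:.1f}x over 10 years")
    print(f"a 6-day gap in 1990 becomes about {6 * growth:.0f} days in 2000")
    # Hypothetical drive: 8 ms seek, 7200 rpm, 4 KB read at 20 MB/s.
    print(f"access time: {access_time_ms(8.0, 7200, 4096, 20e6):.2f} ms")
```

With these assumed numbers the mechanical terms (seek and rotation) dominate the total, which is the cost the persistent-RAM path avoids.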

Conquest
  Design and build a disk/persistent-RAM hybrid file system
  Deliver all file system services from memory, with the exception of high-capacity storage
  Benefits:
    Simplicity
    Performance

Simplicity
  Remove disk-related complexities for most files
  Make things simpler for disk as well
  Less complexity
  Fewer bugs
  Easier maintenance
  Shorter data path

Performance Overall
  Memory data path
    All management performed in memory
    No disk-related overhead
  Disk data path
    Faster speed due to simpler access models

Conquest Components
  Media management
  Metadata management
  Allocation service
  Persistence support
  Resiliency support

User Access Patterns
  Small files
    Take little space (10%)
    Represent most accesses (90%)
  Large files
    Take most space
    Mostly sequential accesses
      Except database applications
  [Irlam 1993; Douceur et al., 1999; Roselli et al., 2000]

Files Stored in Persistent RAM
  Small files (< 1 MB)
  Metadata
  No seek time or rotational delays
  Fast byte-level accesses
  Contiguous allocation
  Fast synchronous update
  No dual representations
  Executables and shared libraries
    In-place execution

Memory Data Path of Conquest
  Conventional file systems: Storage requests -> IO buffer management -> IO buffer -> Persistence support -> Disk management -> Disk
  Conquest memory data path: Storage requests -> Persistence support -> Battery-backed RAM (small file and metadata storage)

Large-File-Only Disk Storage
  Allocate in big chunks
    Lower access overhead
    Reduced management overhead
  No fragmentation management
  No tricks for small files
    Storing data in metadata
  No elaborate data structures
    Wrapping a balanced tree onto disk cylinders
  [Devlinux.com 2000]

Sequential-Access Large Files
  Sequential disk accesses
    Near-raw bandwidth
  Well-defined readahead semantics
  Read-mostly
    Little synchronization overhead (between memory and disk)

Disk Data Path of Conquest
  Conventional file systems: Storage requests -> IO buffer management -> IO buffer -> Persistence support -> Disk management -> Disk
  Conquest disk data path: Storage requests -> IO buffer management -> IO buffer -> Disk management -> Disk (large-file-only file system); battery-backed RAM holds small file and metadata storage

Random-Access Large Files
  Random access?
    Common definition: nonsequential access
    A typical movie has 150 scene changes
    MP3 stores the title at the end of the files
  Near-sequential access?
    Simplify large-file metadata representation significantly

Logical File Representation
  [Diagram labels: Name(s); i-node; File attributes; Data; File]

Physical File Representation
  [Diagram labels: Name(s); i-node; File attributes; Data locations; Data blocks; File]

Ext2 Data Representation
  [Diagram: an i-node holding data block locations (labeled 10) followed by index block locations]

Problems with Ext2 Design
  - Designed for disk storage
  - Optimization for small files makes things complex
  - Random-access data structure for large files that are accessed mostly sequentially
  - Data access time dependent on the byte position in a file
  - Maximum file size is limited

Conquest Representation
  Persistent RAM
    Hash(file name) = location of data
    Offset(location of data)
  Disk storage
    Per-file, doubly linked list of disk block segments (stored in persistent RAM)

Conquest Design
  + Direct data access for in-core files
  + Worst case: sequential memory search for infrequent random accesses to on-disk files
  + Maximum file size limited by physical storage

Implementation Status
  Kernel module under Linux 2.4.2
  Fully functional and POSIX compliant
  Modified memory manager to support Conquest persistence
  Preparing for office-wide deployment

Conquest Evaluation
  Architectural simplification
    Feature count
  Performance improvement
    Memory-only workload
    Memory and disk workload

Conventional Data Path
  Conventional file systems: Storage requests -> IO buffer management -> IO buffer -> Persistence support -> Disk management -> Disk
  Features:
    Buffer allocation management
    Buffer garbage collection
    Data caching
    Metadata caching
    Predictive readahead
    Write behind
    Cache replacement
    Metadata allocation
    Metadata placement
    Metadata translation
    Disk layout
    Fragmentation management

Memory Path of Conquest
  Conquest memory data path: Storage requests -> Persistence support -> Battery-backed RAM (small file and metadata storage)
  Features:
    Memory manager encapsulation
    Buffer allocation management
    Buffer garbage collection
    Data caching
    Metadata caching
    Predictive readahead
    Write behind
    Cache replacement
    Metadata allocation
    Metadata placement
    Metadata translation
    Disk layout
    Fragmentation management

Disk Path of Conquest
  Conquest disk data path: Storage requests -> IO buffer management -> IO buffer -> Disk management -> Disk (large-file-only file system); battery-backed RAM holds small file and metadata storage
  Features:
    Buffer allocation management
    Buffer garbage collection
    Data caching
    Metadata caching
    Predictive readahead
    Write behind
    Cache replacement
    Metadata allocation
    Metadata placement
    Metadata translation
    Disk layout
    Fragmentation management

PostMark Benchmark
  ISP workload (emails, web-based transactions)
  Conquest is comparable to ramfs
  At least 24% faster than the LRU disk cache
  [Figure: trans/sec (0 to 9000) versus number of files (5000 to 30000) for SGI XFS, reiserfs, ext2fs, ramfs, and Conquest; 250 MB working set with 2 GB physical RAM]
  [Katcher 1997; Sweeney et al., 1996; Card et al., 1999; Namesys 2002]

PostMark Benchmark
  When both memory and disk components are exercised, Conquest can be several times faster than ext2fs, reiserfs, and SGI XFS
  [Figure: trans/sec (0 to 5000) versus percentage of large files (0.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 3.5 GB working set with 2 GB physical RAM; regions marked "<= RAM" and "> RAM"]

PostMark Benchmark
  When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
  [Figure: trans/sec (0 to 120) versus percentage of large files (6.0 to 10.0) for SGI XFS, reiserfs, ext2fs, and Conquest; 10,000 files, 3.5 GB working set with 2 GB physical RAM]

Lessons Learned
  Faster than LRU caching, unexpected
    Heavyweight disk handling
    Severe penalty for accesses to content
  Matching user access patterns to storage media offers considerable simplification and better performance
    Not an automatic result
    Need careful design

Conclusion
  Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements.
  Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.

Questions...
  Conquest: http://lasr.cs.ucla.edu/conquest
  Andy Wang: awang@cs.ucla.edu
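
Backup sketch for the Conquest Representation and Conquest Design slides. The Python below is a minimal illustration, not the kernel implementation: a dict stands in for "Hash(file name) = location of data", small files are held as byte arrays in memory, and each large file keeps a doubly linked list of disk block segments that is walked sequentially for a random offset. The names `Segment`, `LargeFile`, `placement`, and `namespace`, and the byte-addressed disk model, are hypothetical; only the 1 MB cutoff comes from the slides.

```python
from typing import Dict, Optional, Union

SMALL_FILE_LIMIT = 1 << 20  # 1 MB, the small-file cutoff given on the slides


class Segment:
    """One contiguous run of disk blocks in a per-file doubly linked list."""

    def __init__(self, disk_start: int, length: int) -> None:
        self.disk_start = disk_start  # hypothetical byte address on disk
        self.length = length          # bytes covered by this segment
        self.prev: Optional["Segment"] = None
        self.next: Optional["Segment"] = None


class LargeFile:
    """Metadata for an on-disk file; the segment list itself lives in RAM."""

    def __init__(self) -> None:
        self.head: Optional[Segment] = None
        self.tail: Optional[Segment] = None
        self.size = 0

    def append_segment(self, disk_start: int, length: int) -> None:
        """Link a new segment at the tail of the list."""
        seg = Segment(disk_start, length)
        seg.prev = self.tail
        if self.tail is None:
            self.head = seg
        else:
            self.tail.next = seg
        self.tail = seg
        self.size += length

    def locate(self, offset: int) -> int:
        """Translate a file offset to a disk address by walking the list.

        In the worst case this is a sequential search over all segments.
        """
        if not 0 <= offset < self.size:
            raise ValueError("offset outside file")
        seg = self.head
        while seg is not None:
            if offset < seg.length:
                return seg.disk_start + offset
            offset -= seg.length
            seg = seg.next
        raise RuntimeError("segment list shorter than recorded size")


def placement(size: int) -> str:
    """Choose the medium for a file of the given size in bytes."""
    return "persistent RAM" if size < SMALL_FILE_LIMIT else "disk"


# A Python dict stands in for "Hash(file name) = location of data".
namespace: Dict[str, Union[bytearray, LargeFile]] = {}

if __name__ == "__main__":
    namespace["notes.txt"] = bytearray(b"small file kept in RAM")
    movie = LargeFile()
    movie.append_segment(disk_start=4_000_000, length=2_000_000)
    movie.append_segment(disk_start=9_000_000, length=1_000_000)
    namespace["movie.mpg"] = movie

    print(namespace["notes.txt"][6:10])  # byte-level access within RAM
    print(movie.locate(2_500_000))       # offset falls in the second segment
    print(placement(512), placement(5 << 20))
```

Because the list has one node per segment rather than per block, allocating in big chunks keeps the worst-case walk short, and the file size is bounded only by how many segments the storage can hold.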