A framework for implementing IO-bound maintenance applications Eno Thereska PARALLEL DATA LABORATORY Carnegie Mellon University Disk maintenance applications • Lots of disk maintenance apps • data protection (backup) • storage optimization (defrag, load bal.) • caching (write-backs) • Important for system robustness • Background activities • should eventually complete • ideally without interfering with primary apps http://www.pdl.cmu.edu/ 2 Eno Thereska November 2004 Current approaches Client workload Backup throughput MB/s • Implement maintenance application as foreground application • competes for bandwidth or off-hours only http://www.pdl.cmu.edu/ 3 Eno Thereska November 2004 Current approaches Client workload Cache writeback throughput MB/s • Trickle maintenance activity periodically • lost opportunities due to inadequate scheduling decisions http://www.pdl.cmu.edu/ 4 Eno Thereska November 2004 Real support for background applications • Push maintenance activities in the background • priorities and explicit support for them • APIs allow application expressiveness • Storage subsystem does the scheduling • using idle time, if there is any • using otherwise-wasted rotational latency in a busy system http://www.pdl.cmu.edu/ 5 Eno Thereska November 2004 Outline • • • • • Motivation and overview The freeblock subsystem Background application interfaces Example applications Conclusions http://www.pdl.cmu.edu/ 6 Eno Thereska November 2004 The freeblock subsystem • Disk scheduling subsystem supporting new APIs with explicit background requests • Finds time for background activities • by detecting idle time (short and long bursts) • by utilizing otherwise-wasted rotational latency in a busy system http://www.pdl.cmu.edu/ 7 Eno Thereska November 2004 After reading blue sector After BLUE read http://www.pdl.cmu.edu/ 8 Eno Thereska November 2004 Red request scheduled next After BLUE read http://www.pdl.cmu.edu/ 9 Eno Thereska November 2004 Seek to Red’s track After BLUE read http://www.pdl.cmu.edu/ Seek for RED 10 Eno Thereska November 2004 Wait for Red sector to reach head After BLUE read http://www.pdl.cmu.edu/ Seek for RED Rotational latency 11 Eno Thereska November 2004 Read Red sector After BLUE read http://www.pdl.cmu.edu/ Seek for RED Rotational latency 12 After RED read Eno Thereska November 2004 Traditional service time components After BLUE read Seek for RED Rotational latency After RED read • Rotational latency is wasted time http://www.pdl.cmu.edu/ 13 Eno Thereska November 2004 Rotational latency gap utilization After BLUE read http://www.pdl.cmu.edu/ 14 Eno Thereska November 2004 Seek to Third track After BLUE read Seek to Third SEEK http://www.pdl.cmu.edu/ 15 Eno Thereska November 2004 Free transfer After BLUE read SEEK http://www.pdl.cmu.edu/ Seek to Third Free transfer FREE TRANSFER 16 Eno Thereska November 2004 Seek to Red’s track After BLUE read SEEK http://www.pdl.cmu.edu/ Seek to Third Free transfer FREE TRANSFER 17 Seek to RED SEEK Eno Thereska November 2004 Read Red sector After BLUE read SEEK http://www.pdl.cmu.edu/ Seek to Third Free transfer FREE TRANSFER 18 Seek to RED After RED read SEEK Eno Thereska November 2004 Steady background I/O progress 40 from idle time 35 from rotational gaps “Free” MB/s 30 25 20 15 10 5 0 0 10 20 30 40 50 60 70 80 90 100 % disk utilization by foreground (random 4KB) reads/writes http://www.pdl.cmu.edu/ 19 Eno Thereska November 2004 The freeblock subsystem (cont…) • Implemented in FreeBSD • Efficient scheduling • low CPU and memory utilizations • Minimal impact on foreground workloads • < 2% • See refs for more details http://www.pdl.cmu.edu/ 20 Eno Thereska November 2004 Outline • • • • • Motivation and overview The freeblock subsystem Background application interfaces Example applications Conclusions http://www.pdl.cmu.edu/ 21 Eno Thereska November 2004 Application programming interface (API) goals • Work exposed but done opportunistically • all disk accesses are asynchronous • Minimized memory-induced constraints • late binding of memory buffers • late locking of memory buffers • “Block size” can be application-specific • Support for speculative tasks • Support for rate control http://www.pdl.cmu.edu/ 22 Eno Thereska November 2004 API description: task registration application fb_read (addr_range, blksize,…) fb_write (addr_range, blksize,…) Background Foreground background scheduler foreground scheduler http://www.pdl.cmu.edu/ 23 Eno Thereska November 2004 API description: task completion application callback_fn (addr, buffer, flag, …) Background Foreground background scheduler foreground scheduler http://www.pdl.cmu.edu/ 24 Eno Thereska November 2004 API description: late locking of buffers application buffer = getbuffer_fn (addr, …) Background Foreground background scheduler foreground scheduler http://www.pdl.cmu.edu/ 25 Eno Thereska November 2004 API description: aborting/promoting tasks application fb_abort (addr_range, …) fb_promote (addr_range, …) Background Foreground background scheduler foreground scheduler http://www.pdl.cmu.edu/ 26 Eno Thereska November 2004 Complete API http://www.pdl.cmu.edu/ 27 Eno Thereska November 2004 Designing disk maintenance applications • APIs talk in terms of logical blocks (LBNs) • Some applications need structured version • as presented by file system or database • Example consistency issues • application wants to read file “foo” • registers task for inode’s blocks • by time blocks read, file may not exist anymore! http://www.pdl.cmu.edu/ 28 Eno Thereska November 2004 Designing disk maintenance applications • Application does not care about structure • scrubbing, data migration, array reconstruction • Coordinate with file system/database • cache write-backs, LFS cleaner, index generation • Utilize snapshots • backup, background fsck http://www.pdl.cmu.edu/ 29 Eno Thereska November 2004 Outline • • • • • Motivation and overview The freeblock subsystem Background application interfaces Example applications Conclusions http://www.pdl.cmu.edu/ 30 Eno Thereska November 2004 Example #1: Physical backup • Backup done using snapshots snapshot subsystem (in FS) sys_fb_getrecord() getblks() sys_fb_read() backup application freeblock subsystem http://www.pdl.cmu.edu/ 31 Eno Thereska November 2004 Example #1: Physical backup • Experimental setup • • • • • 18 GB Seagate Cheetah 36ES FreeBSD in-kernel implementation PIII with 384MB of RAM 3 benchmarks used: Synthetic, TPC-C, Postmark snapshot includes 12GB of disk • GOAL: read whole snapshot for free http://www.pdl.cmu.edu/ 32 Eno Thereska November 2004 Backup completed for free 90 80 Backup time (mins) 70 60 50 < 2% impact on foreground workload 40 30 20 10 0 Idle system http://www.pdl.cmu.edu/ Synthetic TPC-C 33 Postmark Eno Thereska November 2004 Example #2: Cache write-backs • Must flush dirty buffers • for space reclamation • for persistence (if memory is not NVRAM) • Simple cache manager extensions • • • • fb_write(dirty_buffer,…) getbuffer_fn(dirty_buffer,…) fb_promote(dirty_buffer,…) fb_abort(dirty_buffer,…) http://www.pdl.cmu.edu/ 34 when blocks become dirty calling back to lock when flush is forced when block dies in cache Eno Thereska November 2004 Example #2: Cache write-backs • Experimental setup • • • • • 18 GB Seagate Cheetah 36ES PIII with 384MB of RAM controlled experiments with synthetic workload benchmarks (same as used before) in FreeBSD syncer daemon wakes up every 1 sec and flushes entries that have been dirty > 30secs • GOAL: write back dirty buffers for free http://www.pdl.cmu.edu/ 35 Eno Thereska November 2004 % improvement in avg. resp. time % dirty buffers cleaned for free Foreground read:write has impact 100 80 60 40 20 0 0 1:2 1:1 2:1 read-write ratio http://www.pdl.cmu.edu/ 100 80 60 40 20 0 0 1:2 1:1 2:1 read-write ratio 36 Eno Thereska November 2004 100 80 60 40 20 0 LRU+Syncer http://www.pdl.cmu.edu/ LRU only % improvement in avg. resp. time % dirty buffers cleaned for free 95% of NVRAM cleaned for free 37 100 80 60 40 20 0 LRU+Syncer LRU only Eno Thereska November 2004 50 40 30 20 10 0 Synthetic TPC-C Postmark http://www.pdl.cmu.edu/ % improvement in app. throughput % dirty buffers cleaned for free 10-20% improvement in overall perf. 38 50 40 30 20 10 0 Synthetic TPC-C Postmark Eno Thereska November 2004 Example #3: Layout reorganizer • Layout reorganization improves access latencies • defragmentation is a type of reorganization • typical example of background activity • Our experiment: • disk used is 18GB • we want to defrag up to 20% of it • goal: defrag for free http://www.pdl.cmu.edu/ 39 Eno Thereska November 2004 Disk Layout Reorganized for Free! 600 Reorganization time (mins) Random 500 Circular 400 Track Shuffle 300 200 100 0 1% http://www.pdl.cmu.edu/ 10% 20% 1% 8MB Reorganizer buffer size (MB) 40 10% 64MB 20% Eno Thereska November 2004 Other maintenance applications • • • • • Virus scanner LFS cleaner Disk scrubber Data mining Data migration http://www.pdl.cmu.edu/ 41 Eno Thereska November 2004 Summary • Framework for building background storage applications • Asynchronous interfaces • applications describe what they need • storage subsystem satisfies their needs • Works well for real applications http://www.pdl.cmu.edu/Freeblock/ http://www.pdl.cmu.edu/ 42 Eno Thereska November 2004