Differentiated Storage Services

Michael Mesnier, Jason Akers, Feng Chen (Intel Corporation)
Tian Luo (The Ohio State University)

23rd ACM Symposium on Operating Systems Principles (SOSP)
October 23-26, 2011, Cascais, Portugal


Technology overview

An analogy: moving & shipping
– Classification
– Policy assignment
– Policy enforcement
Why should computer storage be any different?

Differentiated Storage Services

Classification
[Diagram labels: computer system (applications or DB, operating system, file system) and storage system (storage controller, management firmware, QoS policies, QoS mechanisms, storage pools A, B and C). I/O classification is marked at several layers, one of them "(offline)". Caption: classify each I/O in-band. Legend: marked components = current & future research.]

Policy assignment

  Classifier      QoS policy
  Metadata        Low latency
  Boot files      Low latency
  Small files     High throughput
  Media files     High bandwidth
  …               …

Policy enforcement
– The classifier travels in the SCSI CDB
– 5 bits, 32 classes

Motivation: disk caching with SSDs

Universal challenges in the industry
– Keeping the right data cached
– Avoiding thrash under cache pressure
Conventional approaches
– Cache bypass for large/sequential requests
– Evict cold data (LRU commonly used)
How I/O classification can help
– Identify cacheable I/O classes
– Assign relative caching priorities

Filesystem prototypes (Ext3 & NTFS)

FS classification
[Diagram labels: computer system (applications or DB, operating system, file system) and storage system (storage controller, management firmware, QoS policies, QoS mechanisms, disk, SSD). I/O classification is marked at several layers. Caption: classify each I/O in-band. Legend: marked components = current & future research.]

FS policy assignment

  Classifier       Cache priority
  Metadata         0
  Journal          0
  Directories      0
  Files <= 4KB     1
  Files <= 16KB    2
  Files <= 64KB    3
  …                …
  Files > 1GB      Lowest

FS policy enforcement: performed by the storage system (disk + SSD cache)

Database prototype (PostgreSQL)

DB classification
[Diagram labels: computer system (applications or DB, operating system), with I/O classification marked.]

DB policy assignment
  Classifier                 Cache priority
  System tables              0
  Temp. tables (on write)    1
  Random tables              2
  Temp. tables (on read)     3
  Sequential tables          Bypass
  Index files                Bypass

DB policy enforcement
[Diagram labels: file system and storage system (storage controller, management firmware, QoS policies, QoS mechanisms, disk, SSD), with I/O classification marked. Caption: classify each I/O in-band. Legend: marked components = current & future research.]

Selective cache algorithms

Selective allocation
– Always allocate high-priority classes
   E.g., FS metadata and DB system tables are always allocated
– Conditionally allocate low-priority classes
   Depends on cache pressure, cache contents, etc.
– High/low cutoff is a tunable parameter
Selective eviction
– Evict in priority order (lowest priority first)
   E.g., temporary DB tables are evicted before system tables
– Trivially implemented by managing one LRU per class


Technology development

Ext3 prototype

OS changes (block layer)
– Add classifier to I/O requests
– Only coalesce like-class requests
– Copy classifier into SCSI CDB
Ext3 changes
– 18 classes identified
– Optimized for a file server (small files & metadata)
A small kernel patch; a one-time change to the FS

  Ext3 class        Group number    Cache priority
  Unclassified      0               12
  Superblock        1               0
  Group desc.       2               0
  Bitmap            3               0
  Inode             4               0
  Indirect block    5               0
  Directories       6               0
  Journal           7               0
  File <= 4KB       8               1
  File <= 16KB      9               2
  File <= 64KB      10              3
  …                 …               …
  File > 1GB        18              11

Ext3 classification illustrated

  echo 'Hello, world!' >> foo; sync

  READ_10 (lba 231495    len 8  grp 9)   <=4KB
  WRITE_10(lba 231495    len 8  grp 9)   <=4KB
  WRITE_10(lba 16519223  len 8  grp 8)   Journal
  WRITE_10(lba 16519231  len 8  grp 8)   Journal
  WRITE_10(lba 16519239  len 8  grp 8)   Journal
  WRITE_10(lba 16519247  len 8  grp 8)   Journal
  WRITE_10(lba 8279      len 8  grp 5)   Inode

I/O classification shows read-modify-write and metadata updates
– 7 I/Os (28KB) to write 13 bytes
– Metadata accounts for most of the overhead
NTFS classification is implemented with Windows filter drivers

PostgreSQL prototype

Classification API: scatter/gather I/O

  fd = open("foo", O_RDWR|O_CLASSIFIED, 0666);  /* O_CLASSIFIED: file flag added by the OS patch */
  class = 19;                                   /* 19 = transaction log */
  myiov[0].iov_base = &class;                   /* first element carries the classifier */
  myiov[0].iov_len  = 1;
  myiov[1].iov_base = "Hello, world!";          /* second element carries the data */
  myiov[1].iov_len  = 13;
  writev(fd, myiov, 2);

OS changes (block layer)
– Add O_CLASSIFIED file flag
– Extract classifier from SG I/O
A small OS & DB patch; a one-time change to the OS & DB

Preliminary DB classes

  PostgreSQL class    Group number
  Unclassified        0
  Transaction log     19
  System table        20
  Free space map      21
  Temporary table     22
  Random table        23
  Sequential table    24
  Index file          25
  Reserved            26-31

Cache implementations

Fully associative read/write LRU cache
– Insert(), Lookup(), Delete(), etc.
– Hash table maps disk LBA to SSD LBA
– Syncer daemon asynchronously cleans cache
   Monitors cache pressure for selective allocate
   Maintains multiple LRU lists for selective evict
Front-ends: iSCSI (OS independent) and Linux MD

MD cache module (RAID-9)

  Striping:   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdd /dev/sde
  Mirroring:  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdd /dev/sde
  RAID-9:     mdadm --create /dev/md0 --level=9 --raid-devices=2 <cache> <base>


Evaluation

Experimental setup

Host OS (Xeon, 2-way, quad-core, 12GB RAM)
– Linux 2.6.34 (patched as described)
Target storage system
– HW RAID array + X25-E cache
Workloads and cache sizes
– SPECsfs: 18GB (10% of 184GB working set)
– TPC-H: 8GB (28% of 29GB working set)
Comparison
– LRU versus LRU-S (LRU with selective caching)

SPECsfs I/O breakdown

[Figure: I/O breakdown for LRU and LRU-S]
– Large files pollute LRU cache (metadata and small files evicted)
– LRU-S fences off large file I/O

SPECsfs performance metrics

[Figure panels: hit rate; running time (1.8x speedup); syncer overhead; I/O throughput. Series: HDD, LRU, LRU-S]

SPECsfs file latencies

[Figure: reduction in write latency over HDD, for LRU and LRU-S]
– LRU suffers from write outliers (from eviction overheads)
[Figure: reduction in read latency over HDD, for LRU and LRU-S]
– LRU-S reduces read latency (most small files are cached)

TPC-H I/O breakdown

[Figure: I/O breakdown for LRU and LRU-S]
– Indexes pollute LRU cache (user tables evicted)
– LRU-S fences off index files

TPC-H performance metrics

[Figure panels: hit rate; running time (1.2x speedup); syncer overhead; I/O throughput. Series: HDD, LRU, LRU-S]


Conclusion & future work

Intelligent caching is just the beginning
– Other types of performance differentiation
– Security, reliability, retention, …
Other applications we’re looking at
– Databases
– Hypervisors
– Cloud storage
– Big Data (NoSQL DB)
Work already underway in T10
Open source coming soon…

Thank you! Questions?
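
Backup: selective caching sketch

The selective allocation and selective eviction rules from the "Selective cache algorithms" slide can be illustrated with a small C sketch. This is not the prototype's code. The names (struct cache, cache_insert, pick_victim), the tiny CACHE_SLOTS size and the PRIO_CUTOFF value are hypothetical, "cache pressure" is reduced to "no free slot", and a timestamp scan stands in for the one-LRU-list-per-class structure mentioned on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 4   /* hypothetical: tiny cache so behaviour is easy to trace */
#define PRIO_CUTOFF 1   /* hypothetical high/low cutoff: 0..1 count as high priority */

struct slot {
    int      used;
    uint64_t lba;       /* disk LBA held in this slot */
    int      prio;      /* 0 = highest priority, larger = lower priority */
    uint64_t last_use;  /* logical clock; stands in for position in an LRU list */
};

struct cache {
    struct slot slots[CACHE_SLOTS];
    uint64_t    clock;
};

/* Return the slot caching this LBA (refreshing its recency), or NULL on a miss. */
static struct slot *cache_lookup(struct cache *c, uint64_t lba)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        if (c->slots[i].used && c->slots[i].lba == lba) {
            c->slots[i].last_use = ++c->clock;
            return &c->slots[i];
        }
    }
    return NULL;
}

/* Return an unused slot, or NULL when the cache is full. */
static struct slot *find_free(struct cache *c)
{
    for (size_t i = 0; i < CACHE_SLOTS; i++)
        if (!c->slots[i].used)
            return &c->slots[i];
    return NULL;
}

/* Selective eviction: pick the least recently used entry of the
 * lowest-priority class currently present in the cache. */
static struct slot *pick_victim(struct cache *c)
{
    struct slot *v = NULL;
    for (size_t i = 0; i < CACHE_SLOTS; i++) {
        struct slot *s = &c->slots[i];
        if (!s->used)
            continue;
        if (!v || s->prio > v->prio ||
            (s->prio == v->prio && s->last_use < v->last_use))
            v = s;
    }
    return v;
}

/* Selective allocation: returns 1 if the block was cached, 0 if bypassed. */
static int cache_insert(struct cache *c, uint64_t lba, int prio)
{
    struct slot *s = cache_lookup(c, lba);
    if (!s)
        s = find_free(c);
    if (!s) {
        if (prio > PRIO_CUTOFF)
            return 0;           /* low-priority class under pressure: bypass */
        s = pick_victim(c);     /* high-priority class: always allocate */
        assert(s != NULL);      /* a full cache always has a victim */
    }
    s->used = 1;
    s->lba = lba;
    s->prio = prio;
    s->last_use = ++c->clock;
    return 1;
}
```

With a zero-initialised struct cache, low-priority blocks are admitted while slots are free; once the cache is full they are bypassed, whereas a priority-0 insert displaces the oldest entry of the lowest-priority class present.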