Gecko Storage System
Tudor Marian, Lakshmi Ganesh, and Hakim Weatherspoon
Cornell University

Gecko
• Save power by spinning down / powering down disks
  – E.g., a RAID-1 mirror scheme with 5 primaries/mirrors
  – The file system (FS) access pattern on disk is arbitrary
    • Depends on FS internals, and gets worse as the FS ages
  – When to turn disks off? What if the prediction is wrong?
[Figure: write(fd,…) and read(fd,…) requests arriving at the block device]

Predictable Writes
• Access the same disks predictably for long periods
  – Amortize the cost of spinning disks down and up
• Idea: log-structured storage / file system
  – Writes go to the head of the log until the disk(s) fill up
[Figure: writes appended at the log head of the block device; the log grows from tail to head]

Unpredictable Reads
• What about reads? They may access any part of the log!
  – Keep only the "primary" disks spinning
• Trade off read throughput for power savings
  – Can afford to spin up disks on demand as load surges
• The file/buffer cache absorbs read traffic anyway

Stable Throughput
• Unlike LFS, reads do not interfere with writes
  – Keep data from the head (recently written) disks in the file cache
  – Log cleaning is not on the critical path
    • Can afford to incur the penalty of on-demand disk spin-up
    • Return reads from the primary, clean the log from the mirror

Design
[Figure: Linux storage stack — Virtual File System (VFS), file/buffer cache, filesystem / file-mapping layer, generic block layer, device mapper (where dm-gecko plugs in), I/O scheduling layer (anticipatory, CFQ, deadline, noop), block device drivers, disks]

Design Overview
• Log-structured storage at the block level
  – Akin to SSD wear-leveling
    • Actually, it supersedes the on-chip wear-leveling of SSDs
  – The design works with RAID-1, RAID-5, and RAID-6
    • RAID-5 ≈ RAID-4 due to the append-only nature of the log
      – The parity drive(s) are not a bottleneck, since writes are appends
  – Prototyped as a Linux kernel dm (device-mapper) target
    • A real, high-performance, deployable implementation

Challenges
• dm-gecko
  – All IO requests at this storage layer are asynchronous
  – SMP-safe: leverages all available CPU cores
  – Maintains large in-core (RAM) memory maps
    • Kept in battery-backed NVRAM and persistently stored on SSD
    • Map: virtual block <-> linear block <-> disk block (8 sectors)
    • To keep the maps manageable: block size = page size (4 KB)
      – The FS layered atop uses block size = page size
  – Log cleaning / garbage collection (gc) runs in the background
    • Efficient cleaning policy: clean when spare write IO capacity is available

Commodity Architecture
• Dell PowerEdge R710
• Dual-socket multi-core CPUs
• Battery-backed RAM
• OCZ RevoDrive PCIe x4 SSD
• 2 TB Hitachi HDS72202 disks

dm-gecko
• In-memory map (one level of indirection)
• Virtual block: the conventional block array exposed to the VFS
• Linear block: the collection of blocks structured as a log
  – Circular ring structure
• E.g., READs are simply indirected through the map
[Figure: virtual block device indirected onto the linear block device, with log head, log tail, free blocks, and used blocks]

dm-gecko
• WRITE operations append at the log head
  – Allocate/claim the next free block
    • Schedule log compacting/cleaning (gc) if necessary
  – Dispatch the write IO on the new block
• Update the maps and the log on IO completion

dm-gecko
• TRIM operations free the block
• Schedule log compacting/cleaning (gc) if necessary
  – Fast-forward the log tail if the tail block was trimmed
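To make the indirection concrete, here is a minimal, single-threaded user-space C sketch of the READ/WRITE/TRIM paths just described. It assumes a direct virtual-to-linear map plus a reverse map, and it omits IO dispatch, free-space accounting, and gc scheduling; the names (gecko_map, gecko_read, gecko_write, gecko_trim) are illustrative, not the actual dm-gecko kernel symbols.

/*
 * Minimal sketch of the dm-gecko one-level indirection (illustrative names;
 * the real target is asynchronous, SMP-safe kernel code that keeps the maps
 * in battery-backed NVRAM and persists them on SSD).
 */
#include <stdint.h>

#define INVALID_BLOCK UINT32_MAX

struct gecko_map {
    uint32_t *v2l;       /* virtual block -> linear (log) block            */
    uint32_t *l2v;       /* linear block  -> virtual block, for gc/trim    */
    uint32_t nr_blocks;  /* capacity of the linear device in 4 KB blocks   */
    uint32_t head;       /* next free linear block (log head)              */
    uint32_t tail;       /* oldest live linear block (log tail)            */
};

/* READ: simply indirect the virtual block through the map. */
uint32_t gecko_read(const struct gecko_map *m, uint32_t vblock)
{
    return m->v2l[vblock];                        /* INVALID_BLOCK if never written */
}

/* WRITE: append at the log head; update maps on IO completion. */
uint32_t gecko_write(struct gecko_map *m, uint32_t vblock)
{
    uint32_t old_lb = m->v2l[vblock];
    uint32_t new_lb = m->head;                    /* claim the next free block */
    m->head = (m->head + 1) % m->nr_blocks;       /* circular ring structure   */
    /* (no free-space check here; the real code schedules gc before the log
     *  fills and dispatches the write IO asynchronously) */
    m->v2l[vblock] = new_lb;
    m->l2v[new_lb] = vblock;
    if (old_lb != INVALID_BLOCK)
        m->l2v[old_lb] = INVALID_BLOCK;           /* old copy becomes garbage */
    return new_lb;
}

/* TRIM: free the block; fast-forward the tail over dead blocks. */
void gecko_trim(struct gecko_map *m, uint32_t vblock)
{
    uint32_t old_lb = m->v2l[vblock];
    m->v2l[vblock] = INVALID_BLOCK;
    if (old_lb != INVALID_BLOCK)
        m->l2v[old_lb] = INVALID_BLOCK;
    while (m->tail != m->head && m->l2v[m->tail] == INVALID_BLOCK)
        m->tail = (m->tail + 1) % m->nr_blocks;   /* fast-forward the log tail */
}

Note that a write to a previously written virtual block leaves the old linear block behind as garbage; compacting such blocks is exactly the job of the background cleaner described next.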
Log Cleaning
• Garbage collection (gc) compacts blocks
  – Relocate the used block that is closest to the tail
    • Repeat until the log is compact enough (e.g., below a watermark) or fully contiguous
  – Use spare IO capacity; do not run when the IO load is high
  – More than enough CPU cycles to spare (e.g., 2x quad-core)

Gecko IO Requests
• All IO requests at the storage layer are asynchronous
  – The storage stack is allowed to reorder requests
  – The VFS, file-system mapping, and file/buffer cache play nicely
  – Uncooperative processes may trigger inconsistencies
    • Read/write and write/write conflicts are fair game
• Log cleaning interferes with storage-stack requests
  – SMP-safe solution that leverages all available CPU cores
  – Request ordering is enforced as needed
    • At block granularity

Request Ordering
• Block b has no prior pending requests
  – Allow a read or write request to run; mark the block with 'pending IO'
  – Allow gc to run; mark the block as 'being cleaned'
• Block b has prior pending read/write requests
  – Allow read or write requests; track the number of 'pending IOs'
  – If gc needs to run on block b, defer it until all read/write requests have completed (zero 'pending IOs' on block b)
• Block b is being relocated by the gc
  – Discard gc requests on the same block b (this case does not actually occur)
  – Defer all read/write requests until gc has completed on block b
(A minimal sketch of these per-block rules is given at the end of this section.)

Limitations
• In-core memory map (there are two maps)
  – A simple, direct map requires lots of memory
  – A multi-level map is complex
    • Akin to virtual memory paging, only simpler
      – Fetch large portions of the map on demand from the larger SSD
• The current prototype uses two direct maps:

  Linear (total) disk capacity | Block size | # of map entries | Size of map entry | Memory per map
  6 TB                         | 4 KB       | 3 x 2^29         | 4 bytes / 32 bits | 6 GB
  8 TB                         | 4 KB       | 2^31             | 4 bytes / 32 bits | 8 GB
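As a companion to the Request Ordering slide above, the fragment below sketches the per-block bookkeeping in plain C. It is a simplified, single-threaded illustration under assumed names (block_state, submit_rw, submit_gc, and so on); the real dm-gecko implementation enforces these rules SMP-safely and queues deferred requests so they are re-driven once the conflicting IO or gc completes.

/* Per-block request-ordering rules (illustrative sketch, not kernel code). */
#include <stdbool.h>
#include <stdint.h>

struct block_state {
    uint32_t pending_io;    /* outstanding read/write requests on this block */
    bool being_cleaned;     /* block is currently being relocated by the gc  */
};

/* A read/write request arrives for block b; returns true if it may run now. */
bool submit_rw(struct block_state *b)
{
    if (b->being_cleaned)
        return false;        /* defer until gc has completed on this block */
    b->pending_io++;         /* mark the block with 'pending IO' */
    return true;
}

void complete_rw(struct block_state *b)
{
    b->pending_io--;         /* reaching zero re-enables gc on this block */
}

/* The gc wants to relocate block b; returns true if cleaning may start now. */
bool submit_gc(struct block_state *b)
{
    if (b->being_cleaned)
        return false;        /* duplicate gc request: discard (does not occur) */
    if (b->pending_io > 0)
        return false;        /* defer until all pending read/writes complete */
    b->being_cleaned = true; /* mark the block as 'being cleaned' */
    return true;
}

void complete_gc(struct block_state *b)
{
    b->being_cleaned = false;  /* deferred read/writes may now proceed */
}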