Spatial data management over flash memory
Ioannis Koltsidas and Stratis D. Viglas
SSTD 2011, Minneapolis, MN

Flash: a disruptive technology
• Orders of magnitude better performance than HDDs
• Low power consumption
• Dropping prices
• Idea: throw away HDDs and replace everything with flash SSDs
  o Not enough capacity
  o Not enough money to buy even the not-enough capacity
• However, in many settings flash is effectively unavoidable
  o Mobile devices
  o Low-power data centers and clusters
  o Potentially all application areas dealing with spatial data
• We must seamlessly integrate flash into the storage hierarchy
• We need custom, flash-aware solutions

Outline
• Flash-based device design
  o Flash memory
  o Solid state drives
• Spatial data challenges
  o Taking advantage of asymmetry
  o Storage and indexing
  o Buffering and caching

Flash memory cells
• Flash cell: a floating-gate transistor
  (Figure: cross-section of a floating-gate MOSFET showing the control gate, floating gate, oxide layers, source, drain, and P-type silicon substrate, plus the source line and bit line.)
• Two states: floating gate charged or not ('0' or '1')
  o Electrons get trapped in the floating gate
  o The charge changes the threshold voltage (VT) of the cell
• To read: apply a voltage between the possible VT values; the MOSFET channel either conducts ('1') or remains insulating ('0')
• After a number of program/erase cycles, the oxide wears out
• Single-Level-Cell (SLC): one bit per cell
• Multi-Level-Cell (MLC): two or more bits per cell
  o The cell can sense the amount of current flow
  o Programming takes longer and puts more strain on the oxide

Flash memory arrays
• Cells form arrays: NOR or NAND flash, depending on how the cells are connected
• Flash page: the unit of read/program operations (typically 2 kB - 8 kB)
• Flash block: the unit of erase operations (typically 32 - 128 pages)
• Before a page can be re-programmed, the whole block has to be erased first
• Reading a page is much faster than writing it: it takes some time before the cell charge reaches a stable state
• Erasing takes two orders of magnitude more time than reading

                            Consumer MLC (cMLC)   Enterprise MLC (eMLC)   SLC
    Page Read (μsec)        50                    50                      25
    Page Program (μsec)     900                   1500                    250
    Block Erase (μsec)      2000-5000             2000-5000               1500-2000
    Endurance (P/E cycles)  ~3K-5K                ~30K                    ~100K

Flash-based Solid State Drives (SSDs)
• Common I/O interface
  o Block-addressable interface
• No mechanical latency
  o Access latency independent of the access pattern
  o 30 to 50 times more efficient than HDDs in IOPS/$ per GB
• Read/write asymmetry
  o Reads are faster than writes
  o Erase-before-write limitation
• Limited endurance and the need for wear leveling
  o 5-year warranty for enterprise SSDs (assuming 10 complete re-writes per day)
• Energy efficiency
  o 100 - 200 times more efficient than HDDs in IOPS/Watt
• Physical properties
  o Resistance to extreme shock, vibration, temperature, altitude
  o Near-instant start-up time

SSD challenges
• Host interface
  o Flash memory: read_flash_page, program_flash_page, erase_flash_block
  o Typical block device interface: read_sector, write_sector
• Writes in place would kill performance and lifetime
• Solution: perform writes out of place
  o Amortize block erasures over many write operations
  o Writes go to spare, erased blocks; old pages are invalidated
  o Device logical block address (LBA) space ≠ physical block address (PBA) space
• Flash Translation Layer (FTL)
  o Address translation (logical-to-physical mapping)
  o Garbage collection (block reclamation)
  o Wear leveling
  (Figure: logical pages in the LBA space at the device level are mapped by the Flash Translation Layer to flash pages and blocks in the PBA space at the flash-chip level; some blocks are kept as spare capacity.)
  (A minimal code sketch of this scheme follows this slide.)
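To make the out-of-place write scheme and the logical-to-physical mapping concrete, here is a minimal, self-contained Python sketch of a page-level FTL over an in-memory "flash" array, with greedy garbage collection. The class and function names, the tiny geometry, and the victim-selection policy are illustrative assumptions, not a real device interface or a specific FTL; real FTLs add block-level or hybrid mappings, wear leveling, and background cleaning.

    # Minimal page-level FTL sketch: out-of-place writes, logical-to-physical
    # mapping, and greedy garbage collection over an in-memory "flash" array.
    PAGES_PER_BLOCK = 4   # tiny geometry so the example is easy to trace
    NUM_BLOCKS = 8        # keep the logical space smaller than this (spare capacity)

    FREE, VALID, INVALID = "free", "valid", "invalid"

    class PageLevelFTL:
        def __init__(self):
            self.data = [[None] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
            self.state = [[FREE] * PAGES_PER_BLOCK for _ in range(NUM_BLOCKS)]
            self.l2p = {}                                  # logical page -> (block, page)
            self.erases = [0] * NUM_BLOCKS                 # input to wear leveling
            self.free_pages = [(b, p) for b in range(NUM_BLOCKS)
                               for p in range(PAGES_PER_BLOCK)]

        def write(self, lpn, payload):
            """Out-of-place write: invalidate the old copy, program a fresh page."""
            if lpn in self.l2p:
                ob, op = self.l2p[lpn]
                self.state[ob][op] = INVALID               # old version becomes garbage
            if not self.free_pages:
                self._garbage_collect()
            b, p = self.free_pages.pop(0)
            self.data[b][p] = payload                      # "program" the flash page
            self.state[b][p] = VALID
            self.l2p[lpn] = (b, p)                         # update the mapping

        def read(self, lpn):
            b, p = self.l2p[lpn]                           # translate, then "read"
            return self.data[b][p]

        def _garbage_collect(self):
            """Greedy GC: erase the block with the most invalid pages,
            relocating its still-valid pages first."""
            victim = max(range(NUM_BLOCKS),
                         key=lambda b: self.state[b].count(INVALID))
            survivors = [(lpn, self.data[b][p])
                         for lpn, (b, p) in self.l2p.items() if b == victim]
            for lpn, _ in survivors:
                del self.l2p[lpn]
            self.state[victim] = [FREE] * PAGES_PER_BLOCK  # block erase
            self.data[victim] = [None] * PAGES_PER_BLOCK
            self.erases[victim] += 1
            self.free_pages.extend((victim, p) for p in range(PAGES_PER_BLOCK))
            for lpn, payload in survivors:                 # relocate valid data
                self.write(lpn, payload)

    if __name__ == "__main__":
        ftl = PageLevelFTL()
        for i in range(200):                               # repeatedly overwrite 8 LBAs
            ftl.write(i % 8, f"v{i}")
        print(ftl.read(3), ftl.erases)                     # latest version; per-block erase counts

The point of the sketch is that every overwrite of a logical page lands on a fresh physical page, so block erasures are amortized over many writes and only the mapping changes on the fast path.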
Off-the-shelf SSDs
                              A                B                C                 D                E
    Form Factor               PATA Drive       SATA Drive       SATA Drive        SAS Drive        PCI-e card
    Class                     Consumer         Consumer         Consumer          Enterprise       Enterprise
    Flash Chips               MLC              MLC              MLC               SLC              SLC
    Capacity                  32 GB            100 GB           160 GB            140 GB           450 GB
    Read Bandwidth            53 MB/s          285 MB/s         250 MB/s          220 MB/s         700 MB/s
    Write Bandwidth           28 MB/s          250 MB/s         100 MB/s          115 MB/s         500 MB/s
    Random 4kB Read IOPS      3.5k             30k              35k               45k              140k
    Random 4kB Write IOPS     0.01k            10k              0.6k              16k              70k
    Street Price              ~15 $/GB (2007)  ~4 $/GB (2010)   ~2.5 $/GB (2010)  ~18 $/GB (2011)  ~38 $/GB (2009)
• For comparison, a 15k RPM SAS HDD delivers ~250-300 IOPS and a 7.2k RPM SATA HDD ~80 IOPS: SSD random-read IOPS are roughly one to more than two orders of magnitude higher

Work so far: better FTL algorithms
• Hide the complexity from the user by adding intelligence at the controller level
• Great! (for the majority of user-level applications)
• But, as is usually the case, you can't have a one-size-fits-all solution
• Data management applications have a much better understanding of access patterns; file systems don't
• Spatial data management has even more specific needs

Competing goals
• SSD designers assume a generic file system above the device. Goals:
  o Hide the complexities of flash memory
  o Improve performance for generic workloads and I/O patterns
  o Protect their competitive advantage by hiding algorithm and implementation details
• DBMS designers have full control of the I/O issued to the device. Goals:
  o Predictability of I/O operations, independence of hardware specifics
  o Clear characterization of I/O patterns
  o Exploit synergies between query processing and flash memory properties

A (modest) proposal for areas to focus on
• Data structure level
  o Ways of helping the FTL
  o Introduce imbalance to tree structures
  o Trade (cheap) reads for (expensive) writes
• Memory management
  o Add spatial intelligence to the buffer pool
  o Take advantage of work on spatial trajectory prediction
  o Combine with cost-based replacement
  o Prefetch data, delay expensive writes

Turning asymmetry into an advantage
• Common characteristic of all SSDs: low random read latency
• Write speed and throughput differ dramatically across types of device
  o Sometimes write speed is orders of magnitude slower than read speed
• Key idea: if we don't need to write, then we shouldn't
  o Procrastination might pay off in the long term
  o Only write once the cost has been expensed

Read/write asymmetry
• Consider the case where writes are x times more expensive than reads
• This means that for each write we avoid, we "gain" x time units
• Take any R-tree structure and introduce controlled imbalance
• Rebalance when we have expensed the cost
  (Figure: original setup with a parent and an overflowing child; balanced insertion allocates a new sibling under the parent, while unbalanced insertion attaches an overflow area to the overflowing child.)

In more detail
• Parent P, overflowing node L
• On overflow, allocate overflow node S
  o Only the L and S nodes are written, not P
  o Instead of performing three writes (nodes P, L, and S), we perform two (nodes L and S)
  o We have saved 2x time units
• Record at L a counter c
  o Increment it each time we traverse L to get to S
  o Once the counter reaches x, rebalance: the cost has been expensed
  (Figure: P points to L; L carries counter c and overflow node S; rebalance when c > x.)
  (A code sketch of this scheme follows this slide.)
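The deferred-split scheme on the two slides above can be sketched as follows, on a simplified two-level structure in which a flat list of leaves stands in for the R-tree parent P and its children. The DeferredSplitIndex and Leaf names, the key-based routing, and the io_cost_ratio parameter (playing the role of x, the write/read cost ratio) are assumptions made for this illustration, not the authors' implementation.

    # Deferred split: on overflow, chain an overflow node instead of splitting,
    # and only split once the extra read traversals have paid back the saved write.
    LEAF_CAPACITY = 4

    class Leaf:
        def __init__(self):
            self.keys = []           # entries (MBRs + pointers) in a real R-tree
            self.overflow = None     # chained overflow node S, if any
            self.counter = 0         # c: searches that had to follow the chain

    class DeferredSplitIndex:
        def __init__(self, io_cost_ratio):
            self.x = io_cost_ratio   # a write costs x times as much as a read
            self.leaves = [Leaf()]   # this list plays the role of the parent P
            self.page_writes = 0     # node writes issued so far

        def _leaf_for(self, key):
            # Trivial routing: the last leaf whose smallest key is <= key.
            for leaf in reversed(self.leaves):
                if not leaf.keys or min(leaf.keys) <= key:
                    return leaf
            return self.leaves[0]

        def insert(self, key):
            leaf = self._leaf_for(key)
            if len(leaf.keys) < LEAF_CAPACITY:
                leaf.keys.append(key)
                self.page_writes += 1              # write L only
            elif leaf.overflow is None or len(leaf.overflow.keys) < LEAF_CAPACITY:
                if leaf.overflow is None:
                    leaf.overflow = Leaf()         # allocate overflow node S
                leaf.overflow.keys.append(key)
                self.page_writes += 2              # write L and S, but not the parent
            else:
                self._rebalance(leaf)              # both L and S full: split now
                self.insert(key)

        def search(self, key):
            leaf = self._leaf_for(key)
            found = key in leaf.keys               # one read for L
            if leaf.overflow is not None:
                leaf.counter += 1                  # extra read: increment counter c
                found = found or key in leaf.overflow.keys
                if leaf.counter > self.x:          # the cost has been expensed
                    self._rebalance(leaf)
            return found

        def _rebalance(self, leaf):
            # Deferred split: redistribute L + S into two proper siblings and
            # update the parent (the write we postponed).
            keys = sorted(leaf.keys + (leaf.overflow.keys if leaf.overflow else []))
            sibling = Leaf()
            leaf.keys, sibling.keys = keys[:len(keys) // 2], keys[len(keys) // 2:]
            leaf.overflow, leaf.counter = None, 0
            self.leaves.insert(self.leaves.index(leaf) + 1, sibling)
            self.page_writes += 3                  # write L, the new sibling, and P

    if __name__ == "__main__":
        idx = DeferredSplitIndex(io_cost_ratio=10)
        for k in range(12):
            idx.insert(k)
        for _ in range(15):
            idx.search(7)                          # repeated traversals trigger the split
        print(idx.page_writes, len(idx.leaves))

The design choice the sketch highlights is the accounting: each chained search costs one extra read, so after x such traversals the postponed parent write has been paid for and the structure reverts to its balanced form.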
Observations
• If there are no "hotspots" in the R-tree, then we have potentially huge gains
• The method is applicable either at the leaves or at the index nodes
  o The likelihood of rebalancing is proportional to the level at which the imbalance was introduced (i.e., the deeper the level of imbalance, the higher the likelihood)
• Counter-intuitive: the more imbalance, the lower the I/O cost
  o In the worst case, as good as a balanced tree
• Good fit to data access patterns in location-aware spatial services
  o The update rate is relatively low; point queries are highly volatile as users move about an area
• Extensions in hybrid server-oriented configurations
  o Both HDDs and SSDs are used for persistent storage
  o Write-intensive (and potentially unbalanced) nodes are placed on the HDD

Cost-based replacement
• Key idea: combine LRU-based replacement with cost-based algorithms
• The choice of victim depends on the probability of reference (as usual)
• But the eviction cost is not uniform
  o Clean pages bear no write cost; dirty pages result in a write
  o I/O asymmetry: writes are more expensive than reads
• It doesn't hurt if we misestimate the heat of a page, so long as we save (expensive) writes
• Applicable both in SSD-only and in hybrid systems

In more detail
• Starting point: cost-based page replacement
• Divide the buffer pool into two regions
  o Time region: typical LRU
  o Cost region: multiple LRU queues, one per cost class; queues ordered by cost
• Evict from the time region into the cost region
• The final victim is always taken from the cost region
  (Figure: buffer pool split into a time region and a cost region ordered by cost.)
  (A code sketch of this two-region scheme follows the conclusions.)

Location-awareness
• Host of work in wireless networks dealing with trajectory prediction
• Consider the case where services are offered based on user location
  o Primary data are stored in an R-tree
  o User location triggers queries on the R-tree
• User motion creates hotspots (more precisely, hot paths) on the tree structure

Location-aware buffer pool management
• What if the classes of the cost segment track user motion?
  o The lower the utility of a page being in the buffer pool, the higher the eviction cost
  o Utility is correlated with the motion trajectory
• As the user moves about an area, new pages are brought into the buffer pool and older pages are evicted
  o Potentially huge savings if the trajectory is tracked accurately enough
• Flashmobs (pun intended!)
  o Users tend to move in sets into areas of interest
  o Overall response time of the system is minimized
• Recency/frequency of access may not be able to predict future behavior; trajectory tracking potentially will

Conclusions and outlook
• Flash memory and SSDs are becoming ubiquitous
  o Both at the mobile-device and at the enterprise level
• Need for new data structures and algorithms
  o Existing ones target the memory-disk performance bottleneck
  o That bottleneck is smaller with SSDs
  o A new bottleneck has appeared: read/write asymmetry
• Introduce imbalance at the data structure level
  o Trade reads for writes through the allocation of overflow nodes
• Take cost into account when managing main memory
  o Cost-based replacement based on motion tracking and trajectory prediction
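The two-region scheme on the "Cost-based replacement" and "In more detail" slides above can be sketched as follows. This is a minimal illustration under assumed details: the TwoRegionBufferPool name, the fixed region capacities, and the toy cost classes (clean pages cheap, dirty pages expensive, with an optional externally supplied cost such as one derived from trajectory prediction) are choices made for the example, not the authors' implementation.

    # Two-region buffer pool sketch: an LRU time region feeding a cost region
    # with one LRU queue per eviction-cost class; victims always come from the
    # cheapest non-empty class.
    from collections import OrderedDict

    class TwoRegionBufferPool:
        def __init__(self, time_capacity, cost_capacity):
            self.time_capacity = time_capacity
            self.cost_capacity = cost_capacity
            self.time_region = OrderedDict()   # page_id -> frame, in LRU order
            self.cost_region = {}              # eviction-cost class -> LRU OrderedDict

        def access(self, page_id, dirty=False, eviction_cost=None):
            """Reference a page; dirty pages need a (costly) write-back on eviction."""
            frame = self._find_and_remove(page_id) or {"dirty": False}
            frame["dirty"] = frame["dirty"] or dirty
            # Illustrative cost classes: a caller (e.g. a trajectory predictor) may
            # supply the cost; otherwise dirty pages are simply "expensive" (10 vs 1).
            frame["cost"] = eviction_cost if eviction_cost is not None else (10 if frame["dirty"] else 1)
            self.time_region[page_id] = frame  # (re)insert at the MRU position
            self._enforce_capacities()
            return frame

        def _find_and_remove(self, page_id):
            if page_id in self.time_region:
                return self.time_region.pop(page_id)
            for queue in self.cost_region.values():
                if page_id in queue:
                    return queue.pop(page_id)
            return None

        def _enforce_capacities(self):
            # Pages overflowing the time region are demoted into the cost-region
            # queue that matches their eviction-cost class.
            while len(self.time_region) > self.time_capacity:
                page_id, frame = self.time_region.popitem(last=False)
                self.cost_region.setdefault(frame["cost"], OrderedDict())[page_id] = frame
            # The final victim is always taken from the cheapest non-empty class.
            while sum(len(q) for q in self.cost_region.values()) > self.cost_capacity:
                cheapest = min(c for c, q in self.cost_region.items() if q)
                _, victim = self.cost_region[cheapest].popitem(last=False)
                if victim["dirty"]:
                    pass  # write the page back to the SSD before dropping it

    if __name__ == "__main__":
        pool = TwoRegionBufferPool(time_capacity=2, cost_capacity=2)
        for pid in [1, 2, 3, 4, 1, 5]:
            pool.access(pid, dirty=(pid % 2 == 0))

Recency still governs admission to and demotion from the time region, while the cost region decides the final victim, so cheap (clean, low-utility) pages are sacrificed before expensive (dirty, soon-to-be-reused) ones.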