Spatial data management
over flash memory
Ioannis Koltsidas and Stratis D. Viglas
SSTD 2011, Minneapolis, MN
Flash: a disruptive technology

Orders of magnitude better performance than HDDs
Low power consumption
Dropping prices

Idea: throw away HDDs and replace everything with flash SSDs
  Not enough capacity
  Not enough money to buy the not-enough-capacity
However, flash technology is in many ways being forced upon us:
  Mobile devices
  Low-power data centers and clusters
  Potentially all application areas dealing with spatial data

We must seamlessly integrate Flash into the storage hierarchy
Need custom, flash-aware solutions
Outline

Flash-based device design
  Flash memory
  Solid state drives
Spatial data challenges
  Taking advantage of asymmetry
  Storage and indexing
  Buffering and caching

Flash memory cells

Flash cell: a floating gate transistor
[Figure: cross-section of a floating-gate MOSFET, showing the control gate, float gate, oxide layers, N-type source and drain in a P-type silicon substrate, and the source line and bit line contacts]

Two states: float gate charged or not (‘0’ or ‘1’)
  Electrons get trapped in the float gate
  The charge changes the threshold voltage (VT) of the cell
To read: apply a voltage between the possible VT values
  Either the MOSFET channel conducts (‘1’), or it remains insulating (‘0’)
After a number of program/erase cycles, the oxide wears out
Single-Level-Cell (SLC): one bit per cell
Multi-Level-Cell (MLC): two or more bits per cell
  The cell can sense the amount of current flow (to distinguish between levels)
  Programming takes longer and puts more strain on the oxide
Flash memory arrays

Cells are connected to form arrays: NOR or NAND flash, depending on how the cells are connected
Flash page: the unit of read/program operations (typically 2 kB – 8 kB)
Flash block: the unit of erase operations (typically 32 – 128 pages)
Before a page can be re-programmed, the whole block has to be erased first
Reading a page is much faster than programming (writing) it
  It takes some time before the cell charge reaches a stable state
Erasing takes two orders of magnitude more time than reading

                         Consumer MLC (cMLC)   Enterprise MLC (eMLC)   SLC
Page Read (μsec)         50                    50                      25
Page Program (μsec)      900                   1500                    250
Block Erase (μsec)       2000-5000             2000-5000               1500-2000
Endurance (P/E cycles)   ~3K-5K                ~30K                    ~100K
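As a quick illustration of the asymmetry implied by the table above, the following plain Python snippet computes the program/read and erase/read ratios per chip type. The latency values are copied from the table; the block-erase figures assume the midpoints of the quoted ranges.

```python
# Read/program/erase asymmetry from the latency table (microseconds).
latencies = {
    "cMLC": {"read": 50, "program": 900, "erase": 3500},   # erase: midpoint of 2000-5000
    "eMLC": {"read": 50, "program": 1500, "erase": 3500},
    "SLC":  {"read": 25, "program": 250, "erase": 1750},   # erase: midpoint of 1500-2000
}

for chip, t in latencies.items():
    print(f"{chip}: program/read = {t['program'] / t['read']:.0f}x, "
          f"erase/read = {t['erase'] / t['read']:.0f}x")

# cMLC: program/read = 18x, erase/read = 70x
# eMLC: program/read = 30x, erase/read = 70x
# SLC:  program/read = 10x, erase/read = 70x
```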
Flash-based Solid State Drives (SSDs)

Common I/O interface
  Block-addressable interface
No mechanical latency
  Access latency independent of the access pattern
  30 to 50 times more efficient than HDDs in IOPS per $ per GB
Read/write asymmetry
  Reads are faster than writes
  Erase-before-write limitation
  Limited endurance and the need for wear leveling
    5-year warranty for enterprise SSDs (assuming 10 complete re-writes per day)
Energy efficiency
  100 – 200 times more efficient than HDDs in IOPS / Watt
Physical properties
  Resistance to extreme shock, vibration, temperature, altitude
  Near-instant start-up time
SSD challenges

Host interface
  Flash memory: read_flash_page, program_flash_page, erase_flash_block
  Typical block device interface: read_sector, write_sector
  Writing in place would kill both performance and lifetime
Solution: perform writes out-of-place (see the sketch below)
  Amortize block erasures over many write operations
  Writes go to spare, erased blocks; old pages are invalidated
  Device logical block address (LBA) space ≠ physical block address (PBA) space
Flash Translation Layer (FTL)
  Address translation (logical-to-physical mapping)
  Garbage collection (block reclamation)
  Wear-leveling
[Figure: at the device level, the FTL maps logical pages in the LBA space to flash pages and blocks in the PBA space at the flash chip level, keeping some spare capacity aside]
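To make out-of-place writing concrete, here is a minimal page-mapping FTL sketch in Python. It is purely illustrative: the class name, the block/page geometry, and the greedy victim selection are assumptions for exposition, not the algorithm of any particular device or of the talk.

```python
class PageMappingFTL:
    """Illustrative page-mapping FTL: writes go out-of-place to a pre-erased
    active block, the previous copy of a logical page is invalidated, and
    full blocks are reclaimed by garbage collection."""

    def __init__(self, num_blocks=8, pages_per_block=4, spare_blocks=2):
        self.pages_per_block = pages_per_block
        self.spare_blocks = spare_blocks                # reserved for out-of-place writes
        self.l2p = {}                                   # logical page number -> (block, page)
        self.valid = [[False] * pages_per_block for _ in range(num_blocks)]
        self.erase_counts = [0] * num_blocks            # input to a wear-leveling policy
        self.free_blocks = list(range(1, num_blocks))   # erased blocks
        self.active, self.next_page = 0, 0              # block currently being programmed

    def write(self, lpn):
        """Out-of-place write of logical page lpn (write_sector at the host)."""
        if lpn in self.l2p:                             # invalidate the previous version
            b, p = self.l2p[lpn]
            self.valid[b][p] = False
        if len(self.free_blocks) < self.spare_blocks:   # running low on spare capacity
            self._garbage_collect()
        self._program(lpn)

    def _program(self, lpn):
        """Program the next free flash page of the active block for lpn."""
        if self.next_page == self.pages_per_block:      # active block is full
            self.active = self.free_blocks.pop(0)
            self.next_page = 0
        self.l2p[lpn] = (self.active, self.next_page)   # program_flash_page happens here
        self.valid[self.active][self.next_page] = True
        self.next_page += 1

    def _garbage_collect(self):
        """Reclaim the block with the fewest valid pages: relocate those pages,
        erase the block, and return it to the free pool."""
        candidates = [b for b in range(len(self.valid))
                      if b != self.active and b not in self.free_blocks]
        victim = min(candidates, key=lambda b: sum(self.valid[b]))
        for p in range(self.pages_per_block):
            if self.valid[victim][p]:
                lpn = next(l for l, loc in self.l2p.items() if loc == (victim, p))
                self.valid[victim][p] = False
                self._program(lpn)                      # copy still-valid data forward
        self.erase_counts[victim] += 1                  # one erase_flash_block, amortized
        self.free_blocks.append(victim)
```

Real FTLs differ widely (block- or hybrid-mapping, background cleaning, wear-aware victim choice), which is exactly why higher layers cannot easily predict their behavior.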
Off-the-shelf SSDs

                        A            B            C            D            E
Form Factor             PATA Drive   SATA Drive   SATA Drive   SAS Drive    PCI-e card
Class                   Consumer     Consumer     Consumer     Enterprise   Enterprise
Flash Chips             MLC          MLC          MLC          SLC          SLC
Capacity                32 GB        100 GB       160 GB       140 GB       450 GB
Read Bandwidth          53 MB/s      285 MB/s     250 MB/s     220 MB/s     700 MB/s
Write Bandwidth         28 MB/s      250 MB/s     100 MB/s     115 MB/s     500 MB/s
Random 4kB Read IOPS    3.5k         30k          35k          45k          140k
Random 4kB Write IOPS   0.01k        10k          0.6k         16k          70k
Street Price            ~15 $/GB     ~4 $/GB      ~2.5 $/GB    ~18 $/GB     ~38 $/GB
                        (2007)       (2010)       (2010)       (2011)       (2009)

For comparison, a 15k RPM SAS HDD delivers ~250-300 IOPS and a 7.2k RPM SATA HDD ~80 IOPS: the random read IOPS of these SSDs are roughly one to more than two orders of magnitude higher.
Work so far: better FTL algorithms

Hide the complexity from the user by adding intelligence at the controller level
Great! (for the majority of user-level applications)
But, as is usually the case, you can't have a one-size-fits-all solution
Data management applications have a much better understanding of access patterns
  File systems don't
  Spatial data management has even more specific needs
Competing goals

SSD designers assume a generic filesystem above the device
Goals:
  Hide the complexities of flash memory
  Improve performance for generic workloads and I/O patterns
  Protect their competitive advantage by hiding algorithm and implementation details

DBMS designers have full control of the I/O issued to the device
Goals:
  Predictability for I/O operations, independence of hardware specifics
  Clear characterization of I/O patterns
  Exploit synergies between query processing and flash memory properties
A (modest) proposal for areas to focus on

Data structure level
  Ways of helping the FTL
  Introduce imbalance to tree structures
  Trade (cheap) reads for (expensive) writes
Memory management
  Add spatial intelligence to the buffer pool
  Take advantage of work on spatial trajectory prediction
  Combine with cost-based replacement
  Prefetch data, delay expensive writes
Turning asymmetry into an advantage

Common characteristic of all SSDs: low random read latency
Write speed and throughput differ dramatically across types of device
  Sometimes write speed is orders of magnitude slower than read speed
Key idea: if we don't need to write, then we shouldn't
  Procrastination might pay off in the long term
  Only write if the cost has been expensed
Read/write asymmetry

Consider the case where writes are x times more expensive than reads
  This means that for each write we avoid, we “gain” x time units
Take any R-tree structure and introduce controlled imbalance
  Rebalance when we have expensed the cost
[Figure: original setup with a parent and an overflowing child; a balanced insertion writes the parent, the child, and a newly allocated sibling; an unbalanced insertion instead attaches an overflow area to the child]
In more detail

Setup: parent P and overflowing node L
On overflow, allocate overflow node S
  Instead of performing three writes (nodes P, L, and S), we perform two (nodes L and S): only L and S are written, not P
  We have saved 2x time units
Record at L a counter c
  Increment it each time we traverse L to get to S
  Once the counter reaches x, rebalance (i.e., rebalance when c > x)
  By then the cost has been expensed
[Figure: L points to its overflow node S while P is left untouched; L keeps counter c, and the node is rebalanced once c > x]
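Below is a minimal Python sketch of the overflow-node scheme. The node layout, the capacity, and the matches/write_node helpers are hypothetical; the snippet only illustrates how the parent write is avoided and then charged back through the counter c, not the authors' actual R-tree implementation.

```python
CAPACITY = 4   # maximum entries per node (illustrative)
X = 10         # write/read cost ratio: one avoided write "buys" X extra reads

class Node:
    def __init__(self):
        self.entries = []      # (mbr, child/record) pairs in a real R-tree
        self.overflow = None   # overflow node S, if any
        self.counter = 0       # c: traversals of the L -> S link so far

def write_node(node):
    """Placeholder for persisting a node to flash; each call costs X read-units."""
    pass

def matches(entry, query):
    """Placeholder predicate: does this entry satisfy the (point) query?"""
    return entry == query

def insert(parent, leaf, entry):
    """Unbalanced insertion: on overflow, write only L and S, not the parent."""
    if len(leaf.entries) < CAPACITY:
        leaf.entries.append(entry)
        write_node(leaf)
    elif leaf.overflow is None:
        leaf.overflow = Node()                        # allocate overflow node S
        leaf.overflow.entries.append(entry)
        write_node(leaf); write_node(leaf.overflow)   # parent P is not written
    else:
        leaf.overflow.entries.append(entry)           # keep filling the overflow area
        write_node(leaf.overflow)

def search(parent, leaf, query):
    """Point query through L; following the L -> S link costs an extra read,
    so increment c and rebalance once the saved write cost is expensed."""
    hits = [e for e in leaf.entries if matches(e, query)]
    if leaf.overflow is not None:
        leaf.counter += 1                             # one extra read paid
        hits += [e for e in leaf.overflow.entries if matches(e, query)]
        if leaf.counter > X:
            rebalance(parent, leaf)                   # the cost has been expensed
    return hits

def rebalance(parent, leaf):
    """Fold the overflow node back in with an ordinary balanced split,
    this time writing the parent as well."""
    entries = leaf.entries + leaf.overflow.entries
    sibling = Node()
    leaf.entries, sibling.entries = entries[:len(entries) // 2], entries[len(entries) // 2:]
    leaf.overflow, leaf.counter = None, 0
    parent.entries.append(("mbr(sibling)", sibling))
    write_node(parent); write_node(leaf); write_node(sibling)
```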
Observations

If there are no “hotspots” in the R-tree then we have potentially huge gains
  Counter-intuitive: the more imbalance, the lower the I/O cost
  In the worst case, as good as a balanced tree
Method is applicable either at the leaves or at the index nodes
  Likelihood of rebalancing is proportional to the level at which the imbalance was introduced (i.e., the deeper the level of imbalance, the higher the likelihood)
Good fit to the data access patterns of location-aware spatial services
  Update rate is relatively low; point queries are highly volatile as users move about an area
Extensions in hybrid server-oriented configurations
  Both HDDs and SSDs are used for persistent storage
  Write-intensive (and potentially unbalanced) nodes placed on the HDD
Cost-based replacement

Choice of victim depends on probability of reference (as usual)
But the eviction cost is not uniform
  Clean pages bear no write cost; dirty pages result in a write
  I/O asymmetry: writes are more expensive than reads
It doesn't hurt if we misestimate the heat of a page
  So long as we save (expensive) writes
Key idea: combine LRU-based replacement with cost-based algorithms
  Applicable both in SSD-only and in hybrid systems
In more detail

Starting point: cost-based page replacement
Divide the buffer pool into two regions (see the sketch below)
  Time region: typical LRU
  Cost region: multiple LRU queues, one per cost class, with the queues ordered by cost
Evict from the time region into the cost region
The final victim is always taken from the cost region
[Figure: pages move from the time region into the cost region, whose queues are ordered by eviction cost]
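A minimal sketch of this two-region buffer pool in Python follows. The Page fields, the capacities, and the write_back stub are assumptions for illustration; the point is the structure: an LRU time region demotes its victim into a per-cost-class queue, and the final victim always comes from the cheapest non-empty class.

```python
from collections import OrderedDict
from dataclasses import dataclass

@dataclass
class Page:
    data: bytes
    dirty: bool = False
    cost_class: int = 0      # e.g. 0 = clean (cheap to evict), higher = more costly

def write_back(page_id, page):
    """Placeholder for flushing a dirty page to the SSD (the expensive write)."""
    pass

class TwoRegionBufferPool:
    """Sketch of the two-region buffer pool: an LRU time region demotes its
    LRU page into a cost region of per-class LRU queues; the final victim
    is always the LRU page of the cheapest non-empty cost class."""

    def __init__(self, time_capacity=64, cost_capacity=64):
        self.time_capacity = time_capacity
        self.cost_capacity = cost_capacity
        self.time_region = OrderedDict()       # page_id -> Page, in LRU order
        self.cost_region = {}                  # cost class -> OrderedDict of pages

    def access(self, page_id, page):
        # A (re-)referenced page moves to the MRU end of the time region.
        existing = self.time_region.pop(page_id, None)
        for queue in self.cost_region.values():
            existing = queue.pop(page_id, existing)
        self.time_region[page_id] = existing or page
        if len(self.time_region) > self.time_capacity:
            # Demote the LRU page of the time region into its cost class.
            victim_id, victim = self.time_region.popitem(last=False)
            self.cost_region.setdefault(victim.cost_class, OrderedDict())[victim_id] = victim
        if sum(len(q) for q in self.cost_region.values()) > self.cost_capacity:
            self._evict()

    def _evict(self):
        # Final victim: the LRU page of the lowest-cost non-empty class.
        for cost_class in sorted(self.cost_region):
            queue = self.cost_region[cost_class]
            if queue:
                victim_id, victim = queue.popitem(last=False)
                if victim.dirty:               # only now pay the expensive flash write
                    write_back(victim_id, victim)
                return victim_id
```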
Location-awareness

A host of work in wireless networks deals with trajectory prediction
Consider the case where services are offered based on user location
  Primary data are stored in an R-tree
  User location triggers queries on the R-tree
User motion creates hotspots (more precisely, hot paths) on the tree structure
Location-aware buffer pool management

What if the classes of the cost segment track user motion? (one possible mapping is sketched below)
  The lower the utility of the page being in the buffer pool, the higher the eviction cost
  Utility correlated with motion trajectory
  As the user moves about an area, new pages are brought into the buffer pool and older pages are evicted
Potentially huge savings if the trajectory is tracked accurately enough
Flashmobs (pun intended!)
  Users tend to move in sets into areas of interest
  Overall response time of the system is minimized
Recency/frequency of access may not be able to predict future behavior; trajectory tracking potentially will
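As one sketch of how the cost segment could track user motion, the snippet below assigns buffered pages to the cost classes of the buffer pool sketched earlier, according to how strongly their MBR overlaps the predicted trajectory. The helper names, the page MBRs, and the thresholds are all hypothetical; the assumed policy is that pages on the predicted path (and dirty pages, which would require an expensive write) land in the most expensive classes and are evicted last, even if recency-based heuristics would rank them cold.

```python
def mbr_contains(mbr, point):
    """Axis-aligned containment test; mbr = (xmin, ymin, xmax, ymax)."""
    (xmin, ymin, xmax, ymax), (x, y) = mbr, point
    return xmin <= x <= xmax and ymin <= y <= ymax

def path_affinity(page_mbr, predicted_path):
    """Fraction of predicted positions that fall inside the page's MBR."""
    if not predicted_path:
        return 0.0
    hits = sum(1 for point in predicted_path if mbr_contains(page_mbr, point))
    return hits / len(predicted_path)

def cost_class(page_mbr, predicted_path, dirty):
    """Map trajectory affinity and dirtiness to an eviction-cost class:
    off-path clean pages are the cheapest victims; on-path or dirty pages
    are kept the longest."""
    affinity = path_affinity(page_mbr, predicted_path)
    if affinity == 0.0 and not dirty:
        return 0                  # cold and clean: evict first
    if affinity < 0.5 and not dirty:
        return 1
    return 2 if not dirty else 3  # on the user's hot path and/or dirty: evict last
```

For example, a clean page whose MBR covers two of three predicted positions has affinity 2/3 and lands in class 2, so it is evicted only after colder pages.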
Conclusions and outlook

Flash memory and SSDs are becoming ubiquitous
  Both at the mobile device and at the enterprise levels
Need for new data structures and algorithms
  Existing ones target the memory-disk performance bottleneck
  That bottleneck is smaller with SSDs
  A new bottleneck has appeared: read/write asymmetry
Introduce imbalance at the data structure level
  Trade reads for writes through the allocation of overflow nodes
Take cost into account when managing main memory
  Cost-based replacement based on motion tracking and trajectory prediction