Lecture 25
Aditya Akella
• BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers)
• SILT: more “traditional” key-value store
Cheap and Large CAMs for High
Performance Data-Intensive
Networked Systems
Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and
Aditya Akella
University of Wisconsin-Madison
Suman Nath
Microsoft Research
New data-intensive networked systems
Large hash tables (10s to 100s of GBs)
New data-intensive networked systems
• Example: WAN optimizers
[Figure: a branch office and a data center connected over a WAN; objects are split into 4 KB chunks, each indexed by a 20 B key in a large hash table (~32 GB) that points into an object store (~4 TB)]
– Need high-speed (~10 K/sec) lookups to sustain a 500 Mbps link
– Need high-speed (~10 K/sec) inserts and evictions
New data-intensive networked systems
• Other systems
– De-duplication in storage systems (e.g., Data Domain)
– CCN cache (Jacobson et al., CoNEXT 2009)
– DONA directory lookup (Koponen et al., SIGCOMM 2007)
Cost-effective large hash tables
CLAMs: Cheap Large CAMs (price statistics from 2008-09)

              Random reads/sec   Random writes/sec   Cost (128 GB)
  Disk        250                250                 $30+      → too slow
  DRAM        300K               300K                $120K+    → too expensive (2.5 ops/sec/$)
  Flash-SSD   10K*               5K*                 $225+     → slow writes

  * Derived from latencies on Intel M-18 SSD in experiments
How to deal with the slow writes of Flash SSDs
• New data structure “BufferHash” + Flash
• Key features
– Avoid random writes, and perform sequential writes in a batch
• Sequential writes are 2X faster than random writes (Intel SSD)
• Batched writes reduce the number of writes going to Flash
– Bloom filters for optimizing lookups
BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$
• Random writes are expensive – avoid random page writes
• Reads and writes happen at the granularity of a flash page
  – I/O smaller than a page should be avoided, if possible
Conventional hash table on Flash/SSD
• Keys are likely to hash to random locations in flash → random page writes
• SSDs: the FTL handles random writes to some extent, but garbage collection overhead is high
• Result: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required 10 K/s and 5 K/s
Conventional hash table on Flash/SSD
• Using DRAM as a cache in front of Flash does not help: we can't assume locality in requests, so hit rates are low
• Control the impact of random writes
• Maintain a small hash table (buffer) in memory
• As the in-memory buffer gets full, write it to flash in one batch
  – We call the in-flash copy of a buffer an incarnation of the buffer
[Figure: DRAM holds the buffer (an in-memory hash table); Flash holds incarnations (in-flash hash tables), tracked in an incarnation table from the latest (4) down to the oldest (1)]
Net hash table = buffer + all incarnations
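A minimal Python sketch of this buffering idea (illustrative only, not the paper's implementation; the "flash" incarnations are modeled as plain in-memory dicts, and the capacity is a made-up parameter):

class BufferedHashSketch:
    def __init__(self, buffer_capacity=1024):
        self.buffer = {}             # in-memory hash table (DRAM)
        self.incarnations = []       # in-flash hash tables, newest first
        self.buffer_capacity = buffer_capacity

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_capacity:
            # Flush the whole buffer in one sequential, batched write
            # instead of many random page writes.
            self.incarnations.insert(0, self.buffer)
            self.buffer = {}

    def lookup(self, key):
        if key in self.buffer:                  # check the DRAM buffer first
            return self.buffer[key]
        for incarnation in self.incarnations:   # then incarnations, newest to oldest
            if key in incarnation:
                return incarnation[key]
        return None

Note that every lookup may have to scan all incarnations – exactly the problem the next slides address with Bloom filters.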
Lookups are impacted due to buffers
• A lookup first checks the in-memory buffer, then the incarnations on flash (from the incarnation table)
• Multiple in-flash lookups per query. Can we limit them to only one?
Bloom filters for optimizing lookups
• Keep one in-memory Bloom filter per incarnation
• A lookup checks the buffer and then the Bloom filters; only incarnations whose filter matches are read from flash (false positives cause occasional wasted reads)
• Configure carefully! 2 GB of Bloom filters for 32 GB of Flash gives a false positive rate < 0.01
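As a sanity check on a configuration like this, the standard Bloom filter sizing formula m/n = -ln(p) / (ln 2)^2 gives the bits needed per entry for a target false positive rate p; the entry count below is a hypothetical placeholder, not a figure from the paper:

import math

target_fp = 0.01
bits_per_entry = -math.log(target_fp) / (math.log(2) ** 2)   # ≈ 9.6 bits per entry
num_hashes = round(bits_per_entry * math.log(2))             # ≈ 7 hash functions

entries = 1_000_000_000          # hypothetical entry count for a 32 GB flash hash table
filter_bytes = entries * bits_per_entry / 8
print(f"{bits_per_entry:.1f} bits/entry, k = {num_hashes}, "
      f"total filter size ≈ {filter_bytes / 2**30:.2f} GiB")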
Update key (naïve approach)
• Update the key's entry in place in its flash incarnation
• This requires expensive random writes – discard this naïve approach
Update key (BufferHash approach)
• Instead, insert the key with its new value into the in-memory buffer
• The old value remains in an older incarnation on flash
• Lookups check the latest incarnations first, so they return the new value
• Eviction policies may depend on application
– LRU, FIFO, Priority based eviction, etc.
• Two BufferHash primitives
– Full Discard: evict all items
• Naturally implements FIFO
– Partial Discard: retain a few items
• Priority-based eviction by retaining high-priority items (see the sketch after this list)
• BufferHash best suited for FIFO
– Incarnations arranged by age
– Other useful policies at some additional cost
• Details in paper
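A sketch of the two primitives, operating on a per-buffer incarnation list ordered newest first as in the earlier sketch (the keep_if priority callback is hypothetical; in the full design the incarnation's Bloom filter would also have to be rebuilt after a partial discard):

def full_discard(incarnations):
    # FIFO eviction: drop the oldest incarnation wholesale.
    if incarnations:
        incarnations.pop()

def partial_discard(incarnations, keep_if):
    # Priority-based eviction: rewrite the oldest incarnation,
    # retaining only the items the application still wants.
    if incarnations:
        oldest = incarnations[-1]
        incarnations[-1] = {k: v for k, v in oldest.items() if keep_if(k, v)}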
• A single buffer in DRAM
  – handles all operations and eviction policies
• High worst-case insert latency
  – A few seconds to flush a 1 GB buffer
  – New lookups stall during the flush
[Figure: one DRAM buffer with Bloom filters over an incarnation table (4, 3, 2, 1) on Flash]
• Partition buffers
  – Based on the first few bits of the key space
  – Size > page: avoid I/O smaller than a page
  – Size >= block: avoid random page writes
• Reduces worst-case latency
• Eviction policies apply per buffer
[Figure: DRAM holds one buffer per key-space partition (e.g., keys starting 0XXXXX and 1XXXXX); each partition has its own incarnation table (4, 3, 2, 1) on Flash]
• Multiple buffers in memory
• Multiple incarnations per buffer in flash
• One in-memory Bloom filter per incarnation
[Figure: DRAM holds Buffer 1 … Buffer K with their Bloom filters; Flash holds the incarnations of each buffer]
Net hash table = all buffers + all incarnations
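Putting the pieces together, here is a compact Python sketch of the full design: buffers partitioned by the high-order bits of the key hash, one Bloom filter per incarnation, and FIFO eviction by discarding the oldest incarnation. All class names and parameters (BufferHashSketch, num_partitions, buffer_capacity, max_incarnations, the Bloom sizing) are illustrative choices, and flash incarnations are again modeled as frozen in-memory dicts rather than real SSD writes:

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


class BufferHashSketch:
    def __init__(self, num_partitions=4, buffer_capacity=1024,
                 max_incarnations=8, bloom_bits_per_entry=10, bloom_hashes=7):
        self.num_partitions = num_partitions
        self.buffer_capacity = buffer_capacity
        self.max_incarnations = max_incarnations
        self.bloom_bits = bloom_bits_per_entry * buffer_capacity
        self.bloom_hashes = bloom_hashes
        # One in-memory buffer and one incarnation list (newest first) per partition.
        self.buffers = [dict() for _ in range(num_partitions)]
        self.incarnations = [[] for _ in range(num_partitions)]   # lists of (bloom, table)

    def _partition(self, key):
        # Partition by the high-order bits of the key hash (stands in for
        # "first few bits of the key space" on the slides).
        h = int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:8], "big")
        return h % self.num_partitions

    def insert(self, key, value):
        p = self._partition(key)
        self.buffers[p][key] = value
        if len(self.buffers[p]) >= self.buffer_capacity:
            self._flush(p)

    def _flush(self, p):
        # Write the full buffer to "flash" as a new incarnation in one
        # sequential batch, together with its Bloom filter.
        bloom = BloomFilter(self.bloom_bits, self.bloom_hashes)
        for key in self.buffers[p]:
            bloom.add(key)
        self.incarnations[p].insert(0, (bloom, dict(self.buffers[p])))
        self.buffers[p] = {}
        # FIFO eviction via full discard: drop the oldest incarnation.
        if len(self.incarnations[p]) > self.max_incarnations:
            self.incarnations[p].pop()

    def lookup(self, key):
        p = self._partition(key)
        if key in self.buffers[p]:                 # in-memory buffer first
            return self.buffers[p][key]
        for bloom, table in self.incarnations[p]:  # newest incarnation first
            if bloom.might_contain(key):           # usually at most one flash read
                if key in table:
                    return table[key]              # false positives fall through
        return None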
• Insertion latency
  – Worst case ∝ size of buffer
  – Average case is constant for buffer size > block size
• Lookup latency
  – Average case ∝ number of incarnations × false positive rate of the Bloom filters
Parameter tuning: total size of buffers
• Total size of buffers = B1 + B2 + … + BN
• Given fixed DRAM, how much should be allocated to buffers?
  – Total Bloom filter size = DRAM – total size of buffers
• Lookup cost ∝ #incarnations × false positive rate
  – #incarnations = flash size / total buffer size
  – False positive rate increases as the Bloom filters shrink
• Too small is not optimal; too large is not optimal either
  – Optimal total buffer size ≈ 2 × (SSD size / entry size)
What should be the size of an individual partitioned buffer (e.g., B1)?
• Affects worst-case insertion latency
• Adjusted according to application requirements (128 KB – 1 block)
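A small numeric illustration of this trade-off, using the cost model from the slide (lookup penalty ∝ #incarnations × false positive rate). The flash size, DRAM size, and entry size are hypothetical, and this simplified model omits factors from the paper's full analysis, so it is only meant to show the U-shape, not to reproduce the paper's exact optimum:

import math

FLASH_BYTES = 32 * 2**30      # hypothetical 32 GB flash hash table
DRAM_BYTES  = 4 * 2**30       # hypothetical 4 GB of DRAM to split
ENTRY_BYTES = 32              # hypothetical size of one hash table entry

entries_on_flash = FLASH_BYTES // ENTRY_BYTES

def expected_extra_reads(total_buffer_bytes):
    # Expected wasted flash reads per lookup: #incarnations * false positive rate.
    incarnations = FLASH_BYTES / total_buffer_bytes
    bloom_bits_per_entry = (DRAM_BYTES - total_buffer_bytes) * 8 / entries_on_flash
    false_positive = 0.5 ** (bloom_bits_per_entry * math.log(2))  # optimal-k Bloom filter
    return incarnations * false_positive

for mib in (64, 128, 256, 512, 1024, 2048):
    print(f"buffers = {mib:4d} MiB -> "
          f"{expected_extra_reads(mib * 2**20):.6f} extra flash reads/lookup")

Very small buffers mean many incarnations; very large buffers starve the Bloom filters of DRAM, so false positives rise. The sweet spot is in between.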
SILT: A Memory-Efficient,
High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen
Michael Kaminsky†
Carnegie Mellon University
†Intel Labs
2011-10-24
• Clients issue PUT(key, value), GET(key) → value, and DELETE(key) to a key-value store cluster
• E-commerce (Amazon)
• Web server acceleration (Memcached)
• Data deduplication indexes
• Photo storage (Facebook)
• SILT goal: use much less memory than previous systems while retaining high
performance.
Memory overhead = Index size per entry
• Ideally 0 (no memory overhead)
Read amplification = Flash reads per query
• Limits query throughput
• Ideally 1 (no wasted flash reads)
Write amplification = Flash writes per entry
• Limits insert throughput
• Also reduces flash life expectancy
• Must be small enough for flash to last a few years
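Expressed as code, the three metrics are just ratios of raw counters; the counter values below are made up so that the ratios land on the SILT numbers quoted later in the talk (0.7, 1.01, 5.4):

# Hypothetical raw counters over some measurement window
index_bytes_in_memory     = 700_000_000
entries_stored            = 1_000_000_000
flash_reads               = 101_000_000
queries_served            = 100_000_000
bytes_written_to_flash    = 5_400_000_000
bytes_of_entries_inserted = 1_000_000_000

memory_overhead     = index_bytes_in_memory / entries_stored               # 0.7 bytes/entry
read_amplification  = flash_reads / queries_served                         # 1.01 flash reads/query
write_amplification = bytes_written_to_flash / bytes_of_entries_inserted   # 5.4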
[Figure: landscape of existing key-value stores, plotting read amplification (flash reads per query) against memory overhead (bytes/entry) – SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS all sit away from the origin; SILT targets the still-empty low-overhead, low-amplification corner ("?")]
• Queries look up the stores in sequence (from new to old)
• Inserts only go to the log
• Data are moved between stores in the background
[Figure: in memory, the SILT Sorted Index (memory efficient), the SILT Filter, and the SILT Log Index (write friendly), each indexing a corresponding store on flash]
LogStore: No Control over Data Layout
• In memory: the SILT Log Index (6.5+ B/entry, vs. 48+ B/entry for a naive hashtable)
• On flash: a log; inserted entries are simply appended (older → newer)
• Memory overhead: 6.5+ bytes/entry; write amplification: 1
SortedStore: Space-Optimized Layout
• In memory: the SILT Sorted Index (0.4 B/entry)
• On flash: a sorted array
• Memory overhead: 0.4 bytes/entry; write amplification: high (need to bulk-insert to amortize the cost)
Combining SortedStore and LogStore
• The LogStore (SILT Log Index + on-flash log) is periodically merged into the SortedStore (SILT Sorted Index + on-flash sorted array)
Achieving both Low Memory Overhead and Low Write Amplification
• SortedStore: low memory overhead, but high write amplification
• LogStore: low write amplification, but high memory overhead
• SortedStore + LogStore together achieve both simultaneously:
  – Write amplification = 5.4 → ~3-year flash life
  – Memory overhead = 1.3 B/entry
  – With “HashStores”, memory overhead drops to 0.7 B/entry!
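A back-of-the-envelope calculation of how write amplification constrains flash life; the drive capacity and endurance rating below are hypothetical, not figures from the paper:

# How much insert traffic can the flash absorb over a 3-year life
# if every application byte written turns into 5.4 bytes on flash?
CAPACITY_BYTES   = 256 * 10**9     # hypothetical 256 GB drive
ENDURANCE_CYCLES = 10_000          # hypothetical rated program/erase cycles
WRITE_AMP        = 5.4
TARGET_YEARS     = 3

total_flash_write_budget = CAPACITY_BYTES * ENDURANCE_CYCLES
seconds = TARGET_YEARS * 365 * 24 * 3600
sustainable_insert_rate = total_flash_write_budget / (WRITE_AMP * seconds)   # bytes/sec
print(f"≈ {sustainable_insert_rate / 1e6:.1f} MB/s of inserts "
      f"(≈ {sustainable_insert_rate / 1024:.0f} 1-KB entries/sec)")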
<SortedStore> <HashStore> <LogStore>
• In memory: SILT Sorted Index / SILT Filter / SILT Log Index
• On flash: sorted array / hashtables / log
• The LogStore is converted into a HashStore ("Conversion"); HashStores are merged into the SortedStore ("Merge")
• Memory overhead: 0.7 bytes/entry; read amplification: 1.01; write amplification: 5.4
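The read path across the three stores can be sketched as below; the store interface (get/put returning a value, a delete marker, or None) and the TOMBSTONE sentinel are assumptions for illustration, not SILT's actual API:

TOMBSTONE = object()   # stand-in delete marker

class MultiStoreSketch:
    def __init__(self, log_store, hash_stores, sorted_store):
        self.log_store = log_store          # write-friendly, holds the newest data
        self.hash_stores = hash_stores      # list of converted stores, newest first
        self.sorted_store = sorted_store    # memory-efficient, holds the oldest data

    def get(self, key):
        # Queries consult the stores from newest to oldest and stop at the
        # first one that knows about the key (either a value or a delete marker).
        for store in [self.log_store, *self.hash_stores, self.sorted_store]:
            hit = store.get(key)
            if hit is not None:
                return None if hit is TOMBSTONE else hit
        return None

    def put(self, key, value):
        # Inserts only go to the log store; conversion to HashStores and
        # merging into the SortedStore happen in the background.
        self.log_store.put(key, value)

    def delete(self, key):
        self.log_store.put(key, TOMBSTONE)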
SILT Sorted Index: entropy-coded tries
• For the SortedStore
• Highly compressed (0.4 B/entry)
SILT Filter & Log Index: partial-key cuckoo hashing
• For the HashStore & LogStore
• Compact (2.2 & 6.5 B/entry)
• Very fast (> 1.8 M lookups/sec)
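A minimal sketch of partial-key cuckoo hashing in the spirit of the SILT filter/log index (the constants, hash choices, and 4-way bucket layout are illustrative, not the paper's exact format): each slot stores only a short tag plus a value standing in for the on-flash offset, and an entry's alternate bucket is derived from its tag alone, so entries can be displaced without re-reading the full key from flash. Because tags are partial keys, lookups can return false matches, which the real store resolves by verifying the full key on flash:

import hashlib, random

NUM_BUCKETS = 1 << 16        # power of two, so XOR keeps bucket indices in range
SLOTS_PER_BUCKET = 4         # 4-way associative buckets
TAG_BITS = 15
MAX_KICKS = 128

def _h(text):
    return int.from_bytes(hashlib.sha1(text.encode()).digest()[:8], "big")

def key_tag(key):
    return _h("tag:" + str(key)) & ((1 << TAG_BITS) - 1)

def primary_bucket(key):
    return _h("bkt:" + str(key)) % NUM_BUCKETS

def alt_bucket(bucket, tag):
    # Derived from the tag only; applying it twice returns the original bucket.
    return (bucket ^ _h("alt:" + str(tag))) % NUM_BUCKETS

class PartialKeyCuckoo:
    def __init__(self):
        self.buckets = [[] for _ in range(NUM_BUCKETS)]   # lists of (tag, offset)

    def insert(self, key, offset):
        tag, bucket = key_tag(key), primary_bucket(key)
        for _ in range(MAX_KICKS):
            for candidate in (bucket, alt_bucket(bucket, tag)):
                if len(self.buckets[candidate]) < SLOTS_PER_BUCKET:
                    self.buckets[candidate].append((tag, offset))
                    return True
            # Both candidate buckets are full: displace a random victim and
            # continue inserting it into its alternate bucket, which we can
            # compute from its tag without touching flash.
            slot = random.randrange(SLOTS_PER_BUCKET)
            (tag, offset), self.buckets[bucket][slot] = \
                self.buckets[bucket][slot], (tag, offset)
            bucket = alt_bucket(bucket, tag)
        return False   # table too full; the real store would convert/flush it

    def lookup(self, key):
        tag, bucket = key_tag(key), primary_bucket(key)
        for candidate in (bucket, alt_bucket(bucket, tag)):
            for slot_tag, offset in self.buckets[candidate]:
                if slot_tag == tag:
                    return offset   # possibly a false match; verify the key on flash
        return None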
[Figure: the same read amplification vs. memory overhead (bytes/entry) landscape as before, now with SILT occupying the low-overhead, low-amplification corner alongside SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS]
• Background and motivation
• Our CLAM design
– Key operations (insert, lookup, update)
– Eviction
– Latency analysis and performance tuning
• Evaluation
• Configuration
– 4 GB DRAM, 32 GB Intel SSD, Transcend SSD
– 2 GB buffers, 2 GB bloom filters, 0.01 false positive rate
– FIFO eviction policy
• WAN optimizer workload
– Random key lookups followed by inserts
– 40% lookup hit rate
– Also used workloads from real packet traces
• Comparison with BerkeleyDB (traditional hash table) on Intel SSD
Average latency    BufferHash    BerkeleyDB
Lookup (ms)        0.06          4.6          → better lookups!
Insert (ms)        0.006         4.8          → better inserts!
[Figure: CDF of insert latency (ms) on the Intel SSD – BufferHash: 99% of inserts < 0.1 ms (buffering effect); BerkeleyDB: 40% of inserts > 5 ms (random writes are slow!)]
[Figure: CDF of lookup latency (ms) for the 40% hit workload – BufferHash: 99% of lookups < 0.2 ms; BerkeleyDB: 40% of lookups > 5 ms due to garbage collection overhead from writes]
• 16K lookups/sec and 160K inserts/sec
• Overall cost of $400
• 42 lookups/sec/$ and 420 inserts/sec/$
– Orders of magnitude better than the 2.5 ops/sec/$ of DRAM-based hash tables
• Varying fractions of lookups
• Results on the Transcend SSD

Average latency per operation
Lookup fraction    BufferHash    BerkeleyDB
0                  0.007 ms      18.4 ms
0.5                0.09 ms       10.3 ms
1                  0.12 ms       0.3 ms
• BufferHash is ideally suited for write-intensive workloads
• BufferHash performs orders of magnitude better in ops/sec/$ compared to traditional hashtables on DRAM (and disks)
• BufferHash is best suited for a FIFO eviction policy
  – Other policies can be supported at additional cost; details in paper
• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps with BerkeleyDB
  – Details in paper
• FAWN (Vasudevan et al., SOSP 2009)
– Cluster of wimpy nodes with flash storage
– Each wimpy node has its hash table in DRAM
– We target…
• Hash table much bigger than DRAM
• Low latency as well as high throughput systems
• HashCache (Badam et al., NSDI 2009)
– In-memory hash table for objects stored on disk
• With BerkeleyDB, throughput up to 10 Mbps
• With BufferHash, throughput up to 200 Mbps with Transcend SSD
– 500 Mbps with Intel SSD
• At 10 Mbps, average throughput per object improves by 65% with BufferHash
1. Various combinations of indexing schemes
2. Background operations (merge/conversion)
3. Query latency
Experiment Setup
  CPU             2.80 GHz (4 cores)
  Flash drive     SATA 256 GB (48 K random 1024-byte reads/sec)
  Workload size   20-byte keys, 1000-byte values, ≥ 50 M keys
  Query pattern   Uniformly distributed (worst case for SILT)
[Result figures omitted: index-combination experiments under a workload of 90% GET (50-100 M keys) + 10% PUT (50 M keys); LogStore + SortedStore alone still uses much memory]
Small Impact from Background Operations
[Figure: GET throughput over time under 90% GET (100+ M keys) + 10% PUT – throughput stays in the ~33-40 K ops/sec range; background merge/conversion has little impact, though bursty TRIM by the ext4 file system causes dips ("Oops!")]
[Figure: query latency for a 100% GET workload (100 M keys) as a function of the number of I/O threads – best throughput at 16 threads; median latency = 330 μs, 99.9th percentile = 1510 μs]
• SILT provides a memory-efficient, high-performance key-value store
– Multi-store approach
– Entropy-coded tries
– Partial-key cuckoo hashing
• Full source code is available
– https://github.com/silt/silt