Cache Craftiness for Fast
Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler
(Harvard), Robert Morris (MIT)
Let’s build a fast key-value store
• KV store systems are important
– Google Bigtable, Amazon Dynamo, Yahoo! PNUTS
• Single-server KV performance matters
– Reduce cost
– Easier management
• Goal: fast KV store for single multi-core server
– Assume all data fits in memory
– e.g. Redis, VoltDB
Feature wish list
• Clients send queries over network
• Persist data across crashes
• Range query
• Perform well on various workloads
– Including hard ones!
Hard workloads
• Skewed key popularity
– Hard! (Load imbalance)
• Small key-value pairs
– Hard!
• Many puts
– Hard!
• Arbitrary keys
– String (e.g. www.wikipedia.org/...) or integer
– Hard!
First try: fast binary tree
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions)
• Network/disk are not bottlenecks
– High-bandwidth NIC
– Multiple disks
• 3.7 million queries/second!
• Can we do better? What bottleneck remains?
• DRAM!
Cache craftiness goes 1.5X farther
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. Masstree)
Cache-craftiness: careful use of cache and memory
Contributions
• Masstree achieves millions of queries per second
across various hard workloads
– Skewed key popularity
– Various read/write ratios
– Variable, relatively long keys
– Data >> on-chip cache
• New ideas
– Trie of B+ trees, permuter, etc.
• Full system
– New ideas + best practices (network, disk, etc.)
Experiment environment
• A 16-core server
– Three active DRAM nodes
• Single 10Gb Network Interface Card (NIC)
• Four SSDs
• 64 GB DRAM
• A cluster of load generators
Potential bottlenecks in Masstree
(Diagram: clients send queries over the network to a single multi-core server; data lives in DRAM; logs are written to disk)
NIC bottleneck can be avoided
• Single 10Gb NIC
– Multiple queues, scales to many cores
– Target: 100B KV pairs => 10M req/sec
• Use network stack efficiently
– Pipeline requests
– Avoid copying cost
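As a rough illustration (not Masstree's actual network code), batching serialized requests into one scatter/gather write avoids both per-request syscalls and user-space copies:

```cpp
// Sketch: pipeline a batch of serialized requests into a single writev()
// call instead of one write() per request. Illustrative only.
#include <sys/types.h>
#include <sys/uio.h>   // writev, struct iovec
#include <string>
#include <vector>

ssize_t send_batch(int fd, const std::vector<std::string>& requests) {
    std::vector<iovec> iov;
    iov.reserve(requests.size());
    for (const std::string& r : requests)
        iov.push_back({const_cast<char*>(r.data()), r.size()});
    // One syscall for the whole batch; the kernel gathers the buffers,
    // so we never concatenate (copy) them in user space.
    return writev(fd, iov.data(), static_cast<int>(iov.size()));
}
```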
Disk bottleneck can be avoided
• 10M puts/sec => 1GB of log/sec!
• Single disk options:
– Mainstream disk: 100-300 MB/sec write throughput, ~1 $/GB
– High-performance SSD: up to 4.4 GB/sec write throughput, > 40 $/GB
• Multiple disks: split log
– See paper for details
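A minimal sketch of the split-log idea, under the simplifying assumption that each core appends to its own log file, spread round-robin across the disks (the paper's actual design is more involved):

```cpp
// Sketch: spread logging bandwidth over several disks by giving each core
// its own log file, assigned round-robin across per-disk directories.
#include <cstdio>
#include <string>
#include <vector>

class SplitLog {
    std::vector<FILE*> logs_;  // one log file per core, spread over disks
public:
    SplitLog(const std::vector<std::string>& disk_dirs, int ncores) {
        for (int c = 0; c < ncores; ++c) {
            const std::string& dir = disk_dirs[c % disk_dirs.size()];
            logs_.push_back(std::fopen((dir + "/log." + std::to_string(c)).c_str(), "ab"));
        }
    }
    void append(int core, const void* rec, size_t len) {
        std::fwrite(rec, 1, len, logs_[core]);  // each core writes only its own log
    }
};
```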
DRAM bottleneck – hard to avoid
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. Masstree)
Cache-craftiness goes 1.5X farther, including the cost of network and disk.
DRAM bottleneck – w/o network/disk
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
Cache-craftiness goes 1.7X farther!
DRAM latency – binary tree
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. VoltDB. Diagram: a binary tree lookup walks one node per level.)
• O(log2 N) serial DRAM latencies!
• 10M keys => 2.7 us/lookup => 380K lookups/core/sec (see the check below)
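A back-of-the-envelope check of these numbers, assuming roughly 115 ns per dependent DRAM access (a typical figure; the latency itself is not stated on the slide):

```latex
\log_2 10^7 \approx 23.3 \ \text{levels}, \qquad
23.3 \times 115\,\text{ns} \approx 2.7\,\mu\text{s/lookup}, \qquad
\frac{1}{2.7\,\mu\text{s}} \approx 370\text{K lookups/core/sec}
```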
DRAM latency – Lock-free 4-way tree
• Concurrency: same as binary tree
• One cache line per node => 3 KV / 4 children
(Diagram: a 4-way tree node with keys X, Y, Z and four children, packed into one cache line)
• Half the levels of the binary tree => half the serial DRAM latencies (see the sketch below)
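A sketch of what such a node might look like, assuming 64-byte cache lines and 8-byte keys and pointers (layout and names are illustrative, not Masstree's code):

```cpp
// Sketch of a lock-free 4-tree node packed into one 64-byte cache line.
#include <cstdint>

struct alignas(64) FourTreeNode {
    uint64_t key[3];           // up to 3 keys (24 bytes)
    FourTreeNode* child[4];    // 4 children (32 bytes); in leaves these
                               // slots can hold value pointers instead
    uint8_t nkeys;             // how many key slots are in use
};                             // 57 bytes of payload, padded to 64

static_assert(sizeof(FourTreeNode) == 64, "one node = one cache line");
```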
4-tree beats binary tree by 40%
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
4-tree may perform terribly!
• Unbalanced: O(N) serial DRAM latencies
– e.g. sequential inserts
(Diagram: sequential inserts A, B, C, … produce a degenerate tree with O(N) levels)
• Want a balanced tree with wide fanout
B+tree – Wide and balanced
• Balanced!
• Concurrent main memory B+tree [OLFIT]
– Optimistic concurrency control: version technique
– Lookup/scan is lock-free
– Puts hold ≤ 3 per-node locks
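A minimal sketch of the version technique used by OLFIT-style trees: writers bump a per-node version, and lock-free readers retry if the version changed underneath them (simplified; memory-ordering details are assumptions):

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> version;  // low bit set while a writer is updating
    // ... keys, children ...
};

// Wait out in-progress writers and return a stable version.
uint64_t stable_version(const Node* n) {
    uint64_t v;
    do { v = n->version.load(std::memory_order_acquire); } while (v & 1);
    return v;
}

// Lock-free read: run `body`, then retry if the node changed meanwhile.
template <typename F>
auto optimistic_read(const Node* n, F body) {
    for (;;) {
        uint64_t v = stable_version(n);
        auto result = body(n);  // read keys/children without locks
        std::atomic_thread_fence(std::memory_order_acquire);
        if (n->version.load(std::memory_order_relaxed) == v)
            return result;      // no concurrent writer: result is consistent
        // otherwise a writer intervened: retry
    }
}
```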
Wide fanout B+tree is 11% slower!
140M short KV, put-only
Fanout = 15: fewer levels than the 4-tree, but:
• # cache lines fetched from DRAM >= 4-tree
– 4-tree: each internal node is full
– B+tree: nodes are only ~75% full
• Serial DRAM latencies >= 4-tree
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
B+tree – Software prefetch
• Same as [pB+-trees]
(Diagram: fetching 4 prefetched cache lines costs about the same as fetching 1 line)
• Masstree: B+tree w/ fanout 15 => 4 cache lines
• Always prefetch whole node when accessed
• Result: one DRAM latency per node vs. 2, 3, or 4
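A sketch of the whole-node prefetch, using the GCC/Clang __builtin_prefetch intrinsic (the node layout is an assumption):

```cpp
// Sketch: prefetch every cache line of a 4-line B+tree node before use,
// so the lines are fetched in parallel: one DRAM latency instead of up to 4.
constexpr int kCacheLine = 64;
constexpr int kNodeLines = 4;   // a fanout-15 node spans 4 cache lines

inline void prefetch_node(const void* node) {
    const char* p = static_cast<const char*>(node);
    for (int i = 0; i < kNodeLines; ++i)
        __builtin_prefetch(p + i * kCacheLine);
}
```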
B+tree with prefetch
140M short KV, put-only, @16 cores
Beats 4-tree by 9%
Balanced beats unbalanced!
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
Concurrent B+tree problem
• Lookups retry in case of a concurrent insert
(Diagram: insert(B) into node [A C D] shifts keys to make room; a concurrent lookup can observe the intermediate state)
• Lock-free 4-tree: not a problem
– Keys do not move around
– But unbalanced
B+tree optimization – Permuter
• Keys stored unsorted, define order in tree nodes
• Permuter: a 64-bit integer encoding the sorted order of the keys in a node
(Diagram: insert(B) writes B into a free slot, then atomically changes the permuter from "0 1 2" to "0 3 1 2")
• A concurrent lookup does not need to retry
– Lookup uses permuter to search keys
– Insert appears atomic to lookups
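A sketch of one plausible permuter encoding: 4 bits of count plus fifteen 4-bit slot indices in a single 64-bit word, so an insert publishes the new order with one atomic store (the field layout is an assumption, not Masstree's exact format):

```cpp
#include <cstdint>

struct Permuter {
    uint64_t x;  // bits 0-3: key count; bits 4*(i+1)..4*(i+2): slot of i-th key

    int size() const { return int(x & 0xF); }

    // Which slot holds the i-th key in sorted order.
    int slot(int i) const { return int((x >> (4 * (i + 1))) & 0xF); }

    // New permuter with free slot `s` inserted at sorted position `i`.
    // Stored in a std::atomic<uint64_t> in practice, so concurrent lookups
    // see either the old or the new order: the insert appears atomic.
    Permuter insert(int i, int s) const {
        uint64_t below = x & ((uint64_t(1) << (4 * (i + 1))) - 1); // count + slots < i
        uint64_t above = x - below;                                // slots >= i
        return Permuter{(above << 4)                               // shift them up one
                        | (uint64_t(s) << (4 * (i + 1)))           // place s at i
                        | (below + 1)};                            // bump the count
    }
};
```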
B+tree with permuter
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree. Permuter improves throughput by 4%.)
Performance drops dramatically
when key length increases
Short values, 50% updates, @16 cores, no logging
(Chart: throughput in req/sec, millions, vs. key length from 8B to 48B; keys differ in the last 8B)
Why? The B+tree stores key suffixes indirectly, so each key comparison:
• compares the full key
• costs an extra DRAM fetch
Masstree – Trie of B+trees
• Trie: a tree where each level is indexed by a fixed-length key fragment
• Masstree: a trie with fanout 2^64, where each trie node is a B+tree
(Diagram: a B+tree indexed by k[0:7] points to B+trees indexed by k[8:15], then k[16:23], …)
• Compresses key prefixes! (see the sketch below)
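A sketch of the key slicing this implies, with a hypothetical helper that extracts the 8-byte fragment a given trie level indexes on:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Fragment for trie level `depth`: bytes [8*depth, 8*depth+8) of the key,
// big-endian so integer comparison matches lexicographic byte order.
uint64_t key_fragment(const std::string& key, int depth) {
    uint8_t buf[8] = {0};  // short fragments are zero-padded
    size_t off = size_t(depth) * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off, std::min<size_t>(8, key.size() - off));
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];
    return v;  // each trie level's B+tree compares only this 8-byte integer
}
```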
Case Study: Keys share P byte prefix –
Better than single B+tree
• Masstree: P/8 trie levels, each containing only one node
• Below them, a single B+tree with 8B keys

               Complexity    DRAM accesses
Masstree       O(log N)      O(log N)
Single B+tree  O(P log N)    O(P log N)
Masstree performs better
for long keys with prefixes
Short values, 50% updates, @16 cores, no logging
(Chart: throughput in req/sec, millions, vs. key length from 8B to 48B; Masstree vs. B+tree)
Masstree: 8B key comparisons vs. full key comparisons per node.
Does trie of B+trees hurt
short key performance?
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
No: 8% faster! More efficient code, since internal nodes handle 8B keys only.
Evaluation
• How does Masstree compare to other systems?
• How does Masstree compare to partitioned trees?
– How much do we pay for handling skewed workloads?
• How does Masstree compare with a hash table?
– How much do we pay for supporting range queries?
• Does Masstree scale on many cores?
Masstree performs well even with
persistence and range queries
20M short KV, uniform dist., read-only, @16 cores, w/ network
(Chart: throughput in req/sec, millions; MongoDB at 0.04, VoltDB at 0.22, then Redis, Memcached, Masstree)
• Memcached: not persistent and no range queries
• Redis: no range queries
• Unfair to MongoDB and VoltDB: both have a richer data and query model
Multi-core – Partition among cores?
• Multiple instances, one unique set of keys per inst.
– Memcached, Redis, VoltDB
(Diagram: per-core partitions, each core owning a disjoint set of keys)
• Masstree: a single shared tree
– each core can access all keys
– reduced imbalance
(Diagram: one shared tree; every core can reach every key)
A single Masstree performs
better for skewed workloads
140M short KV, read-only, @16 cores, w/ network
(Chart: throughput in req/sec, millions, vs. skew δ from 0 to 9, where one partition receives δ times more queries; a single Masstree vs. 16 partitioned Masstrees)
• Partitioned trees win at low δ: no remote DRAM access, no concurrency control
• Under high skew the partitioned system is ~80% idle: 1 partition receives 40% of the queries, the other 15 receive 4% each
Cost of supporting range queries
• Without range queries, one can use a hash table (see the sketch below)
– No resize cost: pre-allocate a large hash table
– Lock-free: update with cmpxchg
– Supports only 8B keys: efficient code
– 30% full, each lookup = 1.1 hash probes
• Measured in the Masstree framework
– 2.5X the throughput of Masstree
• So supporting range queries costs 2.5X in performance
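A minimal sketch of such a table: pre-allocated, 8B keys, lock-free puts via compare-and-swap with linear probing (a stand-in for the measured table, whose details the slide doesn't give):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct Slot { std::atomic<uint64_t> key; std::atomic<uint64_t> value; };

class FixedHashTable {
    std::vector<Slot> slots_;  // pre-allocated once: no resizing ever
public:
    explicit FixedHashTable(size_t n) : slots_(n) {}

    bool put(uint64_t key, uint64_t value) {  // key 0 is reserved as "empty"
        size_t i = key % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            uint64_t expected = 0;
            // Claim an empty slot with cmpxchg, or update a matching key
            // (racy value updates are last-writer-wins, fine for a sketch).
            if (slots_[i].key.compare_exchange_strong(expected, key) ||
                expected == key) {
                slots_[i].value.store(value);
                return true;
            }
            i = (i + 1) % slots_.size();  // linear probing
        }
        return false;  // table full
    }
};
```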
Scale to 12X on 16 cores
Short KV, w/o logging
(Chart: per-core throughput in req/sec, millions, vs. number of cores: 1, 2, 4, 8, 16; Get curve vs. perfect scalability)
• Scales to 12X on 16 cores
• Put scales similarly
• Limited by the shared memory system
Related work
• [OLFIT]: optimistic concurrency control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: store fixed # of diff. bits inline
• [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT]
• Masstree: the first system to combine them, with new optimizations
– Trie of B+trees, permuter
Summary
• Masstree: a general-purpose, high-performance, persistent KV store
• 5.8 million puts/sec, 8 million gets/sec
– More comparisons with other systems in paper
• Using cache-craftiness improves performance
by 1.5X
Thank you!