Cache Craftiness for Fast Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler (Harvard), Robert Morris (MIT)

Let's build a fast key-value store
• KV store systems are important
  – Google Bigtable, Amazon Dynamo, Yahoo! PNUTS
• Single-server KV performance matters
  – Reduces cost
  – Easier management
• Goal: a fast KV store for a single multi-core server
  – Assume all data fits in memory
  – Like Redis, VoltDB

Feature wish list
• Clients send queries over the network
• Persist data across crashes
• Range queries
• Perform well on various workloads
  – Including hard ones!

Hard workloads
• Skewed key popularity – hard! (load imbalance)
• Small key-value pairs – hard!
• Many puts – hard!
• Arbitrary keys – strings (e.g. www.wikipedia.org/...) or integers – hard!

First try: a fast binary tree
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
• Network and disk are not bottlenecks
  – High-bandwidth NIC
  – Multiple disks
• 3.7 million queries/second!
• Can we do better? What bottleneck remains?
• DRAM!

Cache craftiness goes 1.5X farther
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores — Binary vs. Masstree]
• Cache-craftiness: careful use of cache and memory

Contributions
• Masstree achieves millions of queries per second across various hard workloads
  – Skewed key popularity
  – Various read/write ratios
  – Variable, relatively long keys
  – Data >> on-chip cache
• New ideas
  – Trie of B+trees, permuter, etc.
• Full system
  – New ideas + best practices (network, disk, etc.)

Experiment environment
• A 16-core server
  – Three active DRAM nodes
• A single 10Gb Network Interface Card (NIC)
• Four SSDs
• 64 GB DRAM
• A cluster of load generators

Potential bottlenecks in Masstree
[Diagram: a single multi-core server — potential bottlenecks at the network, DRAM, and disk (logging)]

NIC bottleneck can be avoided
• Single 10Gb NIC
  – Multiple queues, scales to many cores
  – Target: 100B KV pairs => 10M req/sec
• Use the network stack efficiently
  – Pipeline requests
  – Avoid copying costs

Disk bottleneck can be avoided
• 10M puts/sec => 1GB of log/sec!
• Single disk:
                           Write throughput    Cost
    Mainstream disk        100-300 MB/sec      1 $/GB
    High-performance SSD   up to 4.4 GB/sec    > 40 $/GB
• Multiple disks: split the log
  – See the paper for details

DRAM bottleneck – hard to avoid
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores — Binary vs. Masstree]
• Cache-craftiness goes 1.5X farther, including the cost of:
  – Network
  – Disk

DRAM bottleneck – w/o network/disk
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores — Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree]
• Cache-craftiness goes 1.7X farther!

DRAM latency – binary tree
[Chart: 140M short KV, put-only, @16 cores — binary tree (VoltDB)]
• O(log2 N) serial DRAM latencies!
• 10M keys => 2.7 µs/lookup => 380K lookups/core/sec

DRAM latency – lock-free 4-way tree
• Concurrency: same technique as the binary tree
• One cache line per node => 3 KV pairs / 4 children (a node-layout sketch follows, after the next chart)
• Half the levels of the binary tree
• Half the DRAM latencies of the binary tree

4-tree beats binary tree by 40%
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores — Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree]
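A minimal sketch of the 4-tree node idea described above (not Masstree's actual code): pack 3 keys and 4 child pointers into a single 64-byte cache line, so each level of the tree costs at most one DRAM fetch. The field names and the use of 8-byte integer keys are assumptions for illustration.

    #include <cstdint>

    // One 4-tree node sized and aligned to a single 64-byte cache line:
    // 3 keys (24 B) + 4 child pointers (32 B) + key count (8 B) = 64 B.
    struct alignas(64) FourTreeNode {
        uint64_t keys[3];          // 8-byte integer keys, kept sorted
        FourTreeNode* child[4];    // child[i] covers keys < keys[i]; child[nkeys] covers the rest
        uint64_t nkeys;            // number of keys currently stored (0..3)
    };
    static_assert(sizeof(FourTreeNode) == 64, "node must fit in one cache line");

    // Descend one level: pick the child whose key range covers k.
    inline FourTreeNode* descend(const FourTreeNode* n, uint64_t k) {
        unsigned i = 0;
        while (i < n->nkeys && k >= n->keys[i])
            ++i;
        return n->child[i];
    }

A real lock-free version would also need per-node version or atomic fields for concurrent readers; they are omitted here to keep the illustrated layout within one cache line.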
4-tree may perform terribly!
• Unbalanced: O(N) serial DRAM latencies
  – e.g. sequential inserts (A, B, C, D, E, F, G, H, I, …) produce O(N) levels!
• Want a balanced tree with wide fanout

B+tree – wide and balanced
• Balanced!
• Concurrent main-memory B+tree [OLFIT]
  – Optimistic concurrency control: version technique
  – Lookup/scan is lock-free
  – Puts hold ≤ 3 per-node locks

Wide-fanout B+tree is 11% slower!
[Chart: throughput (req/sec, millions), 140M short KV, put-only — Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree]
• Fanout = 15, so fewer levels than the 4-tree, but:
  – Cache lines fetched from DRAM >= the 4-tree
    • 4-tree: each internal node is full
    • B+tree: nodes are ~75% full
  – Serial DRAM latencies >= the 4-tree

B+tree – software prefetch
• Same technique as [pB+-trees]
• Masstree: B+tree with fanout 15 => 4 cache lines per node
• Always prefetch the whole node when it is accessed (fetching 4 lines costs about the same as fetching 1)
• Result: one DRAM latency per node instead of 2, 3, or 4

B+tree with prefetch
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
• Beats the 4-tree by 9%
• Balanced beats unbalanced!

Concurrent B+tree problem
• Lookups must retry when there is a concurrent insert
  – insert(B) into a node holding A, C, D shifts keys to make room, exposing an intermediate state to lock-free readers
• The lock-free 4-tree does not have this problem
  – Keys do not move around
  – But it is unbalanced

B+tree optimization – permuter
• Keys are stored unsorted; each node's permuter defines their order (a code sketch follows, after the next chart)
  – Permuter: a 64-bit integer, e.g. keys A, C, D with order 0,1,2; insert(B) stores B in a free slot and changes the order to 0,3,1,2
• A concurrent lookup does not need to retry
  – Lookups use the permuter to search keys
  – An insert appears atomic to lookups

B+tree with permuter
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
• Improves throughput by 4%
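A sketch of the permuter idea from the previous slides (illustrative encoding, not Masstree's exact one): the low 4 bits hold the key count, and the 4-bit field at position i gives the slot holding the i-th key in sorted order. Keys never move between slots; only this word changes, and it is published with one atomic store, so a lock-free lookup sees either the old ordering or the new one, never a half-updated node.

    #include <atomic>
    #include <cstdint>

    struct Permuter {
        uint64_t x;

        unsigned size() const { return x & 0xf; }

        // Slot index of the i-th key in sorted order (i < size()).
        unsigned slot(unsigned i) const { return (x >> (4 + 4 * i)) & 0xf; }

        // New permuter with slot s spliced in at sorted position i.
        // Assumes the node is not full; tracking free slots is left to the caller.
        Permuter inserted(unsigned i, unsigned s) const {
            uint64_t order = x >> 4;                        // 15 4-bit position fields
            uint64_t mask  = (uint64_t(1) << (4 * i)) - 1;
            uint64_t low   = order & mask;                  // positions 0 .. i-1
            uint64_t high  = order & ~mask;                 // positions i and above
            uint64_t neworder = low | (uint64_t(s) << (4 * i)) | (high << 4);
            return Permuter{(neworder << 4) | (size() + 1)};
        }
    };

    // Publishing an insert: write the key/value into free slot s first, then make
    // the new ordering visible to readers with one release store of the permuter.
    inline void publish_insert(std::atomic<uint64_t>& node_permuter, Permuter p,
                               unsigned sorted_pos, unsigned s) {
        node_permuter.store(p.inserted(sorted_pos, s).x, std::memory_order_release);
    }

For the slide's example: starting from keys A, C, D in slots 0, 1, 2 (order 0,1,2), writing B into slot 3 and then applying inserted(1, 3) yields order 0,3,1,2.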
Performance drops dramatically when key length increases
[Chart: throughput (req/sec, millions) vs. key length (8–48 bytes); short values, 50% updates, @16 cores, no logging; keys differ in the last 8 bytes]
• Why? The B+tree stores key suffixes indirectly, so each key comparison:
  – compares the full key
  – costs an extra DRAM fetch

Masstree – trie of B+trees
• Trie: a tree in which each level is indexed by a fixed-length key fragment
• Masstree: a trie with fanout 2^64, where each trie node is a B+tree
  – Layer 0 is a B+tree indexed by k[0:7], layer 1 by k[8:15], layer 2 by k[16:23], …
• Compresses key prefixes! (a key-slicing sketch appears in the backup at the end of the deck)

Case study: keys share a P-byte prefix
• Better than a single B+tree
  – P/8 trie levels, each containing only one node (a single B+tree over 8B key slices)

                     Complexity     DRAM accesses
    Masstree         O(log N)       O(log N)
    Single B+tree    O(P log N)     O(P log N)

Masstree performs better for long keys with prefixes
[Chart: throughput (req/sec, millions) vs. key length (8–48 bytes); short values, 50% updates, @16 cores, no logging — Masstree vs. B+tree]
• 8B key-slice comparisons vs. full-key comparisons

Does the trie of B+trees hurt short-key performance?
[Chart: throughput (req/sec, millions), 140M short KV, put-only, @16 cores]
• No: 8% faster!
• More efficient code – internal nodes handle 8B keys only

Evaluation
• How does Masstree compare to other systems?
• How does Masstree compare to partitioned trees?
  – How much do we pay for handling skewed workloads?
• How does Masstree compare with a hash table?
  – How much do we pay for supporting range queries?
• Does Masstree scale on many cores?

Masstree performs well even with persistence and range queries
[Chart: throughput (req/sec, millions), 20M short KV, uniform distribution, read-only, @16 cores, w/ network — MongoDB (0.04), VoltDB (0.22), Redis, Memcached, Masstree]
• Memcached: not persistent and no range queries
• Redis: no range queries
• Unfair comparison: MongoDB and VoltDB have a richer data and query model

Multi-core – partition among cores?
• Multiple instances, one unique set of keys per instance
  – Memcached, Redis, VoltDB
• Masstree: a single shared tree
  – Each core can access all keys
  – Reduced load imbalance

A single Masstree performs better for skewed workloads
[Chart: throughput (req/sec, millions) vs. skew δ, where one partition receives δ times more queries (δ = 0–9); 140M short KV, read-only, @16 cores, w/ network — Masstree vs. 16 partitioned Masstrees]
• Partitioned trees: no remote DRAM access, no concurrency control
• But under skew, partitioning leaves cores idle: ~80% idle time when one partition receives 40% of the queries and the other 15 receive 4% each

Cost of supporting range queries
• Without range queries, one could use a hash table
  – No resize cost: pre-allocate a large hash table
  – Lock-free: update with cmpxchg
  – Supports only 8B keys: efficient code
  – 30% full, so each lookup ≈ 1.1 hash probes
• Measured in the Masstree framework
  – 2.5X the throughput of Masstree
• So range queries cost 2.5X in performance

Scales to 12X on 16 cores
[Chart: throughput (req/sec/core, millions) vs. number of cores (1, 2, 4, 8, 16); short KV, w/o logging; Get curve with a perfect-scalability reference]
• Gets scale to 12X
• Puts scale similarly
• Limited by the shared memory system

Related work
• [OLFIT]: optimistic concurrency control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: stores a fixed number of differing key bits inline
• [PALM]: lock-free B+tree, 2.3X the throughput of [OLFIT]
• Masstree: the first system to combine these techniques, with new optimizations
  – Trie of B+trees, permuter

Summary
• Masstree: a general-purpose, high-performance, persistent KV store
• 5.8 million puts/sec, 8 million gets/sec
  – More comparisons with other systems in the paper
• Cache-craftiness improves performance by 1.5X

Thank you!
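Backup: a minimal sketch of how a variable-length key could be cut into 8-byte slices that index successive trie layers, as on the "trie of B+trees" slide. This is illustrative only; Masstree's real encoding also records the slice length, which is omitted here.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <string>

    // Return the 8-byte slice of `key` that indexes trie layer `layer`
    // (layer 0 uses key bytes 0-7, layer 1 uses bytes 8-15, and so on).
    // Bytes are packed big-endian so that comparing slices as integers matches
    // comparing the corresponding key bytes lexicographically; short keys are
    // zero-padded.
    uint64_t key_slice(const std::string& key, unsigned layer) {
        unsigned char buf[8] = {};
        size_t off = size_t(layer) * 8;
        if (off < key.size())
            std::memcpy(buf, key.data() + off, std::min(key.size() - off, size_t(8)));
        uint64_t slice = 0;
        for (int i = 0; i < 8; ++i)
            slice = (slice << 8) | buf[i];
        return slice;
    }

A lookup would compute key_slice(key, 0), search the root B+tree with it as an 8-byte key, and only when longer keys collide on that slice descend into the next-layer B+tree with key_slice(key, 1), and so on; keys sharing a P-byte prefix therefore pass through P/8 thin trie layers before reaching a B+tree that distinguishes them.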