Cache Craftiness for Fast
Multicore Key-Value Storage
Yandong Mao (MIT), Eddie Kohler
(Harvard), Robert Morris (MIT)
Let’s build a fast key-value store
• KV store systems are important
– Google Bigtable, Amazon Dynamo, Yahoo! PNUTS
• Single-server KV performance matters
– Reduce cost
– Easier management
• Goal: fast KV store for single multi-core server
– Assume all data fits in memory
– e.g. Redis, VoltDB
Feature wish list
• Clients send queries over network
• Persist data across crashes
• Range query
• Perform well on various workloads
– Including hard ones!
Hard workloads
• Skewed key popularity
– Hard! (Load imbalance)
• Small key-value pairs
– Hard!
• Many puts
– Hard!
• Arbitrary keys
– String (e.g. www.wikipedia.org/...) or integer
– Hard!
First try: fast binary tree
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions)
• Network/disk are not bottlenecks
– High-bandwidth NIC
– Multiple disks
• 3.7 million queries/second!
• Can we do better? What bottleneck remains?
• DRAM!
Cache craftiness goes 1.5X farther
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. Masstree)
Cache-craftiness: careful use of cache and memory
Contributions
• Masstree achieves millions of queries per second
across various hard workloads
– Skewed key popularity
– Various read/write ratios
– Variable, relatively long keys
– Data >> on-chip cache
• New ideas
– Trie of B+ trees, permuter, etc.
• Full system
– New ideas + best practices (network, disk, etc.)
Experiment environment
• A 16-core server
– Three active DRAM nodes
• Single 10Gb Network Interface Card (NIC)
• Four SSDs
• 64 GB DRAM
• A cluster of load generators
Potential bottlenecks in Masstree
(Diagram: clients send queries over the network to a single multi-core server; data lives in DRAM; logs are written to disk)
NIC bottleneck can be avoided
• Single 10Gb NIC
– Multiple queues, scales to many cores
– Target: 100B KV pairs => 10M req/sec
• Use network stack efficiently
– Pipeline requests
– Avoid copying cost
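As a rough illustration (not Masstree's actual network code), batching serialized requests into one scatter/gather write avoids both per-request syscalls and user-space copies:

```cpp
// Sketch: pipeline a batch of serialized requests into a single writev()
// call instead of one write() per request. Illustrative only.
#include <sys/types.h>
#include <sys/uio.h>   // writev, struct iovec
#include <string>
#include <vector>

ssize_t send_batch(int fd, const std::vector<std::string>& requests) {
    std::vector<iovec> iov;
    iov.reserve(requests.size());
    for (const std::string& r : requests)
        iov.push_back({const_cast<char*>(r.data()), r.size()});
    // One syscall for the whole batch; the kernel gathers the buffers,
    // so we never concatenate (copy) them in user space.
    return writev(fd, iov.data(), static_cast<int>(iov.size()));
}
```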
Disk bottleneck can be avoided
• 10M puts/sec => 1GB of log/sec!
• Single disk options:
– Mainstream disk: 100-300 MB/sec write throughput, ~1 $/GB
– High-performance SSD: up to 4.4 GB/sec write throughput, > 40 $/GB
• Multiple disks: split log
– See paper for details
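A minimal sketch of the split-log idea, under the simplifying assumption that each core appends to its own log file, spread round-robin across the disks (the paper's actual design is more involved):

```cpp
// Sketch: spread logging bandwidth over several disks by giving each core
// its own log file, assigned round-robin across per-disk directories.
#include <cstdio>
#include <string>
#include <vector>

class SplitLog {
    std::vector<FILE*> logs_;  // one log file per core, spread over disks
public:
    SplitLog(const std::vector<std::string>& disk_dirs, int ncores) {
        for (int c = 0; c < ncores; ++c) {
            const std::string& dir = disk_dirs[c % disk_dirs.size()];
            logs_.push_back(std::fopen((dir + "/log." + std::to_string(c)).c_str(), "ab"));
        }
    }
    void append(int core, const void* rec, size_t len) {
        std::fwrite(rec, 1, len, logs_[core]);  // each core writes only its own log
    }
};
```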
DRAM bottleneck – hard to avoid
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. Masstree)
Cache-craftiness goes 1.5X farther, including the cost of network and disk.
DRAM bottleneck – w/o network/disk
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
Cache-craftiness goes 1.7X farther!
DRAM latency – binary tree
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; Binary vs. VoltDB. Diagram: a binary tree lookup walks one node per level.)
• O(log2 N) serial DRAM latencies!
• 10M keys => 2.7 us/lookup => 380K lookups/core/sec (see the check below)
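A back-of-the-envelope check of these numbers, assuming roughly 115 ns per dependent DRAM access (a typical figure; the latency itself is not stated on the slide):

```latex
\log_2 10^7 \approx 23.3 \ \text{levels}, \qquad
23.3 \times 115\,\text{ns} \approx 2.7\,\mu\text{s/lookup}, \qquad
\frac{1}{2.7\,\mu\text{s}} \approx 370\text{K lookups/core/sec}
```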
DRAM latency – Lock-free 4-way tree
• Concurrency: same as binary tree
• One cache line per node => 3 KV / 4 children
(Diagram: a 4-way tree node with keys X, Y, Z and four children, packed into one cache line)
• Half the levels of the binary tree => half the serial DRAM latencies (see the sketch below)
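A sketch of what such a node might look like, assuming 64-byte cache lines and 8-byte keys and pointers (layout and names are illustrative, not Masstree's code):

```cpp
// Sketch of a lock-free 4-tree node packed into one 64-byte cache line.
#include <cstdint>

struct alignas(64) FourTreeNode {
    uint64_t key[3];           // up to 3 keys (24 bytes)
    FourTreeNode* child[4];    // 4 children (32 bytes); in leaves these
                               // slots can hold value pointers instead
    uint8_t nkeys;             // how many key slots are in use
};                             // 57 bytes of payload, padded to 64

static_assert(sizeof(FourTreeNode) == 64, "one node = one cache line");
```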
4-tree beats binary tree by 40%
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
4-tree may perform terribly!
• Unbalanced: O(N) serial DRAM latencies
– e.g. sequential inserts
(Diagram: sequential inserts A, B, C, … produce a degenerate tree with O(N) levels)
• Want a balanced tree with wide fanout
B+tree – Wide and balanced
• Balanced!
• Concurrent main memory B+tree [OLFIT]
– Optimistic concurrency control: version technique
– Lookup/scan is lock-free
– Puts hold ≤ 3 per-node locks
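A minimal sketch of the version technique used by OLFIT-style trees: writers bump a per-node version, and lock-free readers retry if the version changed underneath them (simplified; memory-ordering details are assumptions):

```cpp
#include <atomic>
#include <cstdint>

struct Node {
    std::atomic<uint64_t> version;  // low bit set while a writer is updating
    // ... keys, children ...
};

// Wait out in-progress writers and return a stable version.
uint64_t stable_version(const Node* n) {
    uint64_t v;
    do { v = n->version.load(std::memory_order_acquire); } while (v & 1);
    return v;
}

// Lock-free read: run `body`, then retry if the node changed meanwhile.
template <typename F>
auto optimistic_read(const Node* n, F body) {
    for (;;) {
        uint64_t v = stable_version(n);
        auto result = body(n);  // read keys/children without locks
        std::atomic_thread_fence(std::memory_order_acquire);
        if (n->version.load(std::memory_order_relaxed) == v)
            return result;      // no concurrent writer: result is consistent
        // otherwise a writer intervened: retry
    }
}
```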
Wide fanout B+tree is 11% slower!
140M short KV, put-only
Fanout = 15: fewer levels than the 4-tree, but:
• # cache lines fetched from DRAM >= 4-tree
– 4-tree: each internal node is full
– B+tree: nodes are only ~75% full
• Serial DRAM latencies >= 4-tree
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
B+tree – Software prefetch
• Same as [pB+-trees]
(Diagram: fetching 4 prefetched cache lines costs about the same as fetching 1 line)
• Masstree: B+tree w/ fanout 15 => 4 cache lines
• Always prefetch whole node when accessed
• Result: one DRAM latency per node vs. 2, 3, or 4
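A sketch of the whole-node prefetch, using the GCC/Clang __builtin_prefetch intrinsic (the node layout is an assumption):

```cpp
// Sketch: prefetch every cache line of a 4-line B+tree node before use,
// so the lines are fetched in parallel: one DRAM latency instead of up to 4.
constexpr int kCacheLine = 64;
constexpr int kNodeLines = 4;   // a fanout-15 node spans 4 cache lines

inline void prefetch_node(const void* node) {
    const char* p = static_cast<const char*>(node);
    for (int i = 0; i < kNodeLines; ++i)
        __builtin_prefetch(p + i * kCacheLine);
}
```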
B+tree with prefetch
140M short KV, put-only, @16 cores
Beats 4-tree by 9%
Balanced beats unbalanced!
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
Concurrent B+tree problem
• Lookups retry in case of a concurrent insert
(Diagram: insert(B) into node [A C D] shifts keys to make room; a concurrent lookup can observe the intermediate state)
• Lock-free 4-tree: not a problem
– Keys do not move around
– But unbalanced
B+tree optimization – Permuter
• Keys stored unsorted, define order in tree nodes
• Permuter: a 64-bit integer encoding the sorted order of the keys in a node
(Diagram: insert(B) writes B into a free slot, then atomically changes the permuter from "0 1 2" to "0 3 1 2")
• A concurrent lookup does not need to retry
– Lookup uses permuter to search keys
– Insert appears atomic to lookups
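A sketch of one plausible permuter encoding: 4 bits of count plus fifteen 4-bit slot indices in a single 64-bit word, so an insert publishes the new order with one atomic store (the field layout is an assumption, not Masstree's exact format):

```cpp
#include <cstdint>

struct Permuter {
    uint64_t x;  // bits 0-3: key count; bits 4*(i+1)..4*(i+2): slot of i-th key

    int size() const { return int(x & 0xF); }

    // Which slot holds the i-th key in sorted order.
    int slot(int i) const { return int((x >> (4 * (i + 1))) & 0xF); }

    // New permuter with free slot `s` inserted at sorted position `i`.
    // Stored in a std::atomic<uint64_t> in practice, so concurrent lookups
    // see either the old or the new order: the insert appears atomic.
    Permuter insert(int i, int s) const {
        uint64_t below = x & ((uint64_t(1) << (4 * (i + 1))) - 1); // count + slots < i
        uint64_t above = x - below;                                // slots >= i
        return Permuter{(above << 4)                               // shift them up one
                        | (uint64_t(s) << (4 * (i + 1)))           // place s at i
                        | (below + 1)};                            // bump the count
    }
};
```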
B+tree with permuter
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree. Permuter improves throughput by 4%.)
Performance drops dramatically
when key length increases
Short values, 50% updates, @16 cores, no logging
(Chart: throughput in req/sec, millions, vs. key length from 8B to 48B; keys differ in the last 8B)
Why? The B+tree stores key suffixes indirectly, so each key comparison:
• compares the full key
• costs an extra DRAM fetch
Masstree – Trie of B+trees
• Trie: a tree where each level is indexed by a fixed-length key fragment
• Masstree: a trie with fanout 2^64, where each trie node is a B+tree
(Diagram: a B+tree indexed by k[0:7] points to B+trees indexed by k[8:15], then k[16:23], …)
• Compresses key prefixes! (see the sketch below)
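A sketch of the key slicing this implies, with a hypothetical helper that extracts the 8-byte fragment a given trie level indexes on:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>

// Fragment for trie level `depth`: bytes [8*depth, 8*depth+8) of the key,
// big-endian so integer comparison matches lexicographic byte order.
uint64_t key_fragment(const std::string& key, int depth) {
    uint8_t buf[8] = {0};  // short fragments are zero-padded
    size_t off = size_t(depth) * 8;
    if (off < key.size())
        std::memcpy(buf, key.data() + off, std::min<size_t>(8, key.size() - off));
    uint64_t v = 0;
    for (int i = 0; i < 8; ++i) v = (v << 8) | buf[i];
    return v;  // each trie level's B+tree compares only this 8-byte integer
}
```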
Case Study: Keys share P byte prefix –
Better than single B+tree
• Masstree: P/8 trie levels, each containing only one node
• Below them, a single B+tree with 8B keys

               Complexity    DRAM accesses
Masstree       O(log N)      O(log N)
Single B+tree  O(P log N)    O(P log N)
Masstree performs better
for long keys with prefixes
Short values, 50% updates, @16 cores, no logging
(Chart: throughput in req/sec, millions, vs. key length from 8B to 48B; Masstree vs. B+tree)
Masstree: 8B key comparisons vs. full key comparisons per node.
Does trie of B+trees hurt
short key performance?
140M short KV, put-only, @16 cores
(Chart: throughput in req/sec, millions; bars for Binary, 4-tree, B+tree, +Prefetch, +Permuter, Masstree)
No: 8% faster! More efficient code, since internal nodes handle 8B keys only.
Evaluation
• How does Masstree compare to other systems?
• How does Masstree compare to partitioned trees?
– How much do we pay for handling skewed workloads?
• How does Masstree compare with a hash table?
– How much do we pay for supporting range queries?
• Does Masstree scale on many cores?
Masstree performs well even with
persistence and range queries
20M short KV, uniform dist., read-only, @16 cores, w/ network
(Chart: throughput in req/sec, millions; MongoDB at 0.04, VoltDB at 0.22, then Redis, Memcached, Masstree)
• Memcached: not persistent and no range queries
• Redis: no range queries
• Unfair to MongoDB and VoltDB: both have a richer data and query model
Multi-core – Partition among cores?
• Multiple instances, one unique set of keys per inst.
– Memcached, Redis, VoltDB
(Diagram: per-core partitions, each core owning a disjoint set of keys)
• Masstree: a single shared tree
– each core can access all keys
– reduced imbalance
(Diagram: one shared tree; every core can reach every key)
A single Masstree performs
better for skewed workloads
140M short KV, read-only, @16 cores, w/ network
(Chart: throughput in req/sec, millions, vs. skew δ from 0 to 9, where one partition receives δ times more queries; a single Masstree vs. 16 partitioned Masstrees)
• Partitioned trees win at low δ: no remote DRAM access, no concurrency control
• Under high skew the partitioned system is ~80% idle: 1 partition receives 40% of the queries, the other 15 receive 4% each
Cost of supporting range queries
• Without range queries, one can use a hash table (see the sketch below)
– No resize cost: pre-allocate a large hash table
– Lock-free: update with cmpxchg
– Supports only 8B keys: efficient code
– 30% full, each lookup = 1.1 hash probes
• Measured in the Masstree framework
– 2.5X the throughput of Masstree
• So supporting range queries costs 2.5X in performance
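A minimal sketch of such a table: pre-allocated, 8B keys, lock-free puts via compare-and-swap with linear probing (a stand-in for the measured table, whose details the slide doesn't give):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

struct Slot { std::atomic<uint64_t> key; std::atomic<uint64_t> value; };

class FixedHashTable {
    std::vector<Slot> slots_;  // pre-allocated once: no resizing ever
public:
    explicit FixedHashTable(size_t n) : slots_(n) {}

    bool put(uint64_t key, uint64_t value) {  // key 0 is reserved as "empty"
        size_t i = key % slots_.size();
        for (size_t probes = 0; probes < slots_.size(); ++probes) {
            uint64_t expected = 0;
            // Claim an empty slot with cmpxchg, or update a matching key
            // (racy value updates are last-writer-wins, fine for a sketch).
            if (slots_[i].key.compare_exchange_strong(expected, key) ||
                expected == key) {
                slots_[i].value.store(value);
                return true;
            }
            i = (i + 1) % slots_.size();  // linear probing
        }
        return false;  // table full
    }
};
```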
Scale to 12X on 16 cores
Short KV, w/o logging
(Chart: per-core throughput in req/sec, millions, vs. number of cores: 1, 2, 4, 8, 16; Get curve vs. perfect scalability)
• Scales to 12X on 16 cores
• Put scales similarly
• Limited by the shared memory system
Related work
• [OLFIT]: optimistic concurrency control
• [pB+-trees]: B+tree with software prefetch
• [pkB-tree]: store fixed # of diff. bits inline
• [PALM]: lock-free B+tree, 2.3X as fast as [OLFIT]
• Masstree: the first system to combine them, with new optimizations
– Trie of B+trees, permuter
Summary
• Masstree: a general-purpose, high-performance, persistent KV store
• 5.8 million puts/sec, 8 million gets/sec
– More comparisons with other systems in paper
• Using cache-craftiness improves performance
by 1.5X
Thank you!