MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing
Bin Fan, David G. Andersen, Michael Kaminsky
Presenter: Son Nguyen

Memcached: Core Data Structures
• Key-Value index: chaining hash table
• LRU eviction: doubly-linked list (one per slab), threaded through the KV objects from an LRU header

Goals
• Reduce space overhead (bytes/key)
• Improve throughput (queries/sec)
• Target read-intensive workloads with small objects
• Result: 3x throughput, 30% more objects stored

Doubly-linked list's problems
• At least two pointers per item -> expensive in space
• Both reads and writes change the list's structure -> threads must serialize on a lock (no concurrency)

Solution: CLOCK-based LRU
• Approximates LRU
• Supports multiple readers / single writer
• Circular queue with a recency bit per entry, instead of a linked list -> less space overhead

CLOCK example
[Figure: a five-entry clock (ka..ke), each entry paired with a recency bit. Read(kd) only sets kd's recency bit to 1. Write(kf, vf) advances the clock hand, clearing set bits as it passes, until it reaches an entry whose bit is 0; that entry is evicted and replaced by (kf, vf). Write(kg, vg) repeats the same sweep.]

Chaining hashtable's problems
• Linked lists -> costly space overhead for pointers
• Pointer dereferences are slow (no benefit from the CPU cache)
• Reads are not constant time (chains can grow long)

Solution: Cuckoo Hashing
• Use 2 hash tables (one hash function each)
• Each bucket has exactly 4 slots (a bucket fits in a CPU cache line)
• Each (key, value) object can therefore reside in one of 8 possible slots

Cuckoo Hashing
[Figure: (ka, va) can live either in bucket HASH1(ka) of the first table or in bucket HASH2(ka) of the second.]

Cuckoo Hashing
• Read: at most 8 slot lookups (constant, fast)
• Write(ka, va):
  – Find an empty slot among ka's 8 possible slots
  – If all are full, randomly kick some (kb, vb) out
  – Now find an empty slot for (kb, vb)
  – Repeat up to 500 times or until an empty slot is found
  – If still no empty slot is found, expand the table
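The CLOCK-based eviction described above can be sketched as follows. This is a minimal single-threaded Python sketch, not MemC3's actual C implementation; the class and field names are illustrative, and this sketch starts newly inserted entries with their recency bit cleared.

```python
class ClockCache:
    """Minimal sketch of CLOCK-based approximate LRU (illustrative names)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity   # each slot: [key, value] or None
        self.recency = [0] * capacity    # one recency bit per slot
        self.hand = 0                    # clock hand (circular index)
        self.index = {}                  # key -> slot number

    def get(self, key):
        i = self.index.get(key)
        if i is None:
            return None                  # cache miss
        self.recency[i] = 1              # a read only sets a bit: no list
        return self.slots[i][1]          # surgery, so readers can proceed
                                         # concurrently with each other

    def put(self, key, value):
        if key in self.index:            # update in place
            i = self.index[key]
            self.slots[i][1] = value
            self.recency[i] = 1
            return
        # Advance the hand, clearing recency bits (second chance),
        # until we reach a slot whose bit is already 0: the victim.
        while True:
            i = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.slots[i] is None:
                break                    # free slot, no eviction needed
            if self.recency[i] == 0:
                del self.index[self.slots[i][0]]   # evict victim
                break
            self.recency[i] = 0          # give this entry a second chance
        self.slots[i] = [key, value]
        self.recency[i] = 0              # new entries start unreferenced here
        self.index[key] = i
```

Note how `get` never restructures anything: unlike a doubly-linked LRU list, the only write a reader performs is setting one bit, which is what makes the multiple-reader/single-writer scheme possible.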
Cuckoo Hashing: insertion example
[Figure: inserting a into a full bucket kicks b to its alternate bucket, which in turn kicks c into an empty slot — a cuckoo path a -> b -> c.]

Cuckoo Hashing: false misses
• Problem: after (kb, vb) is kicked out of its slot but before it lands in its new one, a concurrent reader looking up kb gets a false cache miss
• Solution: compute the kick-out path (the "cuckoo path") first, then move items backward along it, starting from the empty slot
• Before: (b,c,Null) -> (a,c,Null) -> (a,b,Null) -> (a,b,c) — b, then c, is temporarily missing
• Fixed: (b,c,Null) -> (b,c,c) -> (b,b,c) -> (a,b,c) — every key stays readable at every step

Cuckoo path
[Figure: the path a -> b -> c is computed first, without moving any item.]

Cuckoo path: backward insert
[Figure: items are then shifted backward along the path — c into the empty slot, then b into c's old slot, then a into b's old slot.]

Cuckoo's advantages
• Concurrency: multiple readers / single writer
• Read-optimized (a bucket fits in a CPU cache line)
• Still O(1) amortized time per write
• 30% less space overhead
• 95% table occupancy

Evaluation: hash table throughput
• 68% throughput improvement in the all-hit case; 235% in the all-miss case

Evaluation: end-to-end performance
• 3x throughput on a "real" workload
• 16-byte keys, 32-byte values, 95% GET / 5% SET, Zipf-distributed
• 50 remote clients generate the workload
• Max throughput vs. number of server threads (1-16): MemC3 peaks at 4.3 MOPS; the Memcached baselines (with and without sharding) peak at 1.5 and 0.6 MOPS

Discussion
• Writes are slower than with a chaining hashtable
  – Chaining hashtable: 14.38 million keys/sec
  – Cuckoo: 7 million keys/sec
• Idea: search for cuckoo paths in parallel
  – Benchmarks don't show much improvement
• Can we make writes concurrent?
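The precompute-then-move-backward insertion discussed above can be sketched as follows. This is a single-threaded Python sketch under stated assumptions: the hash functions, table sizes, and helper names are illustrative, not the paper's actual C implementation, and the real system additionally needs versioning/locking for true reader concurrency.

```python
import random

SLOTS_PER_BUCKET = 4   # one bucket fits in a CPU cache line
MAX_PATH = 500         # give up (and expand) after this many kicks

class CuckooTable:
    """Sketch of 2-table, 4-way cuckoo hashing with backward-path insert."""

    def __init__(self, nbuckets=1024):
        self.nbuckets = nbuckets
        # tables[t][b] is a bucket: a list of 4 (key, value) slots.
        self.tables = [[[None] * SLOTS_PER_BUCKET for _ in range(nbuckets)]
                       for _ in range(2)]

    def _buckets(self, key):
        # One candidate bucket per table (illustrative hashing).
        h = hash(key)
        return (h % self.nbuckets, (h >> 16) % self.nbuckets)

    def get(self, key):
        # A read probes at most 2 buckets x 4 slots = 8 slots: constant time.
        for table, b in zip(self.tables, self._buckets(key)):
            for slot in table[b]:
                if slot is not None and slot[0] == key:
                    return slot[1]
        return None

    def _find_path(self, key):
        # Plan the kick-out chain WITHOUT moving anything yet.
        # Each path entry is (table_index, bucket, slot).
        path = []
        for _ in range(MAX_PATH):
            for ti, b in enumerate(self._buckets(key)):
                for si in range(SLOTS_PER_BUCKET):
                    if self.tables[ti][b][si] is None:
                        path.append((ti, b, si))   # path ends at a hole
                        return path
            # All 8 candidate slots are full: displace a random victim.
            ti = random.randrange(2)
            b = self._buckets(key)[ti]
            si = random.randrange(SLOTS_PER_BUCKET)
            path.append((ti, b, si))
            key = self.tables[ti][b][si][0]  # next, find a home for the victim
        return None

    def put(self, key, value):
        # Update in place if the key is already present.
        for table, b in zip(self.tables, self._buckets(key)):
            for si, slot in enumerate(table[b]):
                if slot is not None and slot[0] == key:
                    table[b][si] = (key, value)
                    return
        path = self._find_path(key)
        if path is None:
            raise RuntimeError("no cuckoo path found: expand the table")
        # Move items BACKWARD along the path (fill the hole first), so a
        # concurrent reader would never observe a missing key.
        for i in range(len(path) - 1, 0, -1):
            ti, b, si = path[i]
            pt, pb, ps = path[i - 1]
            self.tables[ti][b][si] = self.tables[pt][pb][ps]
        ti, b, si = path[0]
        self.tables[ti][b][si] = (key, value)
```

The key design point is the split inside `put`: `_find_path` only reads the table, and the backward copy loop duplicates an item into its new slot before overwriting the old one, which is exactly the "(b,c,Null) -> (b,c,c) -> (b,b,c) -> (a,b,c)" sequence from the slides.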