MemC3 slides

MemC3: Compact and Concurrent MemCache with Dumber Caching and Smarter Hashing
Bin Fan, David G. Andersen, Michael Kaminsky
Presenter: Son Nguyen
Memcached Internals
Memcached: Core Data Structures
• LRU caching using:
  – Doubly-linked lists (LRU eviction, one per slab)
  – Chaining hash table (key-value index)
Figure: hash table with chaining over KV objects; each slab's KV objects are also linked into a doubly-linked list starting at an LRU header.
Goals
• Reduce space overhead (bytes/key)
• Improve throughput (queries/sec)
• Target read-intensive workloads with small objects
• Result: 3X throughput, 30% more objects
Doubly-linked-list’s problems
• At least two pointers per item -> high space overhead
• Both reads and writes change the list's structure -> threads must lock the list (no concurrency)
Solution: CLOCK-based LRU
• Approximate LRU
• Multiple readers/single writer
• Circular queue instead of a linked list -> less space overhead
CLOCK example
Figure: a circular buffer of entries (ka, va) through (ke, ve), each paired with a 1-bit recency flag, shown in four states. Originally, some entries have their recency bit set. Read(kd) simply sets kd's recency bit to 1. Write(kf, vf) and then Write(kg, vg) each sweep the CLOCK hand: entries with a set bit have it cleared (a second chance), and the first entry found with bit 0 is evicted to make room for the new item.
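To make the mechanism concrete, below is a minimal C sketch of a CLOCK-style approximate LRU over a fixed circular array. This is an illustration of the idea only, not MemC3's code; the names, the table size, and the recency bit given to a freshly inserted entry are all assumptions.

#include <stdbool.h>
#include <stddef.h>

#define NSLOTS 5

struct clock_entry {
    const char *key;
    const char *value;
    bool recency;                 /* set on read, cleared as the hand sweeps */
};

static struct clock_entry cache[NSLOTS];
static size_t hand;               /* CLOCK hand: next eviction candidate */

/* Read hit: just set the recency bit -- no list surgery, no write lock. */
static void clock_touch(size_t slot) {
    cache[slot].recency = true;
}

/* Insert: sweep the hand, giving entries with a set bit a second chance
 * (clear the bit and move on) and evicting the first entry whose bit is 0.
 * Returns the slot the new item was placed in. */
static size_t clock_insert(const char *key, const char *value) {
    for (;;) {
        if (!cache[hand].recency) {
            size_t victim = hand;
            cache[victim].key = key;
            cache[victim].value = value;
            cache[victim].recency = false;   /* starting bit is a policy choice */
            hand = (hand + 1) % NSLOTS;
            return victim;
        }
        cache[hand].recency = false;         /* second chance */
        hand = (hand + 1) % NSLOTS;
    }
}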
Chaining Hashtable’s problems
• Uses linked lists -> costly space overhead for pointers
• Pointer dereferences are slow (no benefit from the CPU cache)
• Reads are not constant time (chains can grow long)
Solution: Cuckoo Hashing
• Uses 2 hash tables
• Each bucket has exactly 4 slots (fits in the CPU cache)
• Each (key, value) object can therefore reside in one of 8 possible slots
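A rough C sketch of this layout and of the at-most-8-slot lookup (assumed structures and hash functions, not the paper's code; MemC3 actually keeps short key tags plus pointers so a bucket fits in a cache line, whereas this sketch inlines full keys and values for simplicity):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOTS_PER_BUCKET 4

struct slot {
    bool occupied;
    char key[16];
    char value[32];
};

struct bucket {
    struct slot slots[SLOTS_PER_BUCKET];
};

/* Two independent hash functions give every key two candidate buckets. */
static uint32_t fnv1a(const char *s, uint32_t h) {
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}
static uint32_t hash1(const char *key) { return fnv1a(key, 2166136261u); }
static uint32_t hash2(const char *key) { return fnv1a(key, 0x9e3779b9u); }

/* Lookup probes at most 2 buckets x 4 slots = 8 slots. */
static const char *cuckoo_lookup(struct bucket *t1, struct bucket *t2,
                                 size_t nbuckets, const char *key) {
    struct bucket *cand[2] = { &t1[hash1(key) % nbuckets],
                               &t2[hash2(key) % nbuckets] };
    for (int b = 0; b < 2; b++)
        for (int i = 0; i < SLOTS_PER_BUCKET; i++)
            if (cand[b]->slots[i].occupied &&
                strcmp(cand[b]->slots[i].key, key) == 0)
                return cand[b]->slots[i].value;
    return NULL;   /* miss */
}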
Cuckoo Hashing
Figure: a key ka is hashed twice; HASH1(ka) and HASH2(ka) point at its two candidate buckets, and (ka, va) may be stored in either one.
Cuckoo Hashing
• Read: at most 8 slot probes (constant time, fast)
• Write: write(ka, va)
  – Find an empty slot among the 8 possible slots for ka
  – If all are full, randomly kick some (kb, vb) out
  – Now find an empty slot for (kb, vb)
  – Repeat up to 500 times or until an empty slot is found
  – If still not found, do a table expansion
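A hedged C sketch of that insert loop, reusing the structures and hash functions from the lookup sketch above; the 500-displacement limit comes from this slide, everything else is illustrative. This is the naive forward-moving version, and the following slides show why MemC3 changes the order in which items are actually moved.

#include <stdlib.h>
#include <string.h>

#define MAX_DISPLACEMENTS 500

/* Returns false when no empty slot is found after 500 kicks,
 * in which case the caller should expand the table. */
static bool cuckoo_insert(struct bucket *t1, struct bucket *t2,
                          size_t nbuckets,
                          const char *key, const char *value) {
    char cur_key[16], cur_val[32];
    strncpy(cur_key, key, sizeof cur_key - 1);   cur_key[sizeof cur_key - 1] = '\0';
    strncpy(cur_val, value, sizeof cur_val - 1); cur_val[sizeof cur_val - 1] = '\0';

    for (int n = 0; n < MAX_DISPLACEMENTS; n++) {
        struct bucket *cand[2] = { &t1[hash1(cur_key) % nbuckets],
                                   &t2[hash2(cur_key) % nbuckets] };

        /* Look for an empty slot among the current item's 8 candidates. */
        for (int b = 0; b < 2; b++)
            for (int i = 0; i < SLOTS_PER_BUCKET; i++) {
                struct slot *s = &cand[b]->slots[i];
                if (!s->occupied) {
                    s->occupied = true;
                    strcpy(s->key, cur_key);
                    strcpy(s->value, cur_val);
                    return true;
                }
            }

        /* All 8 slots are full: kick a random victim out, take its slot,
         * and keep going with the victim as the item to insert. */
        struct slot *v = &cand[rand() % 2]->slots[rand() % SLOTS_PER_BUCKET];
        char kicked_key[16], kicked_val[32];
        strcpy(kicked_key, v->key);   strcpy(kicked_val, v->value);
        strcpy(v->key, cur_key);      strcpy(v->value, cur_val);
        strcpy(cur_key, kicked_key);  strcpy(cur_val, kicked_val);
    }
    return false;   /* no slot after 500 displacements -> expand the table */
}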
Cuckoo Hashing
Figure: Insert a. Both candidate buckets, HASH1(ka) and HASH2(ka), are full, so (ka, va) takes the slot of an existing entry b, which must now be kicked out to its alternate bucket.
Cuckoo Hashing
Figure: Insert b (the displaced entry). b's alternate bucket is also full, so b takes the slot of another entry c, which is kicked out in turn.
Cuckoo Hashing
Figure: Insert c (the displaced entry). c's alternate bucket has a free slot, so c moves there. Done!!!
Cuckoo Hashing
• Problem: after (kb, vb) is kicked out but before it is re-inserted, a reader might look up kb and get a false cache miss
• Solution: compute the kick-out path (the cuckoo path) first, then move items backward along it
• Before: (b,c,Null) -> (a,c,Null) -> (a,b,Null) -> (a,b,c)
• Fixed: (b,c,Null) -> (b,c,c) -> (b,b,c) -> (a,b,c)
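A rough C sketch of the backward move, again reusing the slot layout from the earlier sketches. It assumes the cuckoo path has already been found: path[0] is the slot that will finally receive the new key and path[len-1] is the empty slot at the end of the path; the path representation and function name are made up for illustration.

#include <string.h>

/* Move items backward along a precomputed cuckoo path, then write the new
 * (key, value) into the freed head slot. At every point each displaced key
 * is present in at least one of its two buckets, so concurrent readers
 * never observe a false miss. */
static void cuckoo_path_backward_insert(struct slot **path, int len,
                                        const char *key, const char *value) {
    for (int i = len - 1; i > 0; i--) {
        /* Copy the predecessor into the currently free slot first... */
        *path[i] = *path[i - 1];
        /* ...then free the predecessor; for a moment the item exists in
         * both slots, which is harmless for readers. */
        path[i - 1]->occupied = false;
    }
    /* path[0] is now free: write the new item into it. */
    path[0]->occupied = true;
    strncpy(path[0]->key, key, sizeof path[0]->key - 1);
    path[0]->key[sizeof path[0]->key - 1] = '\0';
    strncpy(path[0]->value, value, sizeof path[0]->value - 1);
    path[0]->value[sizeof path[0]->value - 1] = '\0';
}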
Cuckoo path
Figure: Insert a. Before moving anything, the full kick-out path is computed: a would displace b, b would displace c, and c can move to an empty slot.
Cuckoo path backward insert
Figure: Insert a, moving items backward along the computed path: c moves into the empty slot first, then b moves into c's old slot, and only then is (ka, va) written into the slot freed by b. Every key remains reachable throughout.
Cuckoo’s advantages
• Concurrency: multiple readers / single writer
• Read-optimized (entries fit in the CPU cache)
• Still O(1) amortized time for writes
• 30% less space overhead
• 95% table occupancy
Evaluation
68% throughput improvement in the all-hit case; 235% in the all-miss case
Evaluation
End-to-end Performance
3x throughput on “real” workload
16-byte keys, 32-byte values, 95% GET / 5% SET, Zipf-distributed; 50 remote clients generate the workload
Figure: throughput (MQPS) vs. number of server threads (1 to 16) for MemC3, Memcached, and Memcached with sharding; MemC3 peaks at 4.3 MOPS, while the Memcached baselines peak at 1.5 and 0.6 MOPS.
Discussion
• Writes are slower than with the chaining hashtable
  – Chaining hashtable: 14.38 million keys/sec
  – Cuckoo: 7 million keys/sec
• Idea: find the cuckoo path in parallel
  – Benchmarks don't show much improvement
• Can we make it write-concurrent?