Memory-efficient Data Management Policy for Flash-based Key-Value Store
Wang Jiangtao
2013-4-12

Outline
• Introduction
• Related work
• Two works
  – BloomStore [MSST2012]
  – TBF [ICDE2013]
• Summary

Key-Value Store
• A KV store efficiently supports simple operations: key lookup and KV pair insertion
  – Online multi-player gaming
  – Data deduplication
  – Internet services

Overview of Key-Value Store
• A KV store system should provide high access throughput (> 10,000 key lookups/sec)
• Replaces traditional relational DBs because of its superior scalability and performance
  – Users prefer a KV store for its simplicity and better scalability
• Popular management (index + storage) solution for large volumes of records
  – Often implemented through an index structure mapping Key -> Value

Challenge
• To meet high throughput demand, the performance of both index access and KV pair (data) access is critical
  – Index access: search for the KV pair associated with a given "key"
  – KV pair access: get/put the actual KV pair
• Available memory space limits the maximum number of stored KV pairs
• An in-RAM index structure addresses only the index access performance demand

DRAM Must Be Used Efficiently
• 1 TB of data
• 4 bytes of DRAM per key-value pair
• Index size by key-value pair size:
  – 32 B (data deduplication) => 125 GB!
  – 168 B (tweet) => 24 GB
  – 1 KB (small image) => 4 GB
[Figure: index size (GB) versus per key-value pair size (bytes)]

Existing Approach to Speed up Index & KV Pair Accesses
• Maintain the index structure in RAM to map each key to its KV pair on SSD
  – RAM size cannot scale up linearly with flash size
• Keep a minimal index structure in RAM, and store the rest of the index structure on SSD
  – The on-flash index structure should be designed carefully: space is precious, and random writes are slow and bad for flash life (wear out)

Bloom Filter
• A Bloom filter uses a bit array to represent a set and to test whether an element belongs to that set. Initially, every bit of the m-bit array is set to 0. The Bloom filter uses k mutually independent hash functions, each of which maps every element of the set into the range {1,…,m}. For any element x, the position hi(x) given by the i-th hash function is set to 1 (1≤i≤k). Note that if a position is set to 1 several times, only the first time has any effect; the later ones change nothing.
• Error rate
• Choosing Bloom filter parameters
  – Number of hash functions k, bit-array size m, number of elements n
  – Reducing the error rate

FlashStore [VLDB2010]
• Flash as a cache
• Components
  – Write buffer
  – Read cache
  – Recency bit vector
  – Disk-presence Bloom filter
  – Hash table index
• Cons
  – 6 bytes of RAM per key-value pair

SkimpyStash [SIGMOD2011]
• Components
  – Write buffer
  – Hash table, using linked lists: a Bloom filter and a pointer to the beginning of the linked list on flash
• Storing the linked lists on flash
  – Each pair has a pointer to earlier keys in the log
• Cons
  – Multiple flash page reads for a key lookup
  – High garbage collection cost

MSST2012

Introduction
• Key lookup throughput is the bottleneck for data applications
• Keep a large in-RAM hash table
• Move the index structure to secondary storage (SSD)
  – Expensive random writes
  – High garbage collection cost
  – Bigger storage space

BloomStore
• BloomStore design
  – An extremely low amortized RAM overhead
  – Provides high key lookup/insertion throughput
• Components
  – KV pair write buffer
  – Active Bloom filter: a flash page for the write buffer
  – Bloom filter chain: many flash pages
  – Key-range partition: a flash "block"

BloomStore architecture
[Figure: BloomStore architecture]

KV Store Operations
• Key lookup
  – Active Bloom filter
  – Bloom filter chain
  – Lookup cost

Parallel lookup
• Key lookup
  – Read the entire BF chain
  – Bit-wise AND the resultant rows
  – High read throughput
[Figure: the rows selected by the hash values of ei are read from the Bloom filters in parallel and bit-wise ANDed; ei is found]

KV Store Operations
• KV pair insertion
• KV pair update
  – Append a new key-value pair
• KV pair deletion
  – Insert a null value for the key

Experimental Evaluation
• Experiment setup
  – 1 TB SSD (PCIe) / 32 GB (SATA)
• Workload

Experimental Evaluation
• Effectiveness of prefilter
  – 1.2 bytes per KV pair
• Linux workload
• Vx workload

Experimental Evaluation
• Lookup throughput
  – Linux workload: H=96 (BF chain length), m=128 (the size of a BF)
  – Vx workload: H=96 (BF chain length), m=64 (the size of a BF)
• A prefilter

ICDE2013

Motivation
• Using flash as an extension cache is cost-effective
• The desired size of the RAM cache is too large
  – The caching policy is memory-efficient
• The replacement algorithm achieves performance comparable to existing policies
• The caching policy is agnostic to the organization of data on SSD

Defects of the existing policy
• Recency-based caching algorithm
  – Clock or LRU
  – Access data structure and index

System view
• DRAM buffer
  – An in-memory data structure to maintain access information (BF)
  – No special index to locate key-value pairs
• Key-value store
  – Provides an iterator operation for traversal
  – Write-through
[Figure: key-value cache prototype architecture]

Bloom Filter with deletion (BFD)
• BFD
  – Removing a key from SSD
  – A Bloom filter with deletion
  – Resetting the bits at the corresponding hash values in a subset of the hash functions
• Example
  – With X1 inserted: 0 1 0 0 1 0 1 0 1 0 1 0
  – After deleting X1: 0 1 0 0 0 0 1 0 0 0 1 0

Bloom Filter with deletion (BFD)
• Flow chart
• Tracking recency information
• Cons
  – False positive: polluting the cache
  – False negative: poor hit ratio

Two Bloom sub-Filters (TBF)
• Flow chart
• Dropping many elements in bulk
• Flip the filter periodically
• Cons
  – Keeping rarely-accessed objects: polluting the cache
  – Traversal length per eviction

Traversal cost
• Key-value store traversal
  – Unmarked on insertion
  – Marked on insertion: longer stretches of marked objects
• False positive

Evaluation
• Experiment setup
  – Two 1 TB 7200 RPM SATA disks in RAID-0
  – 80 GB Fusion-io ioDrive, PCIe x4
  – A mixture of 95% read operations and 5% updates
  – Key-value pairs: 200 million (256 B)
• Bloom filter
  – 4 bits per marked object
  – A byte per object in TBF
  – Hash functions: 3

Summary
• A KV store is particularly suitable for some special applications
• Flash will improve the performance of a KV store because of its faster access
• Some index structures need to be redesigned to minimize RAM usage
• Don't just treat flash as a disk replacement

Thank You!
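
Supplementary note on the Bloom Filter slide: the description there (an m-bit array, k hash functions, bits set on insertion) can be illustrated with a minimal Python sketch. This is not code from BloomStore or TBF. The class name `BloomFilter`, the helper `false_positive_rate`, and the use of salted SHA-256 to simulate k independent hash functions are all assumptions made for illustration. The error-rate function uses the standard textbook approximation (1 - e^(-kn/m))^k, which the slide names but does not spell out.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative m-bit Bloom filter with k hash functions (hypothetical helper)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit for readability; every position starts at 0.
        self.bits = bytearray(m)

    def _positions(self, key: str):
        # Assumption: the k hash functions are simulated by salting SHA-256
        # with the index i. Positions are 0-based (0..m-1) here, whereas the
        # slide uses the range {1,...,m}.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            # Setting a bit that is already 1 has no further effect.
            self.bits[pos] = 1

    def __contains__(self, key: str) -> bool:
        # A key may be present only if all k of its positions are set.
        return all(self.bits[pos] for pos in self._positions(key))


def false_positive_rate(m: int, k: int, n: int) -> float:
    """Standard approximation of the false positive rate after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=3)
for key in ("alpha", "beta", "gamma"):
    bf.add(key)

print("alpha" in bf)  # True: inserted keys are never reported missing
print(false_positive_rate(m=1024, k=3, n=100))
```

The sketch shows why the parameter slide lists k, m and n together: for a fixed number of elements n, a larger bit array m lowers the error rate, while k trades extra bit probes per lookup against fewer false positives.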