Design Patterns for Tunable and Efficient SSD-based Indexes
Ashok Anand, Aaron Gember-Jacobson, Collin Engstrom, Aditya Akella

Large hash-based indexes
• ≈20K lookups and inserts per second (1Gbps link)
• ≥32GB hash table
• Needed by WAN optimizers [Anand et al. SIGCOMM '08], de-duplication systems [Quinlan et al. FAST '02], and video proxies [Anand et al. HotNets '12]

Use of large hash-based indexes
• WAN optimizers, de-duplication systems, video proxies
• Where to store the indexes?

Where to store the indexes?
[Figure: comparison of storage options; the SSD is 8x less and 25x less on the plotted metrics, motivating SSD-based indexes]

What's the problem?
• Need domain/workload-specific optimizations for an SSD-based index with ↑performance and ↓overhead
• Existing designs have…
  – Poor flexibility – they target a specific point in the cost-performance spectrum
  – Poor generality – they only apply to specific workloads or data structures

Our contributions
• Design patterns that ensure:
  – High performance
  – Flexibility
  – Generality
• Indexes based on these principles:
  – SliceHash
  – SliceBloom
  – SliceLSH

Outline
• Problem statement
• Limitations of state-of-the-art
• SSD architecture
• Parallelism-friendly design patterns
  – SliceHash (streaming hash table)
• Evaluation

State-of-the-art SSD-based index
• BufferHash [Anand et al. NSDI '10]
  – Designed for high throughput
  – An in-memory incarnation absorbs inserts; full incarnations are flushed to flash, each summarized by an in-memory Bloom filter
  – 4 bytes per K/V pair!
  – 16 page reads in worst case! (average: ≈1)

State-of-the-art SSD-based index
• SILT [Lim et al. SOSP '11]
  – Designed for low memory + high throughput
  – Combines log, hash, and sorted stores, each with its own in-memory index
  – ≈0.7 bytes per K/V pair
  – 33 page reads in worst case! (average: 1)
  – High CPU usage!
• Existing designs target specific workloads and objectives → poor flexibility and generality
• Existing designs do not leverage the SSD's internal parallelism

SSD architecture
• The SSD controller talks to flash memory packages (e.g., 128) over multiple channels (e.g., 32, with 4 packages per channel)
• Each package contains dies; each die contains planes; each plane has a data register and holds blocks, which in turn hold pages
• How does the SSD architecture inform our design patterns?

Four design principles
I. Store related entries on the same page
II. Write to the SSD at block granularity
III. Issue large reads and large writes
IV. Spread small reads across channels

I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
• Storing sequential slots from a specific incarnation on each page → multiple page reads per lookup!

I. Store related entries on the same page
• Many hash table incarnations, like BufferHash
• Slicing: store the same hash slot from all incarnations on the same page
  – A slice is a specific slot from all incarnations; it fits on a single page
  – Only 1 page read per lookup! (see the sketch below)
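To make the slicing layout concrete, the following is a minimal Python sketch (not the authors' code; the incarnation count, slot count, and page capacity are illustrative assumptions) that contrasts the two layouts by counting the flash pages a single lookup must touch.

    # Sketch of the slicing idea (illustrative parameters, not the paper's implementation).
    # A lookup for slot `s` must check slot `s` in every incarnation.
    NUM_INCARNATIONS = 4       # assumed; the talk's SliceHash configs use 16-64
    SLOTS_PER_INCARNATION = 8  # assumed tiny table for illustration
    ENTRIES_PER_PAGE = 4       # assumed page capacity, in key/value entries

    def pages_touched_per_incarnation_layout(slot):
        """Layout 1: each page holds sequential slots of a single incarnation."""
        pages = set()
        pages_per_incarnation = SLOTS_PER_INCARNATION // ENTRIES_PER_PAGE
        for inc in range(NUM_INCARNATIONS):
            pages.add(inc * pages_per_incarnation + slot // ENTRIES_PER_PAGE)
        return pages  # one page per incarnation -> up to NUM_INCARNATIONS page reads

    def pages_touched_sliced_layout(slot):
        """Layout 2 (slicing): each page holds the same slot from all incarnations."""
        slices_per_page = max(ENTRIES_PER_PAGE // NUM_INCARNATIONS, 1)
        return {slot // slices_per_page}  # the whole slice fits on one page -> 1 read

    if __name__ == "__main__":
        s = 5
        print("per-incarnation layout:", len(pages_touched_per_incarnation_layout(s)), "page reads")
        print("sliced layout:         ", len(pages_touched_sliced_layout(s)), "page read")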
II. Write to the SSD at block granularity
• Insert into a hash table incarnation in RAM
• Divide the hash table into SliceTables so that all slices of a SliceTable fit into one block

III. Issue large reads and large writes
[Figure: read throughput (MB/s, up to ≈300) vs. read size (1-128KB); reads of at least a page benefit from channel parallelism, and larger reads additionally benefit from package parallelism]

III. Issue large reads and large writes
[Figure: write throughput (MB/s, up to ≈200) vs. number of threads (2-30) for 128KB, 256KB, and 512KB writes at or above the block size]
• The SSD assigns consecutive chunks (4 pages / 8KB) of a write to different channels → channel parallelism

III. Issue large reads and large writes
• Read the entire SliceTable (one block) into RAM
• Write the entire SliceTable back onto the SSD

IV. Spread small reads across channels
• Recall: the SSD writes consecutive chunks (4 pages) of a block to different channels
  – Use existing techniques to reverse engineer the mapping [Chen et al. HPCA '11]
  – The SSD uses write-order mapping: channel for chunk i = i modulo (# channels)

IV. Spread small reads across channels
• Estimate the channel from the slot # and the chunk size:
  (slot # × pages per slot) modulo (# channels × pages per chunk)
• Attempt to schedule 1 read per channel (see the sketch below)
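As a rough illustration of this scheduling step, here is a minimal Python sketch (not the authors' code; parameter values are taken from the talk's example where available and otherwise assumed) that estimates a slot's channel from the write-order mapping and greedily issues at most one pending read per channel.

    # Sketch of channel-aware read scheduling (illustrative, not the paper's implementation).
    # Assumes the drive stripes consecutive chunks of a written block across channels
    # in write order: channel(chunk i) = i % NUM_CHANNELS.
    NUM_CHANNELS = 32    # from the talk's example SSD
    PAGES_PER_CHUNK = 4  # chunk size noted in the talk (4 pages / 8KB)
    PAGES_PER_SLOT = 1   # assumed: one slice (a slot across all incarnations) per page

    def estimate_channel(slot):
        # Equivalent to the slide's formula: the slot's page offset reduced modulo
        # (# channels x pages per chunk), expressed here as a chunk index mod # channels.
        page = slot * PAGES_PER_SLOT
        return (page // PAGES_PER_CHUNK) % NUM_CHANNELS

    def schedule_one_read_per_channel(pending_slots):
        """Greedily pick at most one pending slot per estimated channel."""
        chosen, used = [], set()
        for slot in pending_slots:
            channel = estimate_channel(slot)
            if channel not in used:
                used.add(channel)
                chosen.append((slot, channel))
        return chosen

    if __name__ == "__main__":
        # Slots 2, 1, and 0 map to the same channel, so only one of them is issued this round.
        print(schedule_one_read_per_channel([2, 1, 4, 5, 0]))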
SliceHash summary
• An in-memory incarnation absorbs inserts
• On flash, a slice (one page) holds a specific slot from all incarnations, and a SliceTable (one block) holds the slices for a range of slots
• The whole SliceTable is read and written when updating

Evaluation: throughput vs. overhead
• Workload: 50% insert / 50% lookup, 8B keys and 8B values
• Testbed: 2.26GHz 4-core CPU, 128GB Crucial M4 SSD
[Figure: throughput (K ops/sec, up to 140), memory (bytes/entry, up to 5), and CPU utilization (%, up to 80); callouts: ↑15%, ↓12%, ↑6.6x, ↑2.8x]
• See paper for theoretical analysis

Evaluation: flexibility
• Trade-off memory for throughput by varying the number of incarnations
[Figure: throughput (K ops/sec, up to 140) and memory (bytes/entry, up to 4.5) for SILT, BufferHash, and SliceHash (SH) with 16, 32, 48, and 64 incarnations; 50% insert / 50% lookup]
• Use multiple SSDs for even ↓ memory use and ↑ throughput

Evaluation: generality
• Workload may change
[Figure: throughput (K ops/sec, up to 1200) for lookup-only, mixed, and insert-only workloads; memory (bytes/entry) for SH, BH, and SILT – "Constantly low!"; CPU utilization (%) for SH, BH, and SILT – "Decreasing!"]

Summary
• Present design practices for low-cost and high-performance SSD-based indexes
• Introduce slicing to co-locate related entries and leverage multiple levels of SSD parallelism
• SliceHash achieves 69K lookups/sec (≈12% better than prior works), with consistently low memory (0.6B/entry) and CPU (12%) overhead

Evaluation: theoretical analysis
• Parameters:
  – 16B key/value pairs
  – 80% table utilization
  – 32 incarnations
  – 4GB of memory
  – 128GB SSD
  – 0.31ms to read a block
  – 0.83ms to write a block
  – 0.15ms to read a page
• SliceHash: 0.6 B/entry memory overhead; insert cost avg ≈5.7μs, worst 1.14ms; lookup cost avg & worst 0.15ms

Evaluation: theoretical analysis
             Memory overhead   Insert cost (avg / worst)   Lookup cost (avg / worst)
BufferHash   4 B/entry         ≈0.2μs / 0.83ms             ≈0.15ms / 4.8ms
SliceHash    0.6 B/entry       ≈5.7μs / 1.14ms             0.15ms / 0.15ms
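To show where the SliceHash numbers above come from, here is a small Python sketch of the back-of-the-envelope arithmetic. The timing parameters are from the slide; the 128KB block size and the attribution of the two cost lines to inserts vs. lookups are my assumptions.

    # Rough check of the SliceHash cost figures (a sketch; block size is assumed).
    BLOCK_READ_MS = 0.31
    BLOCK_WRITE_MS = 0.83
    PAGE_READ_MS = 0.15
    ENTRY_BYTES = 16          # 16B key/value pairs
    INCARNATIONS = 32
    UTILIZATION = 0.80        # 80% table utilization
    BLOCK_BYTES = 128 * 1024  # assumed block size

    # Lookup: the slice for a slot lives on a single page, so avg = worst = one page read.
    lookup_ms = PAGE_READ_MS

    # Worst-case insert: the in-memory incarnation is full, so a SliceTable (one block)
    # must be read, merged, and written back.
    insert_worst_ms = BLOCK_READ_MS + BLOCK_WRITE_MS  # 1.14 ms

    # Average insert: that block read+write is amortized over the entries that landed in
    # this SliceTable's slots since its last flush.
    slots_per_slicetable = BLOCK_BYTES // (INCARNATIONS * ENTRY_BYTES)  # 256
    entries_per_flush = slots_per_slicetable * UTILIZATION              # ~205
    insert_avg_us = insert_worst_ms * 1000 / entries_per_flush          # ~5.6 us

    print(f"lookup: {lookup_ms} ms (avg & worst)")
    print(f"insert: ~{insert_avg_us:.1f} us avg, {insert_worst_ms:.2f} ms worst")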