REAL-TIME PARALLEL HASHING ON THE GPU
Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher*, John D. Owens, Nina Amenta
University of California, Davis
Harvard University*

MOTIVATION
• Glift: Generic, Efficient, Random-Access GPU Data Structures. Lefohn et al. [2006]
• Perfect Spatial Hashing. Lefebvre and Hoppe [2006]
• Real-Time KD-Tree Construction on Graphics Hardware. Zhou et al. [2008]

MOTIVATION
• Voxelized Lucy mesh
  – 1024^3 voxel grid, ~1 billion cells
  – 3.5 million voxels (0.33%)
• Image features
  – 140K pixels
  – 946 feature points (0.67%)

MOTIVATION
                     SORTED ARRAY     PERFECT SPATIAL HASHING    OUR METHOD
SPACE FOR n ITEMS    n                ~n + offset table          ~1.42n
RETRIEVAL            O(lg(n))         O(1)                       O(1)
                     Binary search    No collisions              Up to 3 collisions
BUILDING             Real-time        Pre-processed              Real-time
                     GPU              CPU                        GPU

OUTLINE
• Two-level structure
  – Input key/value pairs (items), with unique keys
  1. Split into buckets: each fits in fast shared memory
  2. Parallel cuckoo hashing: ensures O(1) retrieval
[Figure: keys & values (voxelized Lucy, 3.5M voxels), h(k), buckets, cuckoo hashes, hash tables & functions. Our structure: 5M slots, 26.6 ms (GTX 280)]

TRADITIONAL HASHING: PROBING
[Figure: keys, hash function h(k), and hash table; colliding insertions are marked WAIT]

PERFECT SPATIAL HASHING
• Perfect mapping gives O(1) retrieval
• Constructs collision-free mapping:
  – h1(k) indexes into auxiliary offset table
  – Offset removes collisions from h2(k)
• Offset table built in specific order
[Figure: input data, h1(k), h2(k), offset table, hash table]
LEFEBVRE & HOPPE [2006]

CUCKOO HASHING
• Use d sub-tables, each with a randomly generated hash function
• Two keys unlikely to always collide
• Tries to find permutation without conflicts
• Retrieve by looking at d possible locations
[Figure: input keys A to E, h1(k), h2(k), sub-tables T1 and T2]
PAGH AND RODLER [2001]

CUCKOO HASHING
• Sequential insertion:
  1. Try empty slots first
  2. Evict if none available
  3. Evicted key checks its other locations
  4. Recursively evict
• Assume insertion is impossible after O(lg n) iterations
  – Rebuild using new hash functions
[Figure, three frames: keys A to E inserted via h1(k) and h2(k); the last frame shows new functions h1*(k) and h2*(k)]
PAGH AND RODLER [2001]

CUCKOO HASHING
• For d=2 sub-tables:
  – Proven high chance of success with 2n+ε slots
  – Expect O(1) iterations
• For d=3 sub-tables:
  – Hard to get theoretical bounds
  – In practice, high chance of success with 1.1n+ε slots
[Figure: input, h1(k), h2(k), h3(k)]

PIPELINE
• Cuckoo hashing issues:
  1. Reads & writes throughout table
  2. Expensive rebuilds
• Two-level structure
  – Group into buckets with < 512 items
  – Utilize thread blocks
  – Each cuckoo table fits in shared memory
[Figure: Phase 1: input, h(k), buckets. Phase 2: h1(k), h2(k), cuckoo hashes]

PHASE 1: PARTITIONING
• Group into buckets of < 512 items using h(k)
• Allocate enough buckets to get average 80% load
• Rearranges data to coalesce reads in Phase 2
[Figure: items, buckets, rearranged data]

PHASE 1: PARTITIONING
• Initially:
  – h(k) = k mod |buckets|
• Re-distribute if any bucket gets > 512 items
  – 125 restarts/25000 trials (0.5%) for 5 million random items
  – h(k) = ((a+bk) mod p) mod |buckets|

PHASE 1: PARTITIONING
1. Allocate buckets
2. Compute item buckets using h(k)
3. Determine bucket sizes
  – Orders items in same bucket
4. Reserve contiguous chunk for each bucket
5. Move items
[Figure: keys A to O, h(k), position (atomic add), bucket sizes, prefix sum, bucket offsets, packed bucket data]

PHASE 2: CUCKOO HASHING
• Thread block per bucket
• Performed in shared memory to reduce overhead
• Three sub-tables for better occupancy
[Figure: global memory: bucket data A to H. Shared memory: single bucket's cuckoo tables T1, T2, T3]

PHASE 2: CUCKOO HASHING
• Generate hash functions
• Parallelized construction
  1. Simultaneously insert
  2. Synchronize block
  3. If evicted, repeat for other sub-tables
• Fail after 25 iterations through all 3 sub-tables
[Figure: keys A to H, g1(k), g2(k), g3(k), sub-tables T1, T2, T3 in shared memory]

PHASE 2: CUCKOO HASHING
• In trials, average of 5.5 iterations
  – Nearly all converge with first functions
  – Succeeded with < 2 new sets of functions
[Figure: keys A to H placed in T1, T2, T3 via g1(k), g2(k), g3(k)]

PHASE 2: CUCKOO HASHING
• At end of phase, save out to global memory:
  1. Cuckoo hash functions
  2. Rearranged sub-tables
[Figure: shared memory: buckets' tables. Global memory: hash functions, interleaved cuckoo tables]

HASH RETRIEVALS
• Look in the 3 possible locations:
  1. Compute bucket
  2. Retrieve hash functions
  3. Check each slot, stopping early if item found
[Figure: query k?, h(k), value v_k]

PIPELINE: LUCY DATASET
[Figure: input voxels (voxelized Lucy); Phase 1: rearranged data (item distribution); Phase 2: cuckoo hash tables (cuckoo sub-tables)]

TIMING RESULTS: LUCY DATASET
[Plot: milliseconds (0 to 60) vs. key-value pairs (0 to 3.5 million). Series: GPU hash construction; GPU hash retrieval; sorted array radix sort; sorted array binary search; CPU PSH retrieval]
• Timed on EVGA GTX 280 SSC
• All items retrieved in shuffled order, in parallel

TIMING RESULTS: RANDOMIZED DATA
[Plot: milliseconds (0 to 180) vs. key-value pairs (0 to 10 million). Series: GPU hash construction; GPU hash retrieval; sorted array radix sort; sorted array binary search; CPU PSH retrieval]

TIMING RESULTS: STEP BREAKDOWN
[Plot: milliseconds (0 to 30) vs. key-value pairs (0 to 10 million). Series: cuckoo hashing; assigning keys to buckets and counting; shuffling the points into the buckets; initialization; determining bucket data locations]

HASH VARIATIONS
[Figure: voxels as keys, points as values]

MULTI-VALUE HASH
[Figure: voxels, points, multi-value hash]

COMPACTING HASH
[Figure: voxels numbered 0 to 9, compacting hash; per-voxel avg normal, avg color, # points]

SPATIAL HASHING

GEOMETRIC HASHING

TRADE-OFFS
1. Bucket size & occupancy
2. Number of sub-tables
3. Cuckoo table sizes
• Ordered vs. random retrieval
[Figure labels: hash table, sorted array; space utilization, construction speed, retrieval speed]

SUMMARY
• Introduced method for building large hash tables in real-time using CUDA
  – O(1) random access to sparse data
  – Balances space usage, construction speed, and retrieval speed
• Generalized construction to handle non-unique keys
• Demonstrated use with spatial and geometric hashing
• Future work
  – Decrease restart penalty for bucket distribution
  – Reduce atomic usage to speed up construction

ACKNOWLEDGMENTS
• Thanks to our funding agencies:
  – National Science Foundation (awards 0541448, 0625744, 0635250, and 0721491)
  – SciDAC Institute for Ultrascale Visualization
• Companies:
  – NVIDIA for equipment donations & Shubho’s Graduate Fellowship
  – Cisco and Google for research grants
• Data sources:
  – Daniel Vlasic
  – The Stanford 3D Scanning Repository
  – The CAVIAR project
  – Matthew Harding (http://www.wherethehellismatt.com/)
    2006 Matt Harding Dancing Video is provided courtesy of Cadbury Adams USA LLC. ©2006 Cadbury Adams USA LLC. All Rights Reserved. Stride is a registered trademark of Cadbury Adams USA LLC.
• Timothy Lee for his help in the early stages of the project
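
SKETCH: SEQUENTIAL CUCKOO INSERTION
The sequential insertion steps on the CUCKOO HASHING slides (try empty slots, evict, let the evicted key try its other location, rebuild on failure) can be sketched on the CPU roughly as follows. This is a minimal illustration, not the authors' code: the class name CuckooTable, the prime P, the eviction limit of 32, and the use of the deck's ((a+bk) mod p) form for the sub-table hash functions are all assumptions made for the example.

```python
import random

P = 2**31 - 1  # assumed prime modulus; keys must be smaller than P


class CuckooTable:
    """Sequential cuckoo hash with d sub-tables (illustrative sketch only)."""

    def __init__(self, slots_per_table, d=2, seed=0):
        self.size = slots_per_table
        self.d = d
        self.rng = random.Random(seed)
        self._new_functions()

    def _new_functions(self):
        # One random (a, b) pair per sub-table; also clears every sub-table.
        self.params = [(self.rng.randrange(P), self.rng.randrange(1, P))
                       for _ in range(self.d)]
        self.tables = [[None] * self.size for _ in range(self.d)]

    def _slot(self, t, key):
        a, b = self.params[t]
        return ((a + b * key) % P) % self.size

    def _try_insert(self, key, value, max_evictions):
        """Return None on success, or the item left without a slot."""
        item = (key, value)
        t = 0
        for _ in range(max_evictions):
            # 1. Try empty slots first.
            for u in range(self.d):
                s = self._slot(u, item[0])
                if self.tables[u][s] is None:
                    self.tables[u][s] = item
                    return None
            # 2. Evict the occupant of sub-table t and take its slot.
            s = self._slot(t, item[0])
            item, self.tables[t][s] = self.tables[t][s], item
            # 3./4. The evicted item tries its other locations next.
            t = (t + 1) % self.d
        return item

    def insert(self, key, value, max_evictions=32, max_rebuilds=50):
        # Keys are assumed unique, as on the slides.
        homeless = self._try_insert(key, value, max_evictions)
        if homeless is None:
            return
        # Insertion judged impossible: rebuild everything with new functions.
        items = [it for tab in self.tables for it in tab if it is not None]
        items.append(homeless)
        for _ in range(max_rebuilds):
            self._new_functions()
            if all(self._try_insert(k, v, max_evictions) is None
                   for k, v in items):
                return
        raise RuntimeError("rebuild limit reached; allocate more slots")

    def get(self, key):
        # Retrieval looks at the d possible locations only.
        for t in range(self.d):
            item = self.tables[t][self._slot(t, key)]
            if item is not None and item[0] == key:
                return item[1]
        return None
```

For example, `table = CuckooTable(slots_per_table=64, d=2)` gives 128 slots, which follows the 2n+ε guideline for up to roughly 60 keys; `table.insert(k, v)` and `table.get(k)` then store and retrieve items.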
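
SKETCH: PHASE 1 PARTITIONING
The five partitioning steps of Phase 1 (allocate buckets, compute h(k), count bucket sizes, prefix sum, move items) can be mimicked sequentially as below. This is a hedged CPU sketch: on the GPU the per-bucket counter would be an atomic add and the offsets a parallel prefix sum, whereas here plain loops stand in for both. The function name partition, the prime P, and the restart limit are assumptions; the 512-item limit, the 80% target load, and the two forms of h(k) come from the slides.

```python
import math
import random
from itertools import accumulate

MAX_BUCKET = 512   # bucket capacity from the slides
P = 2**31 - 1      # assumed prime modulus; keys must be smaller than P


def partition(items, target_load=0.8, seed=0, max_restarts=100):
    """Group (key, value) items into buckets of at most MAX_BUCKET items.

    Returns (packed, offsets, sizes, (a, b)) where bucket i occupies
    packed[offsets[i] : offsets[i] + sizes[i]].
    """
    rng = random.Random(seed)
    # 1. Allocate enough buckets for an average load of about 80%.
    num_buckets = max(1, math.ceil(len(items) / (target_load * MAX_BUCKET)))
    a, b = 0, 1  # first attempt is h(k) = k mod |buckets|
    for _ in range(max_restarts + 1):
        sizes = [0] * num_buckets
        bucket_of, pos_in_bucket = [], []
        for key, _ in items:
            # 2. Compute the item's bucket with h(k).
            bkt = ((a + b * key) % P) % num_buckets
            bucket_of.append(bkt)
            # 3. The counter value before incrementing orders items within
            #    a bucket (an atomic add would return it on the GPU).
            pos_in_bucket.append(sizes[bkt])
            sizes[bkt] += 1
        if max(sizes) <= MAX_BUCKET:
            break
        # Re-distribute with h(k) = ((a + b*k) mod p) mod |buckets|.
        a, b = rng.randrange(P), rng.randrange(1, P)
    else:
        raise RuntimeError("no suitable bucket hash found")
    # 4. Exclusive prefix sum reserves a contiguous chunk per bucket.
    offsets = [0] + list(accumulate(sizes))[:-1]
    # 5. Move each item to its bucket's chunk.
    packed = [None] * len(items)
    for i, item in enumerate(items):
        packed[offsets[bucket_of[i]] + pos_in_bucket[i]] = item
    return packed, offsets, sizes, (a, b)
```

Because every bucket ends up contiguous in `packed`, a Phase 2 worker can read its bucket as one slice, which is the sequential analogue of the coalesced reads mentioned on the slides.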
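
SKETCH: PHASE 2 ROUND-BASED CONSTRUCTION AND RETRIEVAL
The parallelized construction of Phase 2 (simultaneously insert, synchronize, evicted keys move to the next sub-table, fail after 25 iterations) can be simulated one round at a time on the CPU. The sketch below is an approximation under stated assumptions: a plain loop in which the last writer wins stands in for simultaneous thread writes, the names build_bucket and lookup are hypothetical, the sub-table size is a free parameter rather than a value from the slides, and only keys are stored (values would sit alongside them).

```python
import random

P = 2**31 - 1  # assumed prime modulus; keys must be smaller than P


def build_bucket(keys, sub_size, d=3, max_rounds=25, max_restarts=100, seed=0):
    """Build d cuckoo sub-tables for one bucket; return (params, tables)."""
    rng = random.Random(seed)
    keys = list(keys)
    for _ in range(max_restarts):
        # Generate one hash function g_t per sub-table.
        params = [(rng.randrange(P), rng.randrange(1, P)) for _ in range(d)]

        def slot(t, k):
            a, b = params[t]
            return ((a + b * k) % P) % sub_size

        tables = [[None] * sub_size for _ in range(d)]
        pending = keys
        for _ in range(max_rounds):
            for t in range(d):
                # 1. All unplaced keys write into sub-table t at once;
                #    colliding writes leave one winner (here, the last).
                for k in pending:
                    tables[t][slot(t, k)] = k
                # 2. After the synchronization point, find which keys
                #    still own a slot in some sub-table.
                stored = {k for tab in tables for k in tab if k is not None}
                # 3. Evicted or unplaced keys retry in the next sub-table.
                pending = [k for k in keys if k not in stored]
                if not pending:
                    return params, tables
        # Round limit reached: start over with new hash functions.
    raise RuntimeError("bucket could not be built; enlarge the sub-tables")


def lookup(key, params, tables):
    """Check the d possible slots; return the sub-table index or None."""
    sub_size = len(tables[0])
    for t, (a, b) in enumerate(params):
        if tables[t][((a + b * key) % P) % sub_size] == key:
            return t  # stop early once the key is found
    return None
```

A call such as `build_bucket(keys, sub_size=192)` followed by `lookup(k, params, tables)` mirrors the retrieval slide for a single bucket; a full query would first compute the bucket with h(k) and fetch that bucket's stored functions.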