hashgpu - IDAV - University of California, Davis

REAL-TIME PARALLEL HASHING ON THE GPU
Dan A. Alcantara
Andrei Sharf
Fatemeh Abbasinejad
Shubhabrata Sengupta
Michael Mitzenmacher*
John D. Owens
Nina Amenta
University of California, Davis
Harvard University*
MOTIVATION
• GLIFT: GENERIC, EFFICIENT, RANDOM-ACCESS GPU DATA STRUCTURES, LEFOHN ET AL. [2006]
• PERFECT SPATIAL HASHING, LEFEBVRE AND HOPPE [2006]
• REAL-TIME KD-TREE CONSTRUCTION ON GRAPHICS HARDWARE, ZHOU ET AL. [2008]
MOTIVATION
VOXELIZED LUCY MESH
• 1024^3 voxel grid ~ 1 billion cells
• 3.5 million voxels (0.33%)
IMAGE FEATURES
• 140K pixels
• 946 feature points (0.67%)
MOTIVATION
                   | SORTED ARRAY  | PERFECT SPATIAL HASHING | OUR METHOD
SPACE FOR n ITEMS  | n             | ~n + offset table       | ~1.42n
RETRIEVAL          | O(lg(n))      | O(1)                    | O(1)
                   | Binary search | No collisions           | Up to 3 collisions
BUILDING           | Real-time     | Pre-processed           | Real-time
                   | GPU           | CPU                     | GPU
OUTLINE
• Two-level structure
– Input key/value pairs (items), with unique keys
1. Split into buckets
   Fits in fast shared memory
2. Parallel Cuckoo Hashing
   Ensures O(1) retrieval
[Diagram: KEYS & VALUES (VOXELIZED LUCY, 3.5M VOXELS) are split by h(k) into BUCKETS, then built into CUCKOO HASHES (HASH TABLES & FUNCTIONS). OUR STRUCTURE: 5M SLOTS, 26.6 MS (GTX 280)]
TRADITIONAL HASHING: PROBING
[Diagram: KEYS A to G pass through the HASH FUNCTION h(k) into the HASH TABLE; keys that collide WAIT and probe again]
PERFECT SPATIAL HASHING
• Perfect mapping gives O(1) retrieval
• Constructs collision-free mapping:
– h1(k) indexes into auxiliary offset table
– Offset removes collisions from h2(k)
• Offset table built in specific order
[Diagram: INPUT DATA is hashed by h1(k) into the OFFSET TABLE; the offset is added to h2(k) to index the HASH TABLE]
LEFEBVRE & HOPPE [2006]
CUCKOO HASHING
• Use d sub-tables, each with randomly generated hash function
• Two keys unlikely to always collide
• Tries to find permutation without conflicts
• Retrieve by looking at d possible locations
[Diagram: INPUT keys A to E are hashed by h1(k) into sub-table T1 and by h2(k) into sub-table T2]
PAGH AND RODLER [2001]
CUCKOO HASHING
• Sequential insertion:
1. Try empty slots first
2. Evict if none available
3. Evicted key checks its other locations
4. Recursively evict
• Assume impossible after O(lg n) iterations
– Rebuild using new hash functions
[Diagram: keys A to E inserted one at a time into two sub-tables using h1(k) and h2(k)]
PAGH AND RODLER [2001]
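The sequential insertion steps above can be sketched in a few lines of Python. This is an illustrative sketch for d=2 sub-tables, not the presenters' code: the names `make_hash`, `insert`, and `cuckoo_build`, the prime modulus, and the iteration cap of 4·lg(n) are all assumptions chosen for the example.

```python
import random

PRIME = 1_000_003  # assumed prime modulus; keys must be smaller than this


def make_hash(table_size, rng):
    """Random hash of the form ((a + b*k) mod PRIME) mod table_size."""
    a = rng.randrange(PRIME)
    b = rng.randrange(1, PRIME)
    return lambda k: ((a + b * k) % PRIME) % table_size


def insert(key, tables, hashes, max_iters):
    # 1. Try empty slots first.
    for t in (0, 1):
        slot = hashes[t](key)
        if tables[t][slot] is None:
            tables[t][slot] = key
            return True
    # 2.-4. Evict the occupant, then let the evicted key try its other
    # sub-table, repeating until a key lands in an empty slot.
    t = 0
    for _ in range(max_iters):
        slot = hashes[t](key)
        key, tables[t][slot] = tables[t][slot], key  # swap in, take occupant
        if key is None:
            return True
        t = 1 - t
    return False  # assume impossible with these hash functions


def cuckoo_build(keys, table_size, seed=0, max_rebuilds=100):
    rng = random.Random(seed)
    max_iters = 4 * max(1, len(keys).bit_length())  # O(lg n) cap (assumed constant)
    for _ in range(max_rebuilds):
        hashes = [make_hash(table_size, rng) for _ in range(2)]
        tables = [[None] * table_size for _ in range(2)]
        if all(insert(k, tables, hashes, max_iters) for k in keys):
            return tables, hashes
        # Otherwise: rebuild from scratch using new hash functions.
    raise RuntimeError("no conflict-free placement found")
```

After a successful build, every key sits in one of its two candidate slots, so a lookup only needs to check `tables[0][hashes[0](k)]` and `tables[1][hashes[1](k)]`.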
CUCKOO HASHING
• For d=2 sub-tables:
– Proven high chance of success with 2n+ε slots
– Expect O(1) iterations
• For d=3 sub-tables:
– Hard to get theoretical bounds
– In practice, high chance of success with 1.1n+ε slots
[Diagram: INPUT hashed by h1(k), h2(k), and h3(k)]
PIPELINE
• Cuckoo Hashing issues:
1. Reads & writes throughout table
2. Expensive rebuilds
• Two-level structure
– Group into buckets with < 512 items
– Utilize thread blocks
– Each cuckoo table fits in shared memory
[Diagram: PHASE 1 splits the INPUT into BUCKETS using h(k); PHASE 2 builds CUCKOO HASHES for each bucket using h1(k) and h2(k)]
PHASE 1: PARTITIONING
• Group into buckets of < 512 items using h(k)
• Allocate enough buckets to get average 80% load
• Rearranges data to coalesce reads in Phase 2
[Diagram: ITEMS are assigned to BUCKETS, giving REARRANGED DATA]
PHASE 1: PARTITIONING
• Initially:
– h(k) = k mod |buckets|
• Re-distribute if any bucket gets > 512 items
– 125 restarts/25000 trials (0.5%) for 5 million random items
– h(k) = ((a+bk) mod p) mod |buckets|
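A small Python sketch of the bucket hash selection described above: start with k mod |buckets| and switch to a randomized ((a+bk) mod p) mod |buckets| only if some bucket overflows. The slides do not give p or how a and b are drawn, so the prime, the random draws, and the names `num_buckets` and `choose_bucket_hash` are assumptions made for illustration.

```python
import math
import random

BUCKET_CAPACITY = 512
PRIME = 4_294_967_291  # assumed p: the largest prime below 2**32


def num_buckets(n, target_load=0.8):
    """Enough buckets that the average bucket is about 80% full."""
    return math.ceil(n / (BUCKET_CAPACITY * target_load))


def choose_bucket_hash(keys, seed=0, max_restarts=100):
    """Return (h, bucket_count, restarts) with no bucket over capacity."""
    rng = random.Random(seed)
    m = num_buckets(len(keys))
    h = lambda k: k % m  # initial choice: h(k) = k mod |buckets|
    for restarts in range(max_restarts + 1):
        counts = [0] * m
        for k in keys:
            counts[h(k)] += 1
        if max(counts) <= BUCKET_CAPACITY:
            return h, m, restarts
        # Re-distribute: draw new constants a, b for the randomized hash.
        a = rng.randrange(PRIME)
        b = rng.randrange(1, PRIME)
        h = lambda k, a=a, b=b: ((a + b * k) % PRIME) % m
    raise RuntimeError("could not keep every bucket within capacity")
```

For example, 5 million items at an 80% target load need `num_buckets(5_000_000)` = 12208 buckets of 512 slots.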
PHASE 1: PARTITIONING
1. Allocate buckets
2. Compute item buckets using h(k)
3. Determine bucket sizes
– Orders items in same bucket
4. Reserve contiguous chunk for each bucket
5. Move items
[Diagram: KEYS A to O are hashed with h(k); an ATOMIC ADD on the BUCKET SIZES gives each key its POSITION within its bucket; a PREFIX SUM over the sizes gives the BUCKET OFFSETS (0, 5, 11, ...); the keys are then moved into the PACKED BUCKET DATA]
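The five partitioning steps can be mimicked serially in Python. In this sketch the counter increment stands in for the GPU atomic add (the value it returns is the item's position within its bucket), and `itertools.accumulate` stands in for the parallel prefix sum. The function name `partition` is hypothetical.

```python
from itertools import accumulate


def partition(keys, bucket_count, h):
    # Steps 1-3: allocate counters, compute each key's bucket, and count.
    # The pre-increment counter value orders items within the same bucket.
    sizes = [0] * bucket_count
    positions = []
    for k in keys:
        b = h(k)
        positions.append(sizes[b])  # what an atomic add would return
        sizes[b] += 1

    # Step 4: exclusive prefix sum reserves a contiguous chunk per bucket.
    offsets = [0] + list(accumulate(sizes))[:-1]

    # Step 5: move every item to its bucket's chunk.
    packed = [None] * len(keys)
    for k, pos in zip(keys, positions):
        packed[offsets[h(k)] + pos] = k
    return sizes, offsets, packed


sizes, offsets, packed = partition(list(range(15)), 3, lambda k: k % 3)
print(sizes)    # [5, 5, 5]
print(offsets)  # [0, 5, 10]
print(packed)   # [0, 3, 6, 9, 12, 1, 4, 7, 10, 13, 2, 5, 8, 11, 14]
```

On the GPU the order within a bucket depends on which thread's atomic add lands first; the serial loop here simply uses input order.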
PHASE 2: CUCKOO HASHING
• Thread block per bucket
• Performed in shared memory to reduce overhead
• Three sub-tables for better occupancy
[Diagram: BUCKET DATA (keys A to H) in GLOBAL MEMORY is loaded into a SINGLE BUCKET'S CUCKOO TABLES (T1, T2, T3) in SHARED MEMORY]
PHASE 2: CUCKOO HASHING
• Generate hash functions
• Parallelized construction
1. Simultaneously insert
2. Synchronize block
3. If evicted, repeat for other sub-tables
• Fail after 25 iterations through all 3 sub-tables
[Diagram: keys A to H are hashed by g1(k), g2(k), and g3(k) into a SINGLE BUCKET'S CUCKOO TABLES (T1, T2, T3) in SHARED MEMORY]
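The parallelized construction can be simulated serially to see how the rounds behave. In the Python sketch below, every key that is not currently stored writes into the same sub-table in one pass, and a later write to the same slot overwrites an earlier one, which stands in for the race between threads. The end of each pass plays the role of the block synchronization. The form of the hash functions, the sub-table size, and the names `make_hashes`, `try_build`, and `build_bucket` are assumptions for illustration.

```python
import random

PRIME = 1_000_003  # assumed prime modulus; keys must be smaller than this


def make_hashes(sub_size, rng):
    """Three random hash functions g1, g2, g3, one per sub-table."""
    fns = []
    for _ in range(3):
        a, b = rng.randrange(PRIME), rng.randrange(1, PRIME)
        fns.append(lambda k, a=a, b=b: ((a + b * k) % PRIME) % sub_size)
    return fns


def is_stored(key, tables, g):
    return any(tables[t][g[t](key)] == key for t in range(3))


def try_build(keys, sub_size, g, max_iters=25):
    tables = [[None] * sub_size for _ in range(3)]
    for _ in range(max_iters):          # iterations through all 3 sub-tables
        for t in range(3):
            # Keys that were never placed, or were evicted, try sub-table t.
            pending = [k for k in keys if not is_stored(k, tables, g)]
            if not pending:
                return tables
            for k in pending:           # "simultaneous" insert: last write wins
                tables[t][g[t](k)] = k
            # (a block-wide synchronization would happen here on the GPU)
    return tables if all(is_stored(k, tables, g) for k in keys) else None


def build_bucket(keys, sub_size, seed=0, max_function_sets=50):
    rng = random.Random(seed)
    for sets_used in range(1, max_function_sets + 1):
        g = make_hashes(sub_size, rng)
        tables = try_build(keys, sub_size, g)
        if tables is not None:
            return tables, g, sets_used
    raise RuntimeError("bucket could not be built")
```

A key only writes when none of its three candidate slots holds it, so at most one copy of each key exists at any time.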
PHASE 2: CUCKOO HASHING
• In trials, average of 5.5 iterations
– Nearly all converge with first functions
– Succeeded with < 2 new sets of functions
[Diagram: keys A to H placed by g1(k), g2(k), and g3(k) into a SINGLE BUCKET'S CUCKOO TABLES (T1, T2, T3) in SHARED MEMORY]
PHASE 2: CUCKOO HASHING
• At end of phase, save out to global memory:
1. Cuckoo hash functions
2. Rearranged sub-tables
[Diagram: BUCKETS' TABLES in SHARED MEMORY are written to GLOBAL MEMORY as HASH FUNCTIONS and INTERLEAVED CUCKOO TABLES]
HASH RETRIEVALS
• Look in the 3 possible locations:
1. Compute bucket
2. Retrieve hash functions
3. Check each slot, stopping early if item found
[Diagram: QUERY k? is hashed by h(k) to its bucket; the lookup returns VALUE vk]
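The three retrieval steps map directly onto a short lookup routine. The Python sketch below uses a tiny hand-built structure with two buckets and three two-slot sub-tables per bucket. The layout (nested lists of `(key, value)` pairs), the toy hash functions, and the name `retrieve` are hypothetical and chosen only to keep the example readable.

```python
def retrieve(key, bucket_hash, bucket_functions, bucket_tables):
    b = bucket_hash(key)                   # 1. compute bucket
    functions = bucket_functions[b]        # 2. retrieve its hash functions
    for t, g in enumerate(functions):      # 3. check each slot, stop early
        entry = bucket_tables[b][t][g(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
    return None                            # not in any of the 3 locations


# Toy structure: h(k) = k mod 2, and the same three functions in each bucket.
bucket_hash = lambda k: k % 2
g = [lambda k: (k // 2) % 2, lambda k: (k // 4) % 2, lambda k: (k // 8) % 2]
bucket_functions = [g, g]
bucket_tables = [
    # bucket 0: T1 holds keys 0 and 2, T2 holds key 4, T3 is empty
    [[(0, "v0"), (2, "v2")], [None, (4, "v4")], [None, None]],
    # bucket 1: T1 holds keys 1 and 3
    [[(1, "v1"), (3, "v3")], [None, None], [None, None]],
]

print(retrieve(4, bucket_hash, bucket_functions, bucket_tables))  # v4
print(retrieve(6, bucket_hash, bucket_functions, bucket_tables))  # None
```

Key 4 misses in T1 (that slot holds key 0) and is found in T2, so the third sub-table is never read; key 6 is absent, so all three locations are checked.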
PIPELINE: LUCY DATASET
[Diagram: INPUT VOXELS (VOXELIZED LUCY) go through PHASE 1 to give REARRANGED DATA (ITEM DISTRIBUTION), then through PHASE 2 to give CUCKOO HASH TABLES (CUCKOO SUB-TABLES)]
TIMING RESULTS: LUCY DATASET
[Plot: Milliseconds (0 to 60) vs. Key-value pairs (millions) (0 to 3.5). Series: GPU Hash: Construction; GPU Hash: Retrieval; Sorted array: Radix sort; Sorted array: Binary search; CPU PSH: Retrieval]
• Timed on EVGA GTX 280 SSC
• All items retrieved in shuffled order, in parallel
TIMING RESULTS: RANDOMIZED DATA
[Plot: Milliseconds (0 to 180) vs. Key-value pairs (millions) (0 to 10). Series: GPU Hash: Construction; GPU Hash: Retrieval; Sorted array: Radix sort; Sorted array: Binary search; CPU PSH: Retrieval]
TIMING RESULTS: STEP BREAKDOWN
[Plot: Milliseconds (0 to 30) vs. Key-value pairs (Millions) (0 to 10). Series: Cuckoo hashing; Assigning keys to buckets and counting; Shuffling the points into the buckets; Initialization; Determining bucket data locations]
HASH VARIATIONS
[Figure labels: VOXELS, KEYS, POINTS, VALUES]
MULTI-VALUE HASH
[Figure labels: VOXELS, POINTS, MULTI-VALUE HASH]
COMPACTING HASH
[Figure labels: VOXELS; indices 0 to 9; COMPACTING HASH; AVG NORMAL, AVG COLOR, # POINTS]
SPATIAL HASHING
GEOMETRIC HASHING
TRADE-OFFS
1. Bucket size & occupancy
2. Number of sub-tables
3. Cuckoo table sizes
• Ordered vs. random retrieval
[Diagram labels: SPACE UTILIZATION, CONSTRUCTION SPEED, RETRIEVAL SPEED, CONSTRUCTION & RETRIEVAL SPEED, HASH TABLE, SORTED ARRAY]
SUMMARY
• Introduced method for building large hash tables in real-time using CUDA
– O(1) random access to sparse data
– Balances space usage, construction speed, and retrieval speed
• Generalized construction to handle non-unique keys
• Demonstrated use with spatial and geometric hashing
• Future work
– Decrease restart penalty for bucket distribution
– Reduce atomic usage to speed up construction
ACKNOWLEDGMENTS
• Thanks to our funding agencies:
– National Science Foundation (awards 0541448, 0625744, 0635250, and 0721491)
– SciDAC Institute for Ultrascale Visualization
• Companies:
– NVIDIA for equipment donations & Shubho’s Graduate Fellowship
– Cisco and Google for research grants
• Data sources:
– Daniel Vlasic
– The Stanford 3D Scanning Repository
– The CAVIAR project
– Matthew Harding (http://www.wherethehellismatt.com/)
2006 Matt Harding Dancing Video is provided courtesy of Cadbury Adams USA LLC. ©2006 Cadbury Adams USA LLC. All Rights Reserved. Stride is a registered trademark of Cadbury Adams USA LLC.
• Timothy Lee for his help in the early stages of the project