Chapter 1

Multi-dimensional Range Query Processing
on the GPU
Beomseok Nam
Data Intensive Computing Lab
School of Electrical and Computer Engineering
Ulsan National Institute of Science and Technology, Korea
Multi-dimensional Indexing
• One of the core technologies in GIS, scientific databases,
computer graphics, etc.
• Access pattern into scientific datasets
– Multidimensional range query
• Retrieves data that overlap a given range of values
• Ex) SELECT temperature
FROM dataset
WHERE latitude BETWEEN 20 AND 30
AND longitude BETWEEN 50 AND 60
– Multidimensional indexing trees
• KD-Trees, KDB-Trees, R-Trees, R*-Trees
• Bitmap indexes
– Multi-dimensional indexing is notoriously
difficult to parallelize.
Multi-dimensional Indexing Trees:
R-Tree
• Proposed by Antonin Guttman (1984)
• Stored and indexed via nested MBRs (Minimum Bounding
Rectangles)
• Resembles height-balanced B+-tree
An Example Structure of an R-Tree
Source: http://en.wikipedia.org/wiki/Image:R-tree.jpg
Motivation
• GPGPU has emerged as a new HPC parallel computing
paradigm.
• Scientific data analysis applications are major
workloads in the HPC market.
• A common access pattern into scientific datasets is
multi-dimensional range query.
• Q: How to parallelize multi-dimensional range query
on the GPU?
MPES
(Massively Parallel Exhaustive Scan)
• This is how the GPGPU is most commonly utilized today.
• Achieves maximum utilization of the GPU.
• Simple, BUT it must access the ENTIRE dataset.
(Figure: the total dataset is divided evenly among threads thread[0] … thread[K-1], each scanning its own partition.)
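As a rough illustration, an MPES kernel can be as simple as the following sketch (hypothetical names and layout; the slides show no code). Each thread walks its grid-stride share of the raw points and tests them against the query rectangle:

#define DIM 4   // 4D points, as in the experiments later in the talk

struct MBR { float lo[DIM], hi[DIM]; };

__global__ void mpesScan(const float* points,  // n * DIM coordinates
                         int n, MBR query, int* hit)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        bool inside = true;
        for (int d = 0; d < DIM; d++) {
            float v = points[i * DIM + d];
            inside = inside && (query.lo[d] <= v && v <= query.hi[d]);
        }
        hit[i] = inside;  // each thread writes only its own slots
    }
}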
Parallel R-Tree Search
• Basic idea
– Compare a given query range with multiple MBRs of child nodes in
parallel
(Figure: within an SMP, each SP compares one child MBR of node E against the i-th query Q; tree nodes A–G reside in GPU global memory.)
Recursive Search on GPU
simply does not work
• Hierarchical spatial indexing structures such as R-Trees or
KDB-Trees are inherently ill-suited to the CUDA environment.
• Irregular search paths and recursion make it hard to
maximize the utilization of the GPU.
– The 48KB shared memory overflows when the tree height is > 5.
MPTS
(Massively Parallel 3-Phase Scan)
• Leftmost search
– Choose the leftmost overlapping child node, no matter how many child nodes overlap.
• Rightmost search
– Choose the rightmost overlapping child node, no matter how many child nodes overlap.
• Parallel scanning
– Between the two leaf nodes found, perform a massively parallel scan
to filter out non-overlapping data elements (a kernel sketch follows the figure below).
(Figure: subtrees outside the leftmost and rightmost search paths are pruned out.)
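The third phase might look like the sketch below (an assumption, reusing the MBR struct from the MPES sketch; not the authors' code). Phases 1 and 2 return the array positions of the leftmost and rightmost overlapping leaves, and the kernel scans only the leaf entries in between:

__global__ void mptsPhase3(const MBR* leafEntries, int first, int last,
                           MBR query, int* hit)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = first + blockIdx.x * blockDim.x + threadIdx.x;
         i <= last; i += stride) {
        bool overlap = true;
        for (int d = 0; d < DIM; d++)
            overlap = overlap && (leafEntries[i].lo[d] <= query.hi[d] &&
                                  query.lo[d] <= leafEntries[i].hi[d]);
        hit[i] = overlap;  // entries outside [first, last] were pruned out
    }
}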
MPTS improvement
using Hilbert Curve
• Hilbert Curve: Continuous fractal space-filling curve
– Map multi-dimensional points onto 1D curve
• Recursively defined curve
– Hilbert curve of order n is constructed from four copies of the
Hilbert curve of order n-1, properly oriented and connected.
• Spatial Locality Preserving Method
– Nearby points in 2D are also close on the 1D curve
(Figure: Hilbert curves of order 1, 2, and 3. Image source: Wikipedia)
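For reference, the well-known 2D mapping from the cited Wikipedia article fits in a few lines of C (the actual system indexes 4D points, but the idea is the same):

// Rotate/flip a quadrant appropriately (helper from the Wikipedia article).
static void rot(int n, int* x, int* y, int rx, int ry)
{
    if (ry == 0) {
        if (rx == 1) {
            *x = n - 1 - *x;
            *y = n - 1 - *y;
        }
        int t = *x; *x = *y; *y = t;  // swap x and y
    }
}

// Map point (x, y) in an n-by-n grid (n a power of two) to its
// one-dimensional Hilbert curve index.
int xy2d(int n, int x, int y)
{
    int rx, ry, d = 0;
    for (int s = n / 2; s > 0; s /= 2) {
        rx = (x & s) > 0;
        ry = (y & s) > 0;
        d += s * s * ((3 * rx) ^ ry);
        rot(n, &x, &y, rx, ry);
    }
    return d;
}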
MPTS improvement
using Hilbert Curve
• The Hilbert curve is well known for its spatial clustering property.
– Sort the data along the Hilbert curve.
– Similar data are clustered nearby.
– The gap between the leftmost leaf node and the rightmost leaf
node is reduced.
– The number of visited nodes decreases.
(Figure: with Hilbert-sorted data, more subtrees are pruned out.)
Drawback of MPTS
• MPTS reduces the number of leaf nodes to be
accessed, but it still accesses a large number of
leaf nodes that do not contain the requested data.
• Hence we designed a variant of R-trees that works
on the GPU without the stack problem and does not
access leaf nodes that lack the requested data.
– MPHR-Trees (Massively Parallel Hilbert R-Trees)
MPHR-tree (Massively Parallel Hilbert R-tree)
Bottom-up construction on the GPU
1. Sort data using Hilbert curve index
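In Thrust, step 1 could be sketched like this (an assumption about the implementation, not the authors' code): compute every point's Hilbert index on the device, then sort point ids by that key:

#include <thrust/device_vector.h>
#include <thrust/sort.h>

// hilbertKey[i] holds the Hilbert index of point i, e.g. computed by a
// kernel built around a mapping such as xy2d() shown earlier.
void sortByHilbertIndex(thrust::device_vector<unsigned int>& hilbertKey,
                        thrust::device_vector<int>& pointId)
{
    // After this call, Hilbert-adjacent points are neighbors in pointId.
    thrust::sort_by_key(hilbertKey.begin(), hilbertKey.end(),
                        pointId.begin());
}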
MPHR-tree (Massively Parallel Hilbert R-tree)
Bottom-up construction on the GPU
2. Build R-trees in a bottom-up fashion
Store the maximum Hilbert value (max) along with each MBR
MPHR-tree (Massively Parallel Hilbert R-tree)
Bottom-up construction on the GPU
• Basic idea
– Use a parallel reduction to generate the MBR of a parent node and to
compute its maximum Hilbert value (see the sketch after the figure below).
(Figure: threads thread[0] … thread[K-1] on SMP0–SMP2 reduce the level-n child entries R4–R12, whose maximum Hilbert values run from 6 to 159, into the level-(n+1) parent entries R1 (44), R2 (96), and R3 (159), building the tree bottom-up in parallel.)
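A simplified sketch of one construction step (hypothetical flat layout; MBR is the struct from the MPES sketch). The slides' kernel reduces each node with many threads of an SMP; a one-thread-per-parent fold keeps this sketch short. Because the children are Hilbert-sorted, the parent's maximum Hilbert value is simply that of its last child:

#define FANOUT 256  // fanout B used in the experiments

struct Node {
    MBR mbr;                  // bounding rectangle of the subtree
    unsigned int maxHilbert;  // largest Hilbert value in the subtree
};

// Build one tree level: thread p reduces children
// child[p*FANOUT .. p*FANOUT+FANOUT-1] into parent[p].
__global__ void buildLevel(const Node* child, Node* parent, int numParents)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numParents) return;
    Node out = child[p * FANOUT];
    for (int c = 1; c < FANOUT; c++) {
        const Node& ch = child[p * FANOUT + c];
        for (int d = 0; d < DIM; d++) {  // enlarge the parent MBR
            out.mbr.lo[d] = fminf(out.mbr.lo[d], ch.mbr.lo[d]);
            out.mbr.hi[d] = fmaxf(out.mbr.hi[d], ch.mbr.hi[d]);
        }
        out.maxHilbert = ch.maxHilbert;  // children are Hilbert-sorted
    }
    parent[p] = out;
}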
MPHR-tree (Massively Parallel Hilbert R-tree)
Searching on the GPU
• Iterate leftmost search and parallel scan using the Hilbert curve index
– leftmostSearch() visits the leftmost search path whose Hilbert
index is greater than the given Hilbert index
lastHilbertIndex = 0;
while (1) {
    leftmostLeaf = leftmostSearch(lastHilbertIndex, QueryMBR);
    if (leftmostLeaf < 0) break;  // no further overlapping leaf exists
    lastHilbertIndex = parallelScan(leftmostLeaf);  // scan forward, return last Hilbert index seen
}
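A sequential sketch of leftmostSearch() over the flat, level-contiguous layout used above (hypothetical helpers; in the real kernel the per-node MBR tests run in parallel across threads):

// Overlap test between two rectangles.
bool overlaps(const MBR& a, const MBR& b)
{
    for (int d = 0; d < DIM; d++)
        if (a.hi[d] < b.lo[d] || b.hi[d] < a.lo[d]) return false;
    return true;
}

// nodes[levelStart[h] + i] is the i-th node at height h (leaves at 0).
// Returns the leftmost overlapping leaf whose max Hilbert value exceeds
// lastHilbert, or -1 if none remains.
int leftmostSearch(const Node* nodes, const int* levelStart, int height,
                   unsigned int lastHilbert, MBR query)
{
    int idx = 0;  // the root is the only node at the top level
    for (int h = height; h > 0; h--) {
        int next = -1;
        for (int c = 0; c < FANOUT; c++) {  // leftmost qualifying child
            const Node& ch = nodes[levelStart[h - 1] + idx * FANOUT + c];
            if (ch.maxHilbert > lastHilbert && overlaps(ch.mbr, query)) {
                next = idx * FANOUT + c;
                break;
            }
        }
        if (next < 0) return -1;
        idx = next;
    }
    return idx;  // index of the leftmost overlapping leaf at level 0
}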
Left-most Search
(Figure: left-most search example on a two-level MPHR-tree. Root entries R1 (max Hilbert value 159) and R2 (231) point to the level-1 nodes R3 (44), R4 (96), R5 (159), R6 (210), and R7 (231), which in turn point to the level-0 leaf entries D1 (6) through D14 (231). The search descends to the leftmost overlapping leaf, then keeps parallel scanning as long as overlapping leaf nodes remain.)
MPTS vs MPHR-Tree
(Figure: in MPTS, only subtrees outside the leftmost and rightmost search paths are pruned out; in MPHR-trees, every subtree that contains no requested data is pruned out.)
• Search complexity of the MPHR-tree:
C · ⌈log_B N⌉ + k · C
where k is the number of leaf nodes that contain requested data
(N is the number of indexed entries, B the fanout, and C the cost of accessing one node).
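A worked instance of this formula (reading C as the per-node access cost, which the slide leaves implicit): with fanout B = 256 and N = 4 × 10^7 entries, ⌈log_256 (4 × 10^7)⌉ = 4, so one leftmost descent touches at most 4 nodes and the whole search costs roughly (4 + k) · C node accesses.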
Braided Parallelism vs Data Parallelism
(Figure: braided parallel indexing vs data parallel indexing.)
• Braided Parallel Indexing
– Multiple queries can be processed in parallel.
• Data Parallel Indexing (Partitioned Indexing)
– A single query is processed by all the CUDA SMPs
– using partitioned R-trees (see the sketch below).
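Schematically, the two strategies differ only in how blocks are mapped onto work (hypothetical kernel signatures; traversal bodies omitted):

// Braided: each block serves its own query against the shared tree,
// maximizing throughput over a batch of queries.
__global__ void braidedSearch(const Node* tree, const MBR* queries,
                              int numQueries, int* results)
{
    if (blockIdx.x >= numQueries) return;
    MBR q = queries[blockIdx.x];  // one query per block
    // ... all threads of this block traverse `tree` for q ...
}

// Data parallel (partitioned): every block searches its own partition
// for the SAME query, minimizing the response time of a single query.
__global__ void partitionedSearch(const Node* const* partitionTrees,
                                  MBR query, int* results)
{
    const Node* myTree = partitionTrees[blockIdx.x];  // one partition per block
    // ... all threads of this block traverse myTree for `query` ...
}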
Performance Evaluation
Experimental Setup (MPTS vs MPHR-tree)
• CUDA Toolkit 5.0
• Tesla Fermi M2090 GPU card
– 16 SMPs
– Each SMP has 32 CUDA cores, which enables 512 (16 × 32)
threads to run concurrently.
• Datasets
– 40 million 4D points in uniform, normal, and
Zipf distributions
Performance Evaluation
MPHR-tree Construction
• 12KB pages (fanout = 256), 128 CUDA blocks × 64 threads per block
• It takes only 4 seconds to build an R-tree over 40 million data
points, while the CPU takes more than 40 seconds (10× speedup).
– Excluding memory transfer time, it takes only 50 msec
(800× speedup).
Performance Evaluation
MPTS Search vs MPES Search
• 12KB pages (fanout = 256), 128 CUDA blocks × 64 threads per
block, selection ratio = 1%
• MPTS outperforms MPES and R-trees on a Xeon E5506 (8 cores)
– In high dimensions, MPTS accesses more memory blocks, but the
number of instructions executed per warp is smaller than in MPES
Performance Evaluation
MPHR-tree Search
• 12KB pages (fanout = 256), 128 CUDA blocks × 64 threads per block
• The MPHR-tree consistently outperforms the other indexing methods
– In terms of throughput, braided MPHR-trees show an order of magnitude higher
performance than multi-core R-trees and MPES.
– In terms of query response time, partitioned MPHR-trees are an order of
magnitude faster than multi-core R-trees and MPES.
Performance Evaluation
MPHR-tree Search
• In a cluster environment, MPHR-trees show an order of
magnitude higher throughput than the LBNL FastQuery library.
– LBNL FastQuery is a parallel bitmap indexing library for multi-core
architectures.
Summary
• Brute-force parallel methods can be refined with
more sophisticated parallel algorithms.
• We proposed new parallel tree traversal
algorithms and showed that they significantly
outperform traditional recursive traversal of
hierarchical tree structures.
Q&A
• Thank You
MPTS improvement
using Sibling Check
• When the current node doesn't have any overlapping children,
check its sibling nodes!
– It is always better to prune out tree nodes at an upper level.
CUDA
• GPGPU (General-Purpose Graphics Processing Unit)
– CUDA is a set of development tools for creating applications
that execute on the GPU.
– GPUs allow the creation of a very large number of concurrently
executing threads at very low system resource cost.
– CUDA also exposes a fast shared memory (48KB) that can be
shared among the threads of a block.
Tesla M2090: 16 × 32 = 512 cores
Image source: Wikipedia
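A minimal, self-contained CUDA program showing the model described above (a generic example, not from the talk):

#include <cstdio>

// Each of ~1M lightweight threads computes its global id and scales
// one array element.
__global__ void scaleArray(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    scaleArray<<<(n + 255) / 256, 256>>>(d, n, 2.0f);  // 4096 blocks x 256 threads
    cudaDeviceSynchronize();
    cudaFree(d);
    printf("done\n");
    return 0;
}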
Grids and Blocks of CUDA Threads
• A kernel is executed as a grid of thread blocks
– All threads share the data memory space
• A thread block is a batch of threads that can cooperate with
each other by:
– Synchronizing their execution
• For hazard-free shared memory accesses
– Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1, a 3 × 2 grid of blocks; Block (1, 1) of Grid 2 is expanded into a 5 × 3 array of threads.)
Courtesy: NVIDIA
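The in-block cooperation described above looks like this in code (a generic example, assuming 256 threads per block; not from the talk):

// Each thread stages one element in low-latency shared memory; after the
// barrier, threads safely read a neighbor's element. Threads in different
// blocks have no such channel.
__global__ void neighborSum(const float* in, float* out, int n)
{
    __shared__ float tile[256];  // well under the 48KB per-block limit
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // hazard-free shared memory accesses
    float right = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1] : 0.0f;
    if (i < n) out[i] = tile[threadIdx.x] + right;
}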