Speeding Up Large-Scale Geospatial
Polygon Rasterization on GPGPUs
Jianting Zhang
Department of Computer Science, the City College of New York
jzhang@cs.ccny.cuny.edu
2011 Workshop on High Performance and Distributed Geographic Information Systems (HPDGIS’11)
19th ACM SIGSPATIAL GIS: Chicago, IL, Nov 1-4, 2011
Outline
• Introduction and Motivations
• Background and Related Works
• The Serial Scan-Line Fill Algorithm
• Preprocessing Polygon Collections
• Efficient Polygon Rasterization on GPGPUs
• Experiments and Results
• Conclusion and Future Work
Introduction: Personal HPC-G
A. Clematis, M. Mineter, and R. Marciano. High performance computing with
geographical data. Parallel Computing, 29(10):1275–1279, 2003
“Despite all these initiatives the impact of parallel GIS research has
remained slight…”
“…fundamental problem remains the fact that creating parallel GIS
operations is non-trivial and there is a lack of parallel GIS
algorithms, application libraries and toolkits.”
Marrying GPGPU with GIS – The next generation High-Performance GIS in a Personal Computing Environment (Zhang 2010, HPDGIS)
• Every personal computer is now a parallel machine: CMPs and GPUs
• Multi-core CPUs have become mainstream; the more cores they have, the more GPU features they have
• NVIDIA alone shipped almost 220 million CUDA-capable GPUs between 2006 and 2010 (CACM 2010/11)
Introduction – Personal HPC-G
Chip-Multiprocessors (CMP): http://en.wikipedia.org/wiki/Multi-core_processor
• Cores per chip: dual-core -> quad-core -> six-core -> 8/10/12
• Chips per node: 1 -> 2 -> 4/8
• Intel MIC (32 cores)
• UIUC Rigel design (1024 cores)
Massively parallel GPGPU computing: hundreds of GPU cores in a GPU card
• Nvidia GTX480 (03/2010): 480 cores, 1.4 GHz, 1.5 GB, 177.4 GB/s memory bandwidth, 1.35 TFlops
• Nvidia GTX590 (03/2011): 1024 cores, 1.2 GHz, 3 GB, 327.74 GB/s memory bandwidth, 2.49 TFlops
Parallel hardware is more affordable than ever before …
Introduction – Personal HPC-G
Geospatial data volumes never stop growing
• Satellite: e.g., from GOES to GOES-R (2016)
  http://www.goes-r.gov/downloads/GOES-R-Tri.pdf
  Spectral (3X) * spatial (4X) * temporal (5X) = 60X
• Derived thematic data products (vector)
  http://www.goes-r.gov/products/baseline.html
  http://www.goes-r.gov/products/option2.html
• COM.GEO’10, SSDBM’10, ACMGIS’10, ACMGIS’11
Species distributions and movement data
• E.g., 300+ million occurrence records (GBIF)
• E.g., 717,057 polygons and 78,929,697 vertices for the distribution data of 4148 birds (NatureServe)
• Animals can move across space and time
• ACMGIS’08, ACMGIS’09, GeoInformatics’09, HPDGIS’11
Event locations, trajectories and O-D data
• E.g., taxi trip records (traces or O-D locations)
• 0.5 million in NYC and 1.2 million in Beijing per day
• From O-D to shortest paths to flow patterns
• COM.GEO’10, HPDGIS’10, ???
Motivations
[Figure: quadtree quadrant numbering (0 2 / 1 3)]
GPU-based parallel algorithm design to efficiently manage large-scale species distribution data (overlapped polygons)
Part 1: Extended quadtree to represent overlapped polygons (GeoInformatics’09 and ACMGIS’09)
Part 2: Efficient conversion from real-world geospatial polygons to quadtrees
  Step 1: from polygons to scan-line segments
  Step 2: from scan-line segments to quadtrees
Part 3: Query-driven visual exploration (ACMGIS’08 and ACMGIS’09)
Background and Related Works
Polygon rasterization on GPUs
• State of the art: OpenGL GL_POLYGON
• Problems
  • Fixed-function, proprietary, black-box
  • Does not support complex (e.g. concave) polygons; results may be incorrect (although acceptable for display purposes)
  • GL_POLYGON is much slower than GL_TRIANGLES
  • Requires a hardware context to read back rasterization results
  • Accuracy is limited by screen resolution
  • Difficult for GIS developers to implement using graphics languages
• GPGPU comes to the rescue
  • Being able to use GPU parallel computing power
  • Using C/C++ languages is more intuitive
  • Directly generating spatial data structures can be more efficient (than using rasterized images to construct quadtrees)
  • More client-server computing friendly
No previous works on polygon rasterization on GPGPUs for geospatial apps.
Background and Related Works
Spatial data structures on GPUs for computer graphics applications
• KD-Tree (Zhou et al 2008, Hou et al 2001), Octree (Zhou 2011)
• They are designed to efficiently render triangles, not to query polygons
Software rasterization of triangles
• (Laine and Karras 2011), (Pantaleoni 2011), (Schwarz and Seidel 2011)
• Results are encouraging when compared to hardware rasterization (2-8x gap)
• Again, they are designed for rasterizing/rendering triangles, not for querying polygons
Background and Related Works
Geospatial data processing on GPUs
• Pre-GPGPU
  • Using graphics data structures and primitives for spatial selection and spatial join queries (Sun et al 2003)
  • Difficult and unintuitive
• Post-GPGPU
  • Spatial similarity join (Lieberman et al 2008)
  • Density-based spatial clustering (Bohm et al 2009)
  • Min-Max quadtree for large-scale raster data (Zhang et al 2010)
  • Decoding quadtree-encoded bitplane bitmaps of large-scale raster data (Zhang et al 2011)
The Serial Scan-Line Fill Algorithm
For each scan line y from ymin to ymax
  1. Compute the intersection points with all edges
  2. Sort the intersection points and form the scan-line segments
  3. (Fill the raster cells in the scan-line segments)
End
Intersection point between the scan line y = y’ and the edge from (x1,y1) to (x2,y2):
  x’ = x1 + (y’ - y1) / (y2 - y1) * (x2 - x1)
GDAL/GRASS codebases
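The three steps above can be sketched on the CPU as follows. This is a minimal illustration, not the GDAL or GRASS code: the function names `scanline_segments` and `scanline_fill` are hypothetical, the polygon is assumed to be a single closed ring given as a list of (x, y) vertices, and the half-open edge rule (count an edge only when y1 <= y < y2 or y2 <= y < y1) is an assumption chosen to skip horizontal edges and avoid counting a shared vertex twice.

```python
def scanline_segments(polygon, y):
    """Return the interior segments [(x_start, x_end), ...] of scan line y."""
    xs = []
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Half-open rule (assumption): skips horizontal edges and counts
        # a vertex shared by two edges only once.
        if (y1 <= y < y2) or (y2 <= y < y1):
            # Step 1: intersection of the scan line with this edge.
            xs.append(x1 + (y - y1) / (y2 - y1) * (x2 - x1))
    # Step 2: sort, then pair consecutive intersections into segments.
    xs.sort()
    return [(xs[k], xs[k + 1]) for k in range(0, len(xs) - 1, 2)]


def scanline_fill(polygon):
    """Map every integer scan line between ymin and ymax to its segments."""
    ys = [p[1] for p in polygon]
    return {y: scanline_segments(polygon, y)
            for y in range(int(min(ys)), int(max(ys)) + 1)}
```

For a concave U-shaped ring such as `[(0, 0), (6, 0), (6, 4), (4, 4), (4, 2), (2, 2), (2, 4), (0, 4)]`, the scan line y = 3 crosses four edges and yields two separate segments, which is the case the slides note fixed-function rasterizers may get wrong.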
Polygon Rasterization on GPGPUs - Challenges
• Unique hardware characteristics (e.g. Nvidia Tesla C2050)
  • Large number of threads (1024 per SM, 14 SMs)
  • Limited shared memory: 48 KB per SM (shared by 1024 threads)
  • Limited registers: 32768 per SM, i.e., 32 per thread
• Need explicit shared memory management to make full use of the memory hierarchy
• Parallelizing the scan-line fill algorithm
  • Mimicking the CPU algorithm (assigning a polygon to a thread) will NOT work
    • Uncoalesced accesses to global memory are extremely inefficient
    • Insufficient registers and shared memory
  • How to assign computing blocks and threads to scan lines and polygon edges?
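The per-thread figures above follow from simple division, which the sketch below spells out. The function names are hypothetical, and the 16 bytes per vertex assumes 8-byte x and 8-byte y coordinates, as stated later in the discussion of large polygons.

```python
def per_thread_budget(threads_per_sm=1024,
                      shared_bytes_per_sm=48 * 1024,
                      registers_per_sm=32768):
    """Return (registers, shared-memory bytes) available to each thread."""
    registers = registers_per_sm // threads_per_sm
    shared_bytes = shared_bytes_per_sm // threads_per_sm
    return registers, shared_bytes


def vertices_per_thread(shared_bytes, bytes_per_vertex=16):
    """Vertices that fit in one thread's share of shared memory.

    bytes_per_vertex=16 assumes one 8-byte x and one 8-byte y coordinate.
    """
    return shared_bytes // bytes_per_vertex
```

With the slide's numbers each thread gets 32 registers and 48 bytes of shared memory, enough for only 3 vertices, which is one way to see why assigning a whole polygon to a single thread does not work.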
Polygon Rasterization on GPGPUs – Design
[Figure: GPU memory hierarchy: GPU global memory, L2, L1, and SM1, SM2, …, SMn]
• The GPU SMs are divided into 14*4 computing blocks
• A computing block has 256 threads and processes one polygon
• All threads in a computing block loop through scan lines cooperatively
Polygon Rasterization on GPGPUs – Design
[Figure: a six-vertex polygon (vertices 1-6, edges a-f); its X/Y coordinate arrays (1 2 3 4 5 6) are held in global memory and copied to shared memory]
For each scan line y from ymin to ymax
  Intersection: X O O X O
  Sorting: X X O O O
End
X/Y coordinates in shared memory are re-used (ymax-ymin-1) times
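The design can be emulated on the CPU to make the thread-to-edge mapping concrete. This is a sketch under assumptions, not the author's kernel: `block_rasterize` is a hypothetical name, each emulated thread is assumed to own one polygon edge, and the X/O pattern is assumed to mark which threads found an intersection (X) and which did not (O).

```python
def block_rasterize(xs, ys):
    """Emulate one computing block processing one polygon.

    xs, ys: vertex coordinate arrays of a closed ring.
    Returns {y: (pattern, hits)} where pattern marks per-thread hits and
    hits is the sorted list of intersection x values.
    """
    n = len(xs)
    # One copy into "shared memory"; reused for every scan line below.
    shared_x, shared_y = list(xs), list(ys)
    out = {}
    for y in range(int(min(ys)), int(max(ys)) + 1):
        slots = [None] * n            # one result slot per thread
        for tid in range(n):          # these iterations run in parallel on a GPU
            x1, y1 = shared_x[tid], shared_y[tid]
            x2, y2 = shared_x[(tid + 1) % n], shared_y[(tid + 1) % n]
            if (y1 <= y < y2) or (y2 <= y < y1):
                slots[tid] = x1 + (y - y1) / (y2 - y1) * (x2 - x1)
        pattern = "".join("X" if s is not None else "O" for s in slots)
        # Sorting moves the true intersections to the front, in x order.
        hits = sorted(s for s in slots if s is not None)
        out[y] = (pattern, hits)
    return out
```

Only the `hits` list would need to be written back to global memory; the empty slots never leave the block.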
Polygon Rasterization on GPGPUs – Sorting
Example: scanning the input 0 1 1 0 in a zero-padded buffer
  Step 0: 0 0 0 0 0 1 1 0
  Step 1: 0 0 0 0 0 1 2 1
  Step 2: 0 0 0 0 0 1 2 2
  Step 3: 0 0 0 0 0 1 2 2

__device__ inline ushort scan4(ushort num) {
    // Buffer twice the thread count: the lower half is zero padding.
    __shared__ ushort ptr[2 * MAX_PT];
    ushort val = num; uint idx = threadIdx.x;
    ptr[idx] = 0; idx += Tn;
    ptr[idx] = num; SYNC
    // Each step adds the value 1, 2, 4, ... positions to the left.
    val += ptr[idx - 1]; SYNC ptr[idx] = val; SYNC
    val += ptr[idx - 2]; SYNC ptr[idx] = val; SYNC
    val += ptr[idx - 4]; SYNC ptr[idx] = val; SYNC
    …
    // Result of exclusive scan: the inclusive sum of the left neighbour.
    val = ptr[idx - 1]; return val;
}

• GPGPUs are extremely good at sorting
• Sorting in shared memory is extremely fast
Benefits
• Only true intersection results are written back to global memory
• Saves GPU memory footprint and I/O costs
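The scan above can be followed step by step with a CPU emulation. The sketch below mirrors the zero-padded buffer and the doubling offsets of the kernel; `exclusive_scan` and `compact` are hypothetical names, and using the scan result as write positions for compacting the true intersections is an assumption about how the scan is applied, since the slide does not show that part.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum using a zero-padded buffer and doubling offsets."""
    n = len(flags)
    buf = [0] * n + list(flags)      # lower half is padding, like ptr[0..Tn)
    offset = 1
    while offset < n:
        prev = buf[:]                # all reads happen before any write (barrier)
        for tid in range(n):
            idx = tid + n
            buf[idx] = prev[idx] + prev[idx - offset]
        offset *= 2
    # Each thread returns the inclusive sum of its left neighbour.
    return [buf[tid + n - 1] for tid in range(n)]


def compact(values, flags):
    """Keep values whose flag is 1, placed at the positions given by the scan."""
    if not flags:
        return []
    pos = exclusive_scan(flags)
    out = [None] * (pos[-1] + flags[-1])
    for tid, (value, flag) in enumerate(zip(values, flags)):
        if flag:
            out[pos[tid]] = value
    return out
```

For the input 0 1 1 0 the upper half of the buffer goes through 0 1 1 0, 0 1 2 1 and 0 1 2 2, matching Steps 0 to 2 of the example, and the returned exclusive scan is 0 0 1 2.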
Experiments and Results
Data:
• NatureServe Western Hemisphere bird species distributions: http://www.natureserve.org/getData/birdMaps.jsp
• 4148 birds: http://geoteci.engr.ccny.cuny.edu/geoteci/SPTestMap.html
• 717,057 polygons, 1,199,799 rings
• 78,929,697 vertices (1.3 GB of shp files)
• Total number of scan-line/polygon edge intersections: 200+ billion
Experiments and Results
Group #           1      2      3      4      5
Min # vertices    32     64     128    256    512
Max # vertices    64     128    256    512    1024
# Threads         64     128    256    512    1024
# Polygons        46509  23880  9666   5076   3146
CPU time (ms)     526    995    1803   4490   9387
GPU time (ms)     88     49     88     224    528
Speedup           6.0X   20.1X  20.5X  20.0X  17.8X
Discussions - handling large polygons
• The current implementation cannot process polygons with more than a few thousand vertices
  • 8n bytes for x coordinates
  • 8n bytes for y coordinates
  • 4n bytes for x coordinates of the intersections
  • ~100 extra bytes
  • (20n + 100) < 48K, so n ~ 2000 (using a whole SM as a computing block)
• We have limited the number of points to the number of threads (1024); having one thread process a few vertices is not scalable
• We need a better way to handle scalability
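The bound on n can be checked by solving the inequality directly. The sketch below uses a hypothetical helper, `max_vertices`, and takes 48K either as 48 * 1024 or as 48,000 bytes, since the slide does not say which is meant.

```python
def max_vertices(shared_bytes=48 * 1024, bytes_per_vertex=20, overhead_bytes=100):
    """Largest n satisfying bytes_per_vertex * n + overhead_bytes <= shared_bytes.

    bytes_per_vertex = 8 (x) + 8 (y) + 4 (intersection x), as itemised above.
    """
    return (shared_bytes - overhead_bytes) // bytes_per_vertex
```

Either reading gives a limit of roughly 2,400 vertices, the same order as the slide's estimate of about 2000 when a whole SM serves as one computing block.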
Discussions - handling large polygons
Proposed solution: chunking the edge list, computing separately and then assembling
[Figure: the X/Y edge list (1 2 3 4 5 6) in global memory is chunked into (1 2 3) and (4 5 6); each chunk is computed in shared memory, producing for example (x1,y1) (x2,y2) (x3,y1) and (x4,y2); the results are assembled in global memory as (x1,y1) (x3,y1) (x2,y2) (x4,y2), sorting using a separate kernel]
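The proposal can be illustrated with a CPU sketch in which each chunk of edges produces its own (x, y) intersections and a final sort by scan line and then by x assembles them. This is only one reading of the figure: `chunk_intersections` and `rasterize_chunked` are hypothetical names, and the (y, x) sort key is an assumption based on the assembled order shown in the figure.

```python
def chunk_intersections(edges, y_min, y_max):
    """Intersections (x, y) of one chunk of edges with all integer scan lines."""
    hits = []
    for (x1, y1), (x2, y2) in edges:
        for y in range(y_min, y_max + 1):
            if (y1 <= y < y2) or (y2 <= y < y1):
                hits.append((x1 + (y - y1) / (y2 - y1) * (x2 - x1), y))
    return hits


def rasterize_chunked(vertices, chunk_size):
    """Chunk the edge list, compute per chunk, then assemble by sorting."""
    n = len(vertices)
    edges = [(vertices[i], vertices[(i + 1) % n]) for i in range(n)]
    ys = [v[1] for v in vertices]
    y_min, y_max = int(min(ys)), int(max(ys))
    assembled = []                       # stands in for global memory
    for start in range(0, n, chunk_size):
        chunk = edges[start:start + chunk_size]
        assembled.extend(chunk_intersections(chunk, y_min, y_max))
    # Separate sorting pass: group by scan line, then order by x.
    assembled.sort(key=lambda p: (p[1], p[0]))
    return assembled
```

Because the final sort does not depend on how the edges were split, any chunk size gives the same assembled result as processing the whole edge list at once.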
Summary and Conclusion
• Introduced a GPGPU-accelerated software rasterization framework to rasterize and index large-scale geospatial polygons
• Provided a GPGPU-based design and implementation for computing intersection points
• Achieved about 20X speedup for groups of polygons with between 64 and 1024 vertices, using the bird species distribution data for the Western Hemisphere, which has about 3/4 million polygons and more than 78 million vertices
• Discussed extending the current implementation to support polygons with arbitrarily large numbers of vertices by extensively using efficient sorting
• The work reported is preliminary: several important components in realizing a dynamically integrated vector-raster data model for high-performance geospatial analysis on GPGPUs are still under development.
Future Work
• Extend our current implementation to support large polygons with arbitrary numbers of vertices
• Implement the quadtree construction (Step 2) based on the GPGPU-computed scan-line segments (CPU/GPU)
• Perform a comprehensive performance comparison with commercial spatial database indexing
• Integrate with front-end modules in spatial databases (e.g., query parser and optimizer)
Q&A
jzhang@cs.ccny.cuny.edu