HistoPyramid stream compaction and expansion

Christopher Dyken (1, *) and Gernot Ziegler (2)
Advanced Computer Graphics / Vision Seminar, TU Graz, 23/10-2007
(1) University of Oslo
(2) Max-Planck-Institut für Informatik, Saarbrücken

GPUs are highly parallel and well suited to data-local operations. However, re-arranging or selectively deleting data is difficult on GPUs. Stream compaction and expansion are such operations:

Stream compaction
For each element in the input stream, let a predicate determine whether the element should be discarded. Produce a compact output stream of the remaining elements.

Generalization: Stream compaction and expansion
For each element in the input stream, let a predicate determine the element's multiplicity, i.e. how many times the element should be present in the output stream (0 ⇒ discard). Produce a compact output stream of all elements with a multiplicity greater than 0.

Why bother with stream compaction and expansion?
- Feature extraction:
  - Get a compact list of all locations satisfying a criterion.
- Point cloud generation:
  - Generate a set of points on an implicitly defined surface.
- Compaction of intermediate results:
  - Save GPU → CPU bandwidth.
- Emulate/offload geometry shader:
  - Our HP-based Marching Cubes implementation does not need GS, and currently even outperforms GS-based approaches.
- Sparse matrix extraction.
- ...
Data re-organization is an active field of research (see GPU Gems II and III, etc.)

The HistoPyramid algorithm
- The input stream is laid out over an N × N grid (= tex2D).
- Each input element is subjected to a predicate function:
  - count = 0 ⇒ discard from output stream.
  - count = 1 ⇒ keep in output stream.
  - count > 1 ⇒ repeat in output stream.
- The output of the predicate function forms the HP base layer.
  - The HP is a mipmap pyramid of partial sums.
  - The HP is built in log2(N) passes.
  - Top element of the HP: number of elements in the output.
- Then, iterate over output elements:
  - Extraction of an element is done in log2(N) texture lookups.
  - Each input element can have multiple copies in the output.
  - No data transfer from GPU to CPU.

Overview
[Figure: pipeline overview. Input image → Predicate → Bucket count (base layer) → HP-builder → HistoPyramid (top value 12) → Extractor → Point list. Point list shown: (2,1) (2,2) (5,1) (5,2) (0,4) (1,5) (2,6) (3,6) (7,4) (6,5) (4,6) (5,6)]

Predicate function
- For each input element, determine the output stream count.
  - The count is often binary (1/0).
- For different predicates, several base layers can be built:
  - One HP is built per base layer, in parallel.
  - Predicates may overlap if needed!
  - NV40: 4×RGBA = 16 predicates, G80: 8×RGBA = 32 predicates.
- Example: extract a list of edge pixels (Laplace + threshold):
[Figure: input data, and the result of Laplace + threshold, which is the HP base level predicate]

HistoPyramid builder
- Mipmap generation, but with sum instead of average:
- In effect, each cell counts the elements in its sub-pyramid:
[Figure: example pyramid.
  Base level, 4 × 4 (rows y = 0..3): 1 1 0 1 / 1 0 1 0 / 0 2 0 1 / 1 0 0 0
  Level 1, 2 × 2: 3 2 / 3 1
  Level 2, 1 × 1: 9]
- Top element: total number of elements in the base layer ⇒ output size.
- Example: HP of the Lena edge pixels (red = nonzero count):
[Figure: HistoPyramid levels of the Lena edge image]

Pointlist extractor
Input: the output index is used as key index.
[Figure: traversal example on the pyramid L2 = 9, L1 = 3 2 / 3 1, L0 = 4 × 4 base layer.
  Input, key indices: 0 1 2 3 4 5 6 7 8
  Intervals at L1: [0,3) [3,5) [5,8) [8,9)
  Intervals at L0 inside one L1 cell: ∅ [0,2) [2,3) ∅ (a second cell shows [0,1) and [1,2), with ∅ for empty cells)
  Output, texcoords & clone index: [0,0],0 [1,0],0 [0,1],0 [3,0],0 [2,1],0 [1,2],0 [1,2],1 [0,3],0 [3,2],0]
Notice: the multiplicity comes from the base layer: = 1 ⇒ copy once, > 1 ⇒ copy multiple times.

The Marching Cubes Algorithm
- The input to the algorithm is an M³ grid of scalar values.
- Examine groups of 2 × 2 × 2 voxels (MC cell).
- Check whether the MC cell's corners are inside/outside the iso-level.
- 8 corners, inside/outside ⇒ 256 classes.
- Each MC class: a combination of edges that pierce the iso-surface.
- Use a table with geometry for the MC classes, with all possible triangulations of the edge intersections (figure).
- Determine the exact edge-surface intersections and emit the corresponding triangles.
Notice: Effectively a stream compaction/expansion process!

HistoPyramid Marching Cubes
[Figure: per-frame pipeline. Steps: Start new frame, Update scalar field, Build HP base, HP reduce (for each level), Vertex count readback, Render geometry. Data: Iso-level, Scalar field texture, Triangulation table texture, Vertex count texture, HistoPyramid texture, Enumeration VBO]

Input: A stream of (M − 1)³ MC cells (2 × 2 × 2 voxels grouped).
Predicate: Samples the MC cell corners, determines the MC class from their inside/outside state, then writes the number of vertices required for the MC geometry to the base layer.
HistoPyramid: The top element gives the total number of vertices in the iso-surface (3× the number of triangles).
Extraction: Use the output index to traverse the HP and determine the corresponding input element (i.e. which MC cell). The remainder tells which edge intersection this vertex corresponds to. Determine that edge intersection and emit its position.
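The build and extraction steps described above can be sketched on the CPU. This is a minimal illustration, not the authors' shader code: the names `build_histopyramid` and `extract` are hypothetical, the grid is assumed to be square with a power-of-two side, and the child visit order (x first, then y) is an assumption inferred from the output list on the point list extractor slide.

```python
def build_histopyramid(base):
    """Return [base, level 1, ..., top]; every cell is the sum of the
    2 x 2 block below it (mipmap reduction with sum instead of average)."""
    levels = [base]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        half = len(prev) // 2
        levels.append([
            [prev[2 * y][2 * x] + prev[2 * y][2 * x + 1]
             + prev[2 * y + 1][2 * x] + prev[2 * y + 1][2 * x + 1]
             for x in range(half)]
            for y in range(half)
        ])
    return levels


def extract(levels, key):
    """Map an output index (key) to (x, y, clone) in the base layer.

    One step per level below the top: walk the four children, subtract
    the counts of the children that are skipped, and descend into the
    child whose interval contains the key."""
    total = levels[-1][0][0]
    if not 0 <= key < total:
        raise IndexError("key index outside [0, total)")
    x = y = 0
    for level in reversed(levels[:-1]):
        x, y = 2 * x, 2 * y
        # Assumed child order: (0,0), (1,0), (0,1), (1,1).
        for dx, dy in ((0, 0), (1, 0), (0, 1), (1, 1)):
            count = level[y + dy][x + dx]
            if key < count:
                x, y = x + dx, y + dy
                break
            key -= count
    # What is left of the key is the clone index within the base cell.
    return x, y, key


# Base layer of the slide example, indexed as BASE[y][x].
BASE = [
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 2, 0, 1],
    [1, 0, 0, 0],
]

levels = build_histopyramid(BASE)
output_size = levels[-1][0][0]  # top element = number of output elements
points = [extract(levels, key) for key in range(output_size)]
print(output_size, points)
```

In HistoPyramid Marching Cubes the same traversal would apply with vertex counts in the base layer: the returned cell identifies the MC cell, and the clone index is the remainder that selects which vertex of that cell to emit.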

Datasets used in the performance analysis
[Figure: renderings of the six datasets: Bunny, CThead, MRbrain, Bonsai, Aneurism, Cayley]

Performance of HistoPyramid Marching Cubes

Model     Size         MC cells  Density  6600GT HP-VS  7800GT HP-VS  8800GTX HP-VS   8800GTX NV-SDK10
Bunny     255x255x255  16581375   3.2%    –             –             538.6 (32.5)    –
          127x127x127   2048383   5.6%    5.4 (2.6)     11.8 (5.8)    309.5 (151.1)   –
          63x63x63       250047   9.1%    4.0 (16.1)    8.5 (34.1)    163.4 (653.5)   28.3 (113.2)
          31x31x31        29791  13.6%    2.5 (82.8)    5.0 (167.9)   25.5 (857.0)    21.9 (734.0)
CThead    255x255x128   8323200   3.7%    –             16.3 (2.0)    437.6 (53.0)    –
          127x127x63    1016127   6.3%    5.4 (5.3)     11.6 (11.5)   288.1 (283.6)   –
          63x63x31       123039   9.6%    3.7 (29.9)    7.7 (62.2)    97.3 (791.0)    25.3 (205.9)
          31x31x15        14415  14.5%    2.3 (161.3)   4.5 (311.5)   12.9 (896.4)    17.1 (1187.0)
MRbrain   255x255x128   8323200   5.8%    –             10.5 (1.3)    309.0 (37.4)    –
          127x127x63    1016127   7.4%    4.6 (4.5)     9.9 (9.7)     257.7 (263.6)   –
          63x63x31       123039  10.0%    3.5 (28.6)    7.4 (60.0)    96.8 (786.5)    26.4 (214.9)
          31x31x15        14415  14.9%    2.2 (155.0)   4.3 (300.9)   12.7 (879.7)    18.2 (1257.4)
Bonsai    255x255x255  16581375   3.0%    –             –             560.8 (33.8)    –
          127x127x127   2048383   5.1%    5.9 (2.9)     13.0 (6.3)    329.8 (161.0)   –
          63x63x63       250047   6.7%    5.4 (21.5)    11.4 (45.5)   186.5 (745.9)   28.9 (115.6)
          31x31x31        29791   8.2%    4.1 (136.8)   8.0 (268.8)   25.1 (843.0)    24.0 (804.6)
Aneurism  255x255x255  16581375   1.6%    –             –             892.5 (53.8)    –
          127x127x127   2048383   2.1%    12.6 (6.1)    29.1 (14.2)   557.6 (272.2)   –
          63x63x63       250047   3.7%    9.1 (36.2)    19.2 (76.7)   190.5 (761.9)   32.9 (131.5)
          31x31x31        29791   6.8%    4.5 (149.7)   8.6 (289.1)   25.0 (839.3)    25.5 (856.6)
Cayley    255x255x255  16581375   0.9%    –             –             1112.3 (67.1)   –
          127x127x127   2048383   1.9%    13.5 (6.6)    31.2 (15.2)   581.3 (283.8)   –
          63x63x63       250047   3.9%    8.5 (33.9)    17.9 (71.6)   198.0 (791.9)   32.1 (128.5)
          31x31x31        29791   8.1%    3.7 (123.8)   7.3 (245.8)   25.8 (866.2)    24.7 (827.9)

Numbers are in million voxels processed per second; parentheses give MC runs per second (framerate).

References:
- C. Dyken, G. Ziegler, C. Theobalt, H.-P. Seidel, "GPU Marching Cubes on Shader Model 3.0 and 4.0", MPI-I-2007-4-006, Max-Planck-Institut für Informatik, 2007.
- C. Dyken, J. Seland, M. Reimers, "Real-Time GPU Silhouette Refinement using adaptively blended Bézier patches", to appear in Graphics Forum, 2007.
- I. Ihrke, G. Ziegler, A. Tevs, C. Theobalt, M. Magnor, H.-P. Seidel, "Eikonal Rendering: Efficient Light Transport in Refractive Objects", to appear in ACM Trans. on Graphics (Siggraph'07), 2007.
- G. Ziegler, A. Tevs, C. Theobalt, H.-P. Seidel, "GPU Point List Generation through Histogram Pyramids", MPI-I-2006-4-002, Max-Planck-Institut für Informatik, 2006.