Real-time Ray Tracing on GPU with BVH-based Packet Traversal
Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek
Stefan Popov, High Performance GPU Ray Tracing

Background
- GPUs are attractive for ray tracing:
  - High computational power
  - Shading-oriented architecture
- Previous GPU ray tracers:
  - Carr: the ray engine
  - Purcell: full ray tracing on the GPU, based on grids
  - Ernst: kd-trees with a parallel stack
  - Carr, Thrane & Simonsen: BVHs
  - Foley, Horn, Popov: kd-trees with stackless traversal

Motivation
- So far: interactive ray tracing on the GPU, but
  - Limited model size
  - No dynamic scene support
- The G80: a new approach to the GPU
  - A high-performance general-purpose processor with graphics extensions
  - PRAM architecture
- BVHs allow for:
  - Dynamic / deformable scenes
  - A small memory footprint
- Goal: recursive ordered traversal of BVHs on the G80

GPU Architecture (G80)
- Multi-threaded scalar architecture with 16 (multi-)cores
- Latencies to hide:
  - Off-chip memory operations
  - Instruction dependencies
  - 4 or 16 cycles to issue an instruction
- Threads cover these latencies
- 12K hardware threads, kept in per-core chunk pools
- 8-wide SIMD per core, 128 scalar cores in total
- Cores process threads in 32-wide SIMD chunks
[Diagram: 16 multi-cores, each with an instruction pointer and a chunk pool of 32-thread chunks]

GPU Architecture (G80)
- Scalar register file (8K), partitioned among the running threads
- Shared memory (16 KB): on-chip, 0-cycle latency, partitioned among the running threads
- On-board memory (768 MB): large latency (~200 cycles), read/write from within a thread, uncached
- Read-only L2 cache (128 KB): on-chip, shared among all threads

Programming the G80
- CUDA: a C-based language with parallel extensions
- GPU utilization reaches 100% only if:
  - Enough threads are present (>> 12K)
  - Every thread uses fewer than 10 registers and 5 words (32-bit) of shared memory
  - There are enough computations per transferred word of data (bandwidth << computational power)
  - The memory access pattern is adequate to allow read combining

Performance Bottlenecks
- Efficient per-thread stack implementation:
  - Shared memory is too small; it would limit parallelism
  - On-board memory is uncached, so enough computation is needed between stack operations
- Efficient memory access pattern:
  - Use the texture caches; however, only a few words of cache are available per thread
  - Read successive memory locations in successive threads of a chunk: a single round trip to memory (read combining)
  - Cover the latency with enough computation

Ray Tracing on the G80
- Map each ray to one thread: enough threads to keep the GPU busy
- Recursive ray tracing: use a per-thread stack stored in on-board memory
  - Efficient, since enough computations are present
- But how to do the traversal?
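The per-thread stack traversal mentioned above can be sketched on the CPU side. This is a minimal illustration, not the paper's implementation: the `Node` layout, `hitBox`, and `firstHitLeaf` are hypothetical names, ordered near/far descent and triangle intersection are omitted, and on the G80 the stack array would live in per-thread on-board memory.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical flattened BVH node; the real GPU layout differs.
struct Node {
    float bmin[3], bmax[3];
    int left, right;   // child indices; left == -1 marks a leaf
    int leafId;        // geometry reference, valid for leaves
};

// Slab test: does the ray hit the node's box within [0, tmax]?
bool hitBox(const Node& n, const float org[3], const float invDir[3], float tmax) {
    float t0 = 0.0f, t1 = tmax;
    for (int a = 0; a < 3; ++a) {
        float tn = (n.bmin[a] - org[a]) * invDir[a];
        float tf = (n.bmax[a] - org[a]) * invDir[a];
        if (tn > tf) std::swap(tn, tf);
        t0 = std::max(t0, tn);
        t1 = std::min(t1, tf);
    }
    return t0 <= t1;
}

// Iterative traversal with an explicit per-ray stack. Each GPU thread
// would own one such stack; here it is a plain local array.
int firstHitLeaf(const std::vector<Node>& bvh, const float org[3], const float invDir[3]) {
    int stack[64];
    int sp = 0;
    stack[sp++] = 0;                        // start at the root
    while (sp > 0) {
        const Node& n = bvh[stack[--sp]];
        if (!hitBox(n, org, invDir, 1e30f)) continue;
        if (n.left == -1) return n.leafId;  // leaf: intersect geometry here
        stack[sp++] = n.right;              // one child deferred on the stack
        stack[sp++] = n.left;               // the other visited next
    }
    return -1;                              // no leaf hit
}
```

A 64-entry stack comfortably covers BVH depths for scenes of this size; the key point is that stack pushes and pops are interleaved with enough box tests to hide the memory latency.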
- Skip pointers (Thrane): no ordered traversal
- Geometric images (Carr): a single mesh only
- Shared stack traversal

SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node:
  - Intersect all rays with both children and determine the traversal order
  - Push the far child (if any) on a stack and descend to the near one with the packet
- At a leaf:
  - Intersect all rays with the contained geometry
  - Pop the next node to visit from the stack

PRAM Basics
- The PRAM model:
  - Implicitly synchronized processors (threads)
  - Memory shared between all processors
- Basic PRAM operations:
  - Parallel OR in O(1)
  - Parallel reduction in O(log N)
[Diagram: a parallel OR over boolean inputs, and a pairwise sum reduction tree: 12 + 32 = 44, 11 + 9 = 20, 44 + 20 = 64]

PRAM Packet Traversal of BVH
- The G80 is a PRAM machine at chunk level
- Map packet → chunk, ray → thread
- Threads behave as in the single-ray traversal:
  - At a leaf: intersect with the geometry, pop the next node from the stack
  - At a node: decide which children to visit and in what order, push the far child
- Difference: how rays choose which node to visit first
  - It might not be the one an individual ray wants

PRAM Packet Traversal of BVH
- Choosing the child traversal order:
  - A PRAM OR determines whether all rays agree on visiting the same node first
  - The result is stored in shared memory
- In case of divergence, choose the child with more ray candidates:
  - PRAM SUM over +/-1 per thread (-1 for the left node); the result's sign picks the child
- This guarantees synchronous traversal of the BVH

PRAM Packet Traversal of BVH
- Stack:
  - The near and far children are the same for all threads, so store them once
  - Keep the stack in shared memory: only a few bits per thread!
  - Only thread 0 performs the stack operations
- Reading data:
  - All threads work on the same node / triangle
  - Sequential threads bring in sequential words: a single load operation
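The chunk-level vote described above (a parallel OR to detect agreement, then a signed sum whose sign picks the majority child) can be sketched sequentially. On the G80 both reductions would run across the 32 threads of a chunk through shared memory; the function names here are illustrative, not from the paper.

```cpp
#include <cassert>
#include <vector>

// Per-ray preference for one internal node: true = wants the left child first.

// PRAM OR over the chunk: do all rays agree on the first child?
// (Two parallel ORs, each O(1) on a PRAM.)
bool allAgree(const std::vector<bool>& wantsLeft) {
    bool anyLeft = false, anyRight = false;
    for (bool w : wantsLeft) (w ? anyLeft : anyRight) = true;
    return !(anyLeft && anyRight);
}

// PRAM SUM over +/-1, with -1 for the left node; a negative result
// means the majority prefers the left child. (O(log N) reduction.)
bool majorityWantsLeft(const std::vector<bool>& wantsLeft) {
    int s = 0;
    for (bool w : wantsLeft) s += w ? -1 : +1;
    return s < 0;
}

// Traversal-order decision for the whole chunk at an internal node.
bool visitLeftFirst(const std::vector<bool>& wantsLeft) {
    if (allAgree(wantsLeft)) return wantsLeft[0];  // unanimous
    return majorityWantsLeft(wantsLeft);           // divergent: majority vote
}
```

Because every thread of the chunk evaluates the same decision, the whole packet descends into the same child, which is what keeps the traversal synchronous.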
  - A single round trip to memory
- Implementable in CUDA

Results
Scene                     #Tris   FPS, primary (1024x1024)   FPS, shading (1024x1024)
Conference                 282K   16 (19)                    6.1
Conference (with ropes)    282K   16.7                       6.7
Soda Hall                  2.1M   13.6 (16.2)                5.7
Power Plant, outside      12.7M   6.4                        2.9
Power Plant, furnace      12.7M   -                          1.9

Analysis
- Coherent branch decisions and memory accesses
- Small footprint of the data structure: can trace models of up to 12 million triangles
- The program becomes compute bound
  - Determined by over-/under-clocking the core and the memory
- No frustums required: good for secondary rays, bad for primary rays
  - Rasterization can be used for the primary rays
- Implicit SIMD: easy shader programming
- Running on a GPU: shading "for free"

Dynamic Scenes
- Update parts of, or the whole, BVH and geometry on the GPU
- Use the GPU for ray tracing and the CPU for BVH construction / refitting
- Construct the BVH using binning
  - Similar to Wald RT07 / Popov RT06
  - Bin all 3 dimensions using SIMD
  - Results in > 10% better trees (measured as SAH quality, not FPS)
  - The speed loss is almost negligible

Results
Scene          #Tris   Exact SAH   Binning 1D            Binning 3D
                       speed       speed      quality    speed      quality
Conference      282K   0.8 s       0.15 s     92.5%      0.2 s      99.4%
Soda Hall       2.1M   8.78 s      1.28 s     103.5%     1.59 s     101.6%
Power Plant    12.7M   119 s       6.6 s      99.4%      8.1 s      100.5%
Boeing          348M   5605 s      572 s      94.8%      667 s      98.1%

Conclusions
- A new recursive PRAM BVH traversal algorithm
  - Very well suited to the new generation of GPUs
  - No additional pre-computed data required
- The first GPU ray tracer to handle large models
  - Previous implementations were limited to < 300K triangles
- Can handle dynamic scenes
  - By using the CPU to update the geometry / BVH

Future Work
- More features: shaders, adaptive anti-aliasing, ...
- Global illumination
- Code optimizations: the current implementation uses too many registers
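The binned SAH construction from the Dynamic Scenes slide can be sketched in one dimension, where primitives reduce to centroids on a single axis and "surface area" degenerates to interval length. This is a simplified illustration with hypothetical names (`bestSplitBin`, a default of 16 bins); a real builder bins all three dimensions with SIMD and sweeps full AABBs.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Binned SAH in 1D. Returns the index i of the cheapest split plane,
// i.e. the plane between bins i-1 and i.
int bestSplitBin(const std::vector<float>& cent, int numBins = 16) {
    const float FMAX = std::numeric_limits<float>::max();
    float lo = *std::min_element(cent.begin(), cent.end());
    float hi = *std::max_element(cent.begin(), cent.end());
    float scale = numBins / (hi - lo);

    // Bin the centroids, tracking per-bin counts and tight bounds.
    std::vector<int> cnt(numBins, 0);
    std::vector<float> bmin(numBins, FMAX), bmax(numBins, -FMAX);
    for (float c : cent) {
        int b = std::min(numBins - 1, int((c - lo) * scale));
        cnt[b]++;
        bmin[b] = std::min(bmin[b], c);
        bmax[b] = std::max(bmax[b], c);
    }

    // Suffix pass: tight extent and count of everything right of plane i.
    std::vector<float> rExt(numBins, 0.0f);
    std::vector<int> rCnt(numBins, 0);
    float rmn = FMAX, rmx = -FMAX; int rc = 0;
    for (int i = numBins - 1; i >= 1; --i) {
        rmn = std::min(rmn, bmin[i]); rmx = std::max(rmx, bmax[i]); rc += cnt[i];
        rCnt[i] = rc; rExt[i] = rc ? rmx - rmn : 0.0f;
    }

    // Prefix sweep: cost(i) = leftExtent * leftCount + rightExtent * rightCount.
    int best = -1; float bestCost = FMAX;
    float lmn = FMAX, lmx = -FMAX; int lc = 0;
    for (int i = 1; i < numBins; ++i) {
        lmn = std::min(lmn, bmin[i - 1]); lmx = std::max(lmx, bmax[i - 1]); lc += cnt[i - 1];
        float lExt = lc ? lmx - lmn : 0.0f;
        float cost = lExt * lc + rExt[i] * rCnt[i];
        if (cost < bestCost) { bestCost = cost; best = i; }
    }
    return best;
}
```

One pass over the primitives plus a sweep over the bins is why binning beats the exact SAH build by an order of magnitude while losing only a few percent of tree quality, as the second Results table shows.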
Thank you!

CUDA Hello World

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addArrays(int *arr1, int *arr2) {
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main() {
    const int N = 4194304;                  // number of elements per array
    const size_t bytes = N * sizeof(int);   // size of each array in bytes
    int *inArr1 = (int*)malloc(bytes), *inArr2 = (int*)malloc(bytes);
    int *ta1, *ta2;
    cudaMalloc((void**)&ta1, bytes);
    cudaMalloc((void**)&ta2, bytes);
    for (int i = 0; i < N; i++) { inArr1[i] = rand(); inArr2[i] = rand(); }
    cudaMemcpy(ta1, inArr1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, bytes, cudaMemcpyHostToDevice);
    // One thread per element: N / 512 blocks of 512 threads each.
    addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);
    cudaMemcpy(inArr1, ta1, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", inArr1[i]);
    return 0;
}