GPU-Assisted Path Tracing
Matthias Boindl, Christian Machacek
Institute of Computer Graphics and Algorithms
Vienna University of Technology

Motivation: Why Path Tracing?
- Physically based
  - Nature provides the reference image
- Parallelizable
- Sublinear in #objects
- Conceptually simple
  - Can lead to a clean implementation
- But: fast implementation on GPUs not trivial

Outline
- Path tracing intro
  - Main steps of the algorithm
- Mapping the algorithm to the GPU
  - How to organize code into kernels
  - When to launch kernels
  - How to pass data between kernels
- Acceleration structures
  - Focus on bounding volume hierarchies

Path Tracing Intro
- Like ray tracing, except it…
  - …supports arbitrary BRDFs
  - …is stochastic: at each bounce, the new direction is decided randomly
- Convergence video
From Pharr, Humphreys: PBRT, 2nd ed. (2010)

Path Tracing Pseudocode
    while image not converged
        r = new ray from eye through next pixel
        do
            i = closest intersection of r with scene
            if no i: break
            if i is on a light source:
                c = c + throughput * emission
            randomly pick new direction and create reflected ray r
            evaluate BRDF at i
            update throughput
        while path throughput high enough
From Pharr, Humphreys: PBRT, 2nd ed. (2010)

Path Tracing Pseudocode: Execution Time
- Ray cast: 56%
- Materials: 25%
- Logic: 15%
- New path: 4%
From Pharr, Humphreys: PBRT, 2nd ed. (2010)

Megakernel Execution Divergence
From Bikker (2013)

Solution: Wavefront Path Tracing
- Separate, specialized kernels
- Keep a pool of ~1 million paths alive
- Work for next stage goes into kernel-specific, compact queues (= 4 MB index arrays)
https://mediatech.aalto.fi/~samuli/

Results: Performance
- Execution times (ms / 1M path segments)

Limitations and Possible Improvements
- Higher memory requirements (+200 MB)
- Kernel launch overhead
  - Dynamic parallelism on GK110
  - Use an outer scheduling kernel
  - No CPU round trip
- Launch independent stages side-by-side
  - CUDA streams
  - So kernels with little work don't hog the GPU

Acceleration Structures
- Find nearest intersection in O(log N)
- Space partitioning vs. object partitioning
- Hybrid methods exist

Performance
- For interactive rendering, compromise between
  - Traversal performance (build quality)
  - Construction/update time
    - Update or rebuild from scratch
- Adapt to GPU environment
  - Memory architecture
  - Parallel execution

State of the Art
Tero Karras and Timo Aila. 2013. Fast parallel construction of high-quality bounding volume hierarchies. In Proceedings of the 5th High-Performance Graphics Conference (HPG '13). ACM, New York, NY, USA, 89-99.
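The "build quality" of a bounding volume hierarchy mentioned above is commonly quantified with the surface area heuristic (SAH). As a minimal sketch, the Python code below computes the SAH cost of a small binary BVH. The `Node` class, the helper names, and the cost constants `c_inner` and `c_leaf` are assumptions made for illustration and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    """Hypothetical BVH node: an axis-aligned box plus optional children."""
    lo: Vec3
    hi: Vec3
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    num_tris: int = 0  # only meaningful for leaves

def surface_area(lo: Vec3, hi: Vec3) -> float:
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def merge(a: Node, b: Node) -> Node:
    """Internal node whose box encloses both children."""
    lo = tuple(min(a.lo[i], b.lo[i]) for i in range(3))
    hi = tuple(max(a.hi[i], b.hi[i]) for i in range(3))
    return Node(lo, hi, a, b)

def sah_cost(root: Node, c_inner: float = 1.2, c_leaf: float = 1.0) -> float:
    """SAH cost: area-weighted sum over internal nodes and leaves.

    c_inner and c_leaf are assumed example constants.
    The root box is assumed to have non-zero surface area.
    """
    root_area = surface_area(root.lo, root.hi)
    total = 0.0
    stack = [root]
    while stack:
        n = stack.pop()
        rel = surface_area(n.lo, n.hi) / root_area
        if n.left is None:            # leaf
            total += c_leaf * rel * n.num_tris
        else:                         # internal node
            total += c_inner * rel
            stack.extend((n.left, n.right))
    return total

def unit_leaf(x: float) -> Node:
    """Leaf holding one triangle inside a unit cube starting at x."""
    return Node((x, 0.0, 0.0), (x + 1.0, 1.0, 1.0), num_tris=1)

a, b, c = unit_leaf(0.0), unit_leaf(1.0), unit_leaf(10.0)
good = merge(merge(a, b), c)   # neighbours grouped together
bad = merge(merge(a, c), b)    # distant leaves grouped together
```

Both trees contain the same leaves and the same root box, but `sah_cost(good)` is lower than `sah_cost(bad)` because the inner node of `good` is much smaller. Lowering this cost by changing the tree topology is the kind of local optimization the treelet approach aims for.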
Close the Performance Gap

Basic Idea
- Fast construction of simple BVH
  - Generate leaf for each triangle
- Reduce SAH cost by modifying tree

Treelets
- Allow local tree modification
- A, B, C, F are leaves; D, E, G are internal nodes

Treelet Construction
- Find root: parallel bottom-up traversal
  - Start with leaves
  - Use atomic counter at junctions (internal nodes)
  - Ensures all children have been processed
- Build treelet
  - Add both children of the root
  - Repeatedly expand the treelet leaf with the highest surface area
  - Fixed size: 7 leaf nodes

Rearrange Treelet
- Minimize the SAH cost of the treelet (surface area of its internal nodes; the root's bounds do not change)
- Naive implementation: test each permutation
- Better: dynamic programming
  - Caching of best intermediate results
  - Start with leaves, then pairs, then triplets, …
  - Suboptimal subtree construction avoided
- Parallelizable as well

Results
- Gap closed

Results
- Speed/quality tradeoff

Conclusion
- Use specialized kernels
  - Lower execution divergence
  - (Better use of instruction cache)
  - (Fewer registers used simultaneously)
- Construct acceleration structures quickly
  - But not too quickly

Thanks for your attention!

Logic Kernel
- Does not need a queue, operates on all paths
- If shadow ray was unblocked, add light contribution
- Find material or light source the ray hits
  - Place path into proper material queue
- Russian roulette
  - If path terminated, accumulate to image
  - Place path into new path queue
- Sample light sources (aka next event estimation)

New Path Kernel
- Generate a new image-space sample
- Generate camera ray
  - Place it into extension ray cast queue
- Initialize path state
  - Throughput
  - Pixel position
  - etc.
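The steps of the new path kernel can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the presenters' implementation: the `PathState` fields, the pinhole camera at the origin looking along -z, and the function names are all hypothetical. On a GPU each queue entry would typically be handled by one thread, and appending to the output queue would need an atomic counter; the sequential loop stands in for that.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class PathState:
    """Hypothetical per-path record kept in the path pool."""
    pixel: Tuple[int, int]
    throughput: Vec3
    origin: Vec3
    direction: Vec3

def camera_ray(px: int, py: int, width: int, height: int,
               rng: random.Random) -> Tuple[Vec3, Vec3]:
    """Jittered sample inside pixel (px, py) for an assumed pinhole camera."""
    u = (px + rng.random()) / width
    v = (py + rng.random()) / height
    x = (2.0 * u - 1.0) * (width / height)
    y = 1.0 - 2.0 * v
    inv_len = 1.0 / math.sqrt(x * x + y * y + 1.0)
    return (0.0, 0.0, 0.0), (x * inv_len, y * inv_len, -inv_len)

def new_path_stage(new_path_queue: List[int],
                   paths: List[Optional[PathState]],
                   extension_queue: List[int],
                   next_sample: int, width: int, height: int,
                   rng: random.Random) -> int:
    """Restart every path slot listed in new_path_queue.

    Returns the updated running sample counter.
    """
    for slot in new_path_queue:
        # Generate a new image-space sample (pixels visited in scanline order).
        px = next_sample % width
        py = (next_sample // width) % height
        next_sample += 1
        # Generate the camera ray and initialize the path state.
        origin, direction = camera_ray(px, py, width, height, rng)
        paths[slot] = PathState((px, py), (1.0, 1.0, 1.0), origin, direction)
        # Hand the path over to the extension ray cast stage.
        extension_queue.append(slot)
    new_path_queue.clear()
    return next_sample
```

The queues hold only indices into the path pool, which matches the idea of compact index arrays: the path state itself stays in place and only small integers move between stages.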
Material Kernels
- Generate incoming direction
- Evaluate light contribution based on light sample generated in the logic kernel
  - We haven't cast the shadow ray yet!
- For MIS: p(light sample) from the BSDF
- Discard BSDF stack
- Queue extension ray (and shadow ray)

Ray Cast Kernels
- Extension rays
  - Find first intersection against scene geometry
  - Store hit data into path state
- Shadow rays
  - Blocked or not?
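The difference between the two ray cast kernels can be illustrated with a small Python sketch. Extension rays need the closest hit, while shadow rays only need to know whether anything blocks the segment to the light, so they can stop at the first hit found. To keep the example short, the scene is a list of spheres tested by brute force instead of a BVH traversal; the function names and the sphere scene are assumptions made for this example.

```python
import math
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Sphere = Tuple[Vec3, float]  # (center, radius)

def hit_sphere(origin: Vec3, direction: Vec3, center: Vec3, radius: float,
               t_min: float = 1e-4) -> Optional[float]:
    """Smallest ray parameter t > t_min at which the ray hits the sphere.

    direction is assumed to be normalized.
    """
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    s = math.sqrt(disc)
    for t in (-b - s, -b + s):
        if t > t_min:
            return t
    return None

def cast_extension_ray(origin: Vec3, direction: Vec3,
                       spheres: List[Sphere]) -> Optional[Tuple[int, float]]:
    """Closest hit as (sphere index, t), or None if the ray leaves the scene."""
    best_id, best_t = -1, math.inf
    for i, (center, radius) in enumerate(spheres):
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and t < best_t:
            best_id, best_t = i, t
    return (best_id, best_t) if best_id >= 0 else None

def cast_shadow_ray(origin: Vec3, direction: Vec3, max_t: float,
                    spheres: List[Sphere]) -> bool:
    """True if anything blocks the ray before max_t; stops at the first hit."""
    for center, radius in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and t < max_t:
            return True
    return False

scene = [((0.0, 0.0, -5.0), 1.0), ((0.0, 0.0, -10.0), 2.0)]
hit = cast_extension_ray((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), scene)
blocked = cast_shadow_ray((0.0, 0.0, 0.0), (0.0, 0.0, -1.0), 6.0, scene)
```

In a wavefront renderer the result of `cast_extension_ray` would be stored into the path state for the logic kernel to consume, and the boolean from `cast_shadow_ray` would decide whether the light contribution is added.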