Interactive Ray Tracing: From bad joke to old news
David Luebke
University of Virginia

Besides Parallelization
Besides parallelizing the algorithm, what else can we do to accelerate ray tracing?
– Amortize the cost of shooting rays
– Use ray tracing selectively

Amortize the cost of rays
The Render Cache
– Work by Bruce Walter, currently at Cornell; also by Reinhard et al. (Utah)
– Basic idea:
    Cache ray “hits” as shaded 3D points
    Reproject points for new viewpoint
    Now many pixels already have color!
    Shoot rays for newly uncovered pixels
    Shoot rays to update stale pixels
– Show demo(?)
– Web page w/ good examples, source: http://www.graphics.cornell.edu/research/interactive/rendercache/

Amortize the cost: Tole et al.
Tole et al. extend these ideas to path tracing
– Cache ray hits in object space as Gouraud-shaded vertices
– Designed for very slow sampling schemes (full bidirectional path tracing)
    Pick pixels to sample carefully
    Use OpenGL hardware to display current solution as it is gradually updated
– Show the Tole video

Amortize The Cost: Frameless Rendering
Eliminate frames altogether
– If you can render 1/3 of the pixels in a vertical retrace period:
    Double buffering displays a new frame after 3 vertical refreshes
    Single buffering causes horizontal tearing artifacts
    Frameless rendering updates pixels as soon as they are computed, but computes them in a randomized order to avoid coherent tearing artifacts
– Show the Utah video

Shoot Rays Selectively
Use ray tracing selectively to augment a traditional interactive pipeline
– Ex: use rays for shadows only
– Ex: use ray tracing to calculate corrective textures where necessary (e.g., shiny objects)

Summary So Far
Interactive ray tracing is a reality
– Parker et al. 1999 (SGI supercomputer)
– Wald et al. 2001 (cluster of PCs)
Why IRT?
– Complex/realistic shading
– Big data
– Decoupled sampling

Summary So Far
How IRT?
– Ray tracing is embarrassingly parallel
    Field of VAX/Cray joke
    But memory coherence is a problem
– Brute force: shared-memory supercomputer
– Slightly smarter: distributed cluster
    Fan-in, latency, model sharing are issues
– Amortize cost
    Cache/reuse samples, frameless rendering
– Use selectively
    Shadows only, corrective textures

Moving to Hardware
Next topic: moving ray tracing to the GPU
– Why do this?
Two papers:
– Ray Engine (Carr et al., U. Illinois)
– Ray Tracing On Programmable Graphics Hardware (Purcell et al., Stanford)
    I stole most of the following slides from this talk

Related Work: The Ray Engine
Nathan Carr, Jesse Hall, John Hart (University of Illinois)
Basic idea: use the fragment hardware!
– Ray intersection is a crossbar:
    Intersect a bunch of rays with a bunch of triangles, keep closest hit on each ray
– Triangle rasterization is a crossbar:
    Intersect a bunch of pixels with a bunch of triangles, keep closest hit at each pixel

[Figure: ray-casting crossbar]
(a) Naive: intersect all rays w/ all polys
(b) Acceleration structures break crossbar grid up into a sparse block structure, but blocks are still dense crossbars
(c) Result: a series of points on the crossbar, max 1 per ray (closest wins)

[Figure: rasterization crossbar]
(a) Each pixel potentially intersected with each poly
(b) Modern hardware

Ray Engine
Map ray casting crossbar to rasterization crossbar
– Distribute rays across pixels
    Ray-origins texture
    Ray-directions texture
– Broadcast a stream of triangles as the vertex data interpolated across screen-filling quads
    Quad color: triangle id
    Quad multi-texture coords: triangle vertices a, b, normal n, edges ab, ac, bc
– Output: color = triangle id, alpha = intersect, z = t value

Ray Engine
Bulk of ray tracing computation is intersection
CPU handles bounding volume traversal, recursion, etc.
GPU does ray intersection on bundles of rays and triangles handed to it by CPU
– NV_FENCE to keep both humming
Sometimes the CPU should intersect rays!

Why Ray Tracing?
Global illumination
Good shadows!
– Doom 3 will be using shadow volumes
    Expensive!
– Shadow maps are hard to use and prone to artifacts
Efficient ray tracing based shadows could be the next killer feature for GPUs
[Image: Doom 3 (id Software)]

Why Ray Tracing?
Output-sensitive algorithm
– Sublinear in depth complexity
Interactive on clusters of PCs [Wald et al. 2001] and supercomputers [Parker et al. 1999]
Selective sampling
– Frameless rendering [Bishop et al. 1994]
– Render Cache [Walter et al. 1999]
– Shading Cache [Tole et al. 2002]
[Image: Power Plant (Wald et al. 2001)]

Beyond Moore’s Law
NVIDIA Historicals (courtesy of Kurt Akeley)

Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128           5     -      100     -
1H98    Riva ZX            5    1.0     100    1.0
2H98    Riva TNT           5    1.0     180    3.2
1H99    Riva TNT2          8    1.0     333    3.4
2H99    GeForce           15    3.5     480    2.1
1H00    GeForce2 GTS      25    2.8     666    1.9
2H00    GeForce2 Ultra    31    1.5    1000    2.3
1H01    GeForce3          40    1.7    3200   10.2
1H02    GeForce4          65    1.6    4800    1.5
Average yearly rate             1.8            2.4

Yearly growth well above Moore’s Law (1.5)

Graphics Pipeline
[Figure: Traditional Pipeline (Application, Command, Geometry, Rasterization, Texture, Fragment, Display) next to Programmable Fragment Pipeline (Fragment Input, Textures, Fragment Program, Registers, Fragment Output)]

Contributions
Map complete ray tracer onto GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline
Abstract programmable fragment processor as a stream processor
Map ray tracing to streaming computation
Show that streaming GPU-based ray tracer is competitive with CPU-based ray tracer

Assumptions
Static scenes
Triangle primitives only
Uniform grid acceleration structure

Stream Programming Model
Programmable fragment processor is essentially a stream processor
Kernels and streams
– Stream is a set of data records
– Kernels operate on records
– Streams connect kernels together
– Kernels can read global memory
[Figure: input record stream → kernel → kernel → output record stream, with globals feeding each kernel]

Streaming Ray Tracer (Simplified)
[Pipeline diagram; scene inputs in parentheses]
(Camera) Generate Eye Rays → rays →
(Grid) Traverse Acceleration Structure →
ray-voxel pairs →
(Triangles) Intersect Triangles → hits →
(Materials) Shade Hits and Generate Shading Rays → pixels

Eye Ray Generator
(Camera) Generate Eye Rays → rays
[Figure labels: Camera, Screen, Scene]

Traverser
rays → (Grid) Traverse Acceleration Structure → ray-voxel pairs
[Figure labels: Camera, Screen, Scene]

Intersector
ray-voxel pairs → (Triangles) Intersect Triangles → hits
[Figure labels: Camera, Screen, Scene]

Intersection Code
ray-voxel pairs → (Triangles) Intersect Triangles → hits

    float4 Intersect( float3 ro, float3 rd, int listpos, float4 h )
    {
        // Dependent lookups: triangle id from the list, then its vertices
        float tri_id = texture( listpos, trilist );
        float3 v0 = texture( tri_id, v0 );
        float3 v1 = texture( tri_id, v1 );
        float3 v2 = texture( tri_id, v2 );
        float3 edge1 = v1 - v0;
        float3 edge2 = v2 - v0;
        float3 pvec = Cross( rd, edge2 );
        float det = Dot( edge1, pvec );
        float inv_det = 1/det;
        float3 tvec = ro - v0;
        float u = Dot( tvec, pvec ) * inv_det;
        float3 qvec = Cross( tvec, edge1 );
        float v = Dot( rd, qvec ) * inv_det;
        float t = Dot( edge2, qvec ) * inv_det;
        // determine if valid hit by checking
        // u,v > 0 and u+v < 1
        // set hit data into h based on valid hit
        return float4( {t,u,v,id} );
    }

Ray Tracing on a GPU
Store scene data in texture memory
– Dependent texturing is key
Multipass rendering for flow control
– Branching would eliminate this need

Scene in Texture Memory
[Figure: three texture layouts]
Uniform Grid: 3D luminance texture; each voxel (vox0 … voxM) stores an offset into the triangle list
Triangle List: 1D luminance texture; each voxel's run of entries holds triangle ids
Triangles: 3x 1D RGB textures (v0, v1, v2); each stores one xyz vertex per triangle (tri0 … triN)

Texture As Memory
Currently limited in size: 128MB
– About 3M triangles @ 36 bytes per triangle
Uniform grid
– Maps naturally to 3D textures
– Requires 4 levels of dependent texture lookups
1D textures limited in length
– Emulate larger address space with 2D textures
Want integer addressing, not floating point
– Efficient access without interpolation
Integer arithmetic

Streaming Flow Control
Application and Geometry Stages
Rasterization
Fragments (Input Stream)
Texture (Globals)
Fragment Program (Kernel)
Fragment Program Output (Output Stream)

Multiple Rendering Passes
Pass 1: Generate Eye Rays
– Draw quad, rasterize
– Run fragment program
– Save to offscreen buffer (rays)
Pass 2: Traverse
– Draw quad, rasterize
– Restore (rays), run fragment program
– Save to offscreen buffer (ray-voxel pairs)

Streaming Ray Tracer
(Camera) Generate Eye Rays → (Grid) Traverse Acceleration Structure → (Triangles) Intersect Triangles → (Materials) Shade Hits and Generate Shading Rays

Multipass Optimization
Reduce the number of passes
– Choose to traverse or intersect based on work to be done for each type of pass
    Connection Machine ray tracer [Delany 1988]
    Intersect once 20% of active rays need intersecting
Make each pass less expensive
– Most passes involve only a few rays
– Early fragment kill based on fragment mask
    Saves compute and bandwidth

Scene Statistics
v   14.41   26.11   81.29   130.7   93.93
t    2.52   40.46   34.07   47.90   13.88
s    0.44    1.00    0.96    0.97    0.82
P    2443    1198    1999    2835    1085
v – average number of voxels a ray pierces
t – average triangles a ray intersects
s – average number of shading evaluations per ray
P – number of rendering passes
C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask

Performance Estimates
Pentium III 800 MHz CPU implementation
– 20M intersections/s [Wald et al. 2001]
Simulated performance
– 2G instructions/s and 8GB/s bandwidth
– Instruction limited: 56M intersections/s
– Nearly bandwidth limited: 222M intersections/s
Streaming ray tracing is compute limited!

Demo Analysis
Prototype Performance (ATI R300)
– 500K to 1.4M raycasts/s
– 94M intersections/s
– Only three weeks of coding effort
ATI Radeon 8500 GPU (R200)
– 114M intersections/s [Carr et al.
2002]
– Fixed point operations
– Only ray-triangle intersection kernel

Summary
Programmable GPU is a stream processor
Ray tracing can be a streaming computation
Complete ray tracer can map onto the GPU
– Ray tracing generally thought to be incompatible with the traditional graphics pipeline
Streaming GPU-based ray tracer is competitive with CPU-based ray tracer

Architectural Results
Fragment mask proposed for efficient multipass
– Stream buffer eliminates this need
Stream data should not go through standard texture cache
Triangles cache well for primary rays, secondary less so
Branching architecture
– More cache coherence than the multipass architecture for scene data
– Reduces memory bandwidth for stream data
– But has its own costs…

Final Thoughts
Ray tracing maps into current GPU architecture
– Does not require fundamentally different hardware
– Hybrid algorithms possible
What else can the GPU do?
– Given you can do ray tracing, you can do anything
– Fluid flow, molecular dynamics, etc.
GPU performance increase will continue to outpace CPU performance increase
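
Appendix sketch: ray-triangle hit test
The Intersection Code slide computes t, u and v in the Möller–Trumbore style but leaves the validity test as comments. Below is a minimal CPU-side sketch of that test in Python, for illustration only and not the authors' fragment program. Several things here are assumptions that do not appear in the slides: the Hit tuple and NO_HIT sentinel, the eps threshold for near-parallel rays, passing vertices directly instead of fetching them from textures, and treating h.t as the closest hit so far so that "closest wins".

```python
from typing import NamedTuple, Tuple

Vec3 = Tuple[float, float, float]


class Hit(NamedTuple):
    """Hypothetical hit record mirroring the slide's {t, u, v, id}."""
    t: float      # distance along the ray
    u: float      # barycentric coordinate
    v: float      # barycentric coordinate
    tri_id: int   # -1 means no triangle hit yet


NO_HIT = Hit(float("inf"), 0.0, 0.0, -1)


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(ro: Vec3, rd: Vec3, tri: Tuple[Vec3, Vec3, Vec3],
              tri_id: int, h: Hit = NO_HIT, eps: float = 1e-9) -> Hit:
    """Return the closer of the previous hit h and a hit on tri, if any."""
    v0, v1, v2 = tri
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(rd, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:          # ray is (nearly) parallel to the triangle
        return h
    inv_det = 1.0 / det
    tvec = sub(ro, v0)
    u = dot(tvec, pvec) * inv_det
    qvec = cross(tvec, edge1)
    v = dot(rd, qvec) * inv_det
    t = dot(edge2, qvec) * inv_det
    # Valid hit: inside the triangle, in front of the origin,
    # and closer than whatever h already holds.
    valid = u >= 0.0 and v >= 0.0 and u + v <= 1.0 and 0.0 < t < h.t
    return Hit(t, u, v, tri_id) if valid else h


# Usage: fold a ray over a small triangle list, keeping the closest hit.
near = ((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
far = ((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
closest = NO_HIT
for i, tri in enumerate([far, near]):
    closest = intersect((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), tri, i, closest)
```

Threading the previous hit h through each call matches the slide's signature, where the kernel takes h and returns the updated hit, and it keeps the per-ray state to a single four-component record.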