Interactive Ray Tracing: From bad joke to old news David Luebke

University of Virginia
Besides Parallelization

Besides parallelizing the algorithm, what
else can we do to accelerate ray tracing?
– Amortize the cost of shooting rays
– Use ray tracing selectively
Amortize the cost of rays

The Render Cache
– Work by Bruce Walter, currently at Cornell;
also by Reinhard et al. (Utah)
– Basic idea:



Cache ray “hits” as shaded 3D points
Reproject points for new viewpoint
Now many pixels already have color!


Shoot rays for newly uncovered pixels
Shoot rays to update stale pixels
– Show demo(?)
– Web page w/ good examples, source:
http://www.graphics.cornell.edu/research/interactive/rendercache/
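The reprojection step above can be sketched in a few lines. This is a minimal illustration, not the Render Cache implementation: the `reproject` and `pinhole` names, the dictionary framebuffer, and the toy 4x4 camera are all assumptions made for the example, and the aging of stale pixels is left out.

```python
import math


def reproject(points, project, width, height):
    """Reproject cached shaded points into the frame for a new viewpoint.

    points  : iterable of ((x, y, z), color) pairs in world space
    project : maps a world point to (px, py, depth), or None if not visible
    Returns (image, missing): image maps pixel -> color, and missing lists
    the pixels with no cached point (candidates for newly shot rays).
    """
    depth = {}
    image = {}
    for position, color in points:
        projected = project(position)
        if projected is None:
            continue
        px, py, d = projected
        px, py = math.floor(px), math.floor(py)
        if not (0 <= px < width and 0 <= py < height):
            continue
        # Keep the nearest cached point per pixel (a simple depth test).
        if (px, py) not in depth or d < depth[(px, py)]:
            depth[(px, py)] = d
            image[(px, py)] = color
    missing = [(x, y) for y in range(height) for x in range(width)
               if (x, y) not in image]
    return image, missing


def pinhole(eye_x):
    """Hypothetical camera at (eye_x, 0, 0) looking down -z onto a 4x4 image."""
    def project(p):
        x, y, z = p[0] - eye_x, p[1], p[2]
        if z >= 0:
            return None  # behind the camera
        # Perspective divide, then map [-1, 1] to pixel coordinates [0, 4].
        return ((x / -z + 1) * 2, (y / -z + 1) * 2, -z)
    return project


cache = [((0, 0, -1), "red"), ((0, 0, -2), "blue")]
old_image, old_missing = reproject(cache, pinhole(0), 4, 4)
new_image, new_missing = reproject(cache, pinhole(1), 4, 4)
```

From the first viewpoint both points land on one pixel and the nearer one wins; after the camera moves they separate, and the `missing` list names the pixels that would need fresh rays.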
Amortize the cost:
Tole et al.

Tole et al. extend these ideas to path
tracing
– Cache ray hits in object space as Gouraud-shaded vertices
– Designed for very slow sampling schemes
(full bidirectional path tracing)


Pick pixels to sample carefully
Use OpenGL hardware to display current solution
as it is gradually updated
– Show the Tole video
Amortize The Cost:
Frameless Rendering

Eliminate frames altogether
– If you can render 1/3 of the pixels in a
vertical retrace period:
Double buffering displays a new frame after 3
vertical refreshes
Single buffering causes horizontal tearing artifacts
Frameless rendering updates pixels as soon as they
are computed…
…but computes them in a randomized order to avoid
coherent tearing artifacts

– Show the Utah video
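A rough sketch of the update loop described above, assuming a dictionary framebuffer and a hypothetical `shade(x, y, time)` callback. Visiting pixels one shuffled permutation at a time is an assumption of this example; the slides only say the order is randomized.

```python
import random


def pixel_stream(width, height, rng):
    """Yield pixel coordinates forever, one random permutation at a time."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    while True:
        rng.shuffle(pixels)
        yield from pixels


def refresh(framebuffer, shade, stream, pixels_per_refresh, time):
    """Shade a budget of pixels and write each one straight to the display."""
    for _ in range(pixels_per_refresh):
        x, y = next(stream)
        framebuffer[(x, y)] = shade(x, y, time)  # visible at once, no buffer swap


rng = random.Random(0)
stream = pixel_stream(3, 2, rng)
framebuffer = {}
# Render 1/3 of the pixels per vertical refresh, as in the slide's example.
for time in range(3):
    refresh(framebuffer, lambda x, y, t: t, stream, 2, time)
```

After three refreshes every pixel has been written once, but neighbouring pixels hold samples from different times, which is what trades coherent tearing for incoherent noise.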
Shoot Rays Selectively

Use ray tracing selectively to augment a
traditional interactive pipeline
– Ex: use rays for shadows only
– Ex: Use ray tracing to calculate corrective
textures where necessary (e.g., shiny
objects)
Summary So Far

Interactive ray tracing is a reality
– Parker et al. 1999 (SGI supercomputer)
– Wald et al. 2001 (Cluster of PCs)

Why IRT?
– Complex/realistic shading
– Big data
– Decoupled sampling
Summary So Far

How IRT?
– Ray tracing is embarrassingly parallel


Field of VAX/Cray joke
But memory coherence is a problem
– Brute force: shared-memory supercomputer
– Slightly smarter: distributed cluster

Fan-in, latency, model sharing are issues
– Amortize cost

Cache/reuse samples, frameless rendering
– Use selectively

Shadows only, corrective textures
Moving to Hardware

Next topic: moving ray tracing to the
GPU
– Why do this?

Two papers:
– Ray Engine (Carr et al., U. Illinois)
– Ray Tracing On Programmable Graphics
Hardware (Purcell et al., Stanford)

I stole most of the following slides from this talk
Related Work:
The Ray Engine


Nathan Carr, Jesse Hall, John Hart
(University of Illinois)
Basic idea: use the fragment hardware!
– Ray intersection is a crossbar:

Intersect a bunch of rays with a bunch of triangles,
keep closest hit on each ray
– Triangle rasterization is a crossbar:

Intersect a bunch of pixels with a bunch of
triangles, keep closest hit at each pixel
(a) Naive: intersect all rays w/ all polys
(b) Acceleration structures break crossbar grid up into a sparse block structure, but blocks are still dense crossbars
(c) Result: a series of points on the crossbar, max 1 per ray (closest wins)
(a) Each pixel potentially intersected with each poly
(b) Modern hardware
Modern hardware
Ray Engine

Map ray casting crossbar to rasterization
crossbar
– Distribute rays across pixels


Ray-origins texture
Ray-directions texture
– Broadcast a stream of triangles as the vertex
data interpolated across screen-filling quads


Quad color → Triangle id
Quad multi-texture coords:

Triangle vertices a,b, normal n, edges ab, ac, bc
– Output:

Color = triangle id, alpha = intersect, z = t value
Ray Engine



Bulk of ray tracing computation is
intersection
CPU handles bounding volume traversal,
recursion, etc
GPU does ray-intersection on bundles of
rays and triangles handed to it by CPU
– NV_FENCE to keep both humming

Sometimes the CPU should intersect
rays!
Why Ray Tracing?


Global illumination
Good shadows!
– Doom 3 will be using shadow volumes

Expensive!
– Shadow maps are hard to use and prone to artifacts

Efficient ray tracing based shadows could be the next killer feature for GPUs
Doom 3 [id Software]
Why Ray Tracing?

Output-sensitive algorithm
– Sublinear in depth complexity


Interactive on clusters of PCs [Wald et al. 2001] and supercomputers [Parker et al. 1999]
Selective sampling
– Frameless rendering
[Bishop et al. 1994]
– Render Cache
[Walter et al. 1999]
– Shading Cache
[Tole et al. 2002]
Power Plant
[Wald et al. 2001]
Beyond Moore’s Law
NVIDIA Historicals (courtesy of Kurt Akeley)

Season  Product         MT/s  Yr rate  MF/s  Yr rate
2H97    Riva 128           5     -      100     -
1H98    Riva ZX            5    1.0     100    1.0
2H98    Riva TNT           5    1.0     180    3.2
1H99    Riva TNT2          8    1.0     333    3.4
2H99    GeForce           15    3.5     480    2.1
1H00    GeForce2 GTS      25    2.8     666    1.9
2H00    GeForce2 Ultra    31    1.5    1000    2.3
1H01    GeForce3          40    1.7    3200   10.2
1H02    GeForce4          65    1.6    4800    1.5
Overall yearly rate             1.8            2.4

Yearly growth well above Moore’s Law (1.5)
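The overall yearly rates can be checked with a short calculation. The sketch assumes the "Yr rate" figures are compound growth factors, which is a reading of the NVIDIA numbers above rather than something the slide states; `yearly_rate` is a name made up for this example.

```python
def yearly_rate(old, new, years):
    """Compound yearly growth factor that takes `old` to `new` in `years`."""
    return (new / old) ** (1.0 / years)


# 2H97 (Riva 128) to 1H02 (GeForce4) spans 4.5 years.
mt_rate = round(yearly_rate(5, 65, 4.5), 1)       # MT/s column
mf_rate = round(yearly_rate(100, 4800, 4.5), 1)   # MF/s column

# A single half-year step: GeForce2 Ultra (1000 MF/s) to GeForce3 (3200 MF/s).
geforce3_rate = round(yearly_rate(1000, 3200, 0.5), 1)
```

Under that assumption the calculation gives 1.8 and 2.4 for the two columns and 10.2 for the GeForce3 step, matching the slide.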
Graphics Pipeline
[Figure: Traditional Pipeline (Application, Command, Geometry, Rasterization, Fragment, Display) next to the Programmable Fragment Pipeline (Fragment Input, Fragment Program with Registers and Textures, Fragment Output)]
Contributions

Map complete ray tracer onto GPU
– Ray tracing generally thought to be incompatible with
the traditional graphics pipeline



Abstract programmable fragment processor as a
stream processor
Map ray tracing to streaming computation
Show that streaming GPU-based ray tracer is
competitive with CPU-based ray tracer
Assumptions



Static scenes
Triangle primitives only
Uniform grid acceleration structure
Stream Programming Model
Programmable fragment processor is
essentially a stream processor
Kernels and streams
– Stream is a set of data records
– Kernels operate on records
– Streams connect kernels together
– Kernels can read global memory
[Figure: input record stream → kernel → output record stream; each kernel reads globals]
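A toy version of this model, to make the vocabulary concrete. Everything here is illustrative: `run_kernel`, `scale`, and `clamp_low` are made-up names, and the one-output-per-input restriction is an assumption that mirrors a fragment program writing one result per fragment.

```python
def run_kernel(kernel, stream, global_memory):
    """Apply `kernel` to every record; one output record per input record."""
    return [kernel(record, global_memory) for record in stream]


def scale(record, g):
    # Kernels may read global memory but only write their own output record.
    return record * g["factor"]


def clamp_low(record, g):
    return max(record, g["floor"])


global_memory = {"factor": 10, "floor": 0}
records = [-2, 1, 3]

# The output stream of one kernel is the input stream of the next.
scaled = run_kernel(scale, records, global_memory)
result = run_kernel(clamp_low, scaled, global_memory)
```

Because no kernel touches another record's data, every record could be processed in parallel, which is the property the fragment hardware exploits.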
Streaming Ray Tracer (Simplified)
Generate Eye Rays (reads Camera) → rays
Traverse Acceleration Structure (reads Grid) → ray-voxel pairs
Intersect Triangles (reads Triangles) → hits
Shade Hits and Generate Shading Rays (reads Materials) → pixels
Eye Ray Generator
Camera → Generate Eye Rays → rays
[Figure: camera, screen, and scene]
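One way the eye ray kernel could look, written per pixel so that each call depends only on its own pixel coordinates and the camera "globals". The pinhole camera looking down -z, the `fov_deg` parameter, and the function names are assumptions of this sketch; the slides do not give the camera model.

```python
import math


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def eye_ray(px, py, width, height, eye, fov_deg):
    """Ray through the centre of pixel (px, py) for a camera looking down -z."""
    aspect = width / height
    half = math.tan(math.radians(fov_deg) / 2)
    # Map the pixel centre to [-1, 1] screen coordinates (y points up).
    x = (2 * (px + 0.5) / width - 1) * half * aspect
    y = (1 - 2 * (py + 0.5) / height) * half
    return eye, normalize((x, y, -1.0))


# One record of the output "rays" stream per pixel.
rays = [eye_ray(px, py, 4, 4, (0.0, 0.0, 0.0), 90.0)
        for py in range(4) for px in range(4)]
```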
Traverser
rays + Grid → Traverse Acceleration Structure → ray-voxel pairs
[Figure: camera, screen, and scene]
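For a uniform grid, a common way to produce ray-voxel pairs is an incremental 3D DDA walk. The sketch below assumes that approach, a ray origin already inside the grid, and cubic cells starting at the world origin; none of these details come from the slides.

```python
import math


def traverse(origin, direction, grid_size, cell=1.0):
    """Yield the voxels pierced by a ray that starts inside the grid.

    The grid spans [0, grid_size * cell) on every axis.
    """
    if not any(direction):
        raise ValueError("direction must be non-zero")
    voxel = [int(math.floor(o / cell)) for o in origin]
    step, t_max, t_delta = [], [], []
    for axis in range(3):
        d = direction[axis]
        if d > 0:
            step.append(1)
            t_max.append(((voxel[axis] + 1) * cell - origin[axis]) / d)
            t_delta.append(cell / d)
        elif d < 0:
            step.append(-1)
            t_max.append((voxel[axis] * cell - origin[axis]) / d)
            t_delta.append(-cell / d)
        else:
            # The ray never crosses a boundary on this axis.
            step.append(0)
            t_max.append(math.inf)
            t_delta.append(math.inf)
    while all(0 <= voxel[a] < grid_size for a in range(3)):
        yield tuple(voxel)
        axis = t_max.index(min(t_max))  # nearest boundary crossing
        voxel[axis] += step[axis]
        t_max[axis] += t_delta[axis]


straight = list(traverse((0.5, 0.5, 0.5), (1, 0, 0), 3))
diagonal = list(traverse((0.5, 0.25, 0.5), (1, 1, 0), 2))
```

In a multipass GPU setting each pass would advance every active ray by one such step, since the fragment program cannot loop.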
Intersector
ray-voxel pairs + Triangles → Intersect Triangles → hits
[Figure: camera, screen, and scene]
Intersection Code
ray-voxel pairs + Triangles → Intersect Triangles → hits
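float4 Intersect( float3 ro, float3 rd,
                  int listpos, float4 h ) {
    // dependent texture reads: triangle id first, then its three vertices
    // (v0_tex, v1_tex, v2_tex name the vertex textures to avoid shadowing)
    float tri_id = texture( listpos, trilist );
    float3 v0 = texture( tri_id, v0_tex );
    float3 v1 = texture( tri_id, v1_tex );
    float3 v2 = texture( tri_id, v2_tex );
    float3 edge1 = v1 - v0;
    float3 edge2 = v2 - v0;
    float3 pvec = Cross( rd, edge2 );
    float det = Dot( edge1, pvec );
    float inv_det = 1/det;
    float3 tvec = ro - v0;
    float u = Dot( tvec, pvec ) * inv_det;
    float3 qvec = Cross( tvec, edge1 );
    float v = Dot( rd, qvec ) * inv_det;
    float t = Dot( edge2, qvec ) * inv_det;
    // valid hit: u,v >= 0 and u+v <= 1, in front of the ray origin,
    // and closer than the hit already in h (assumed layout: t,u,v,id)
    bool valid = (u >= 0) && (v >= 0) && (u + v <= 1)
              && (t >= 0) && (t < h.x);
    // set hit data based on valid hit, otherwise keep the old hit
    return valid ? float4( t, u, v, tri_id ) : h;
}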
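For readers who want to run the test outside a shader, here is a plain Python transcription of the same edge and determinant arithmetic. The function signature, the `eps` guard against a zero determinant, and returning `None` on a miss are choices made for this sketch, not part of the slide's code.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def intersect(ro, rd, v0, v1, v2, eps=1e-9):
    """Return (t, u, v) for a ray/triangle hit, or None on a miss."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(rd, edge2)
    det = dot(edge1, pvec)
    if abs(det) < eps:
        return None  # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    tvec = sub(ro, v0)
    u = dot(tvec, pvec) * inv_det
    qvec = cross(tvec, edge1)
    v = dot(rd, qvec) * inv_det
    t = dot(edge2, qvec) * inv_det
    # Valid hit: barycentric coordinates inside the triangle, t in front.
    if u < 0 or v < 0 or u + v > 1 or t < 0:
        return None
    return (t, u, v)


triangle = ((0, 0, 0), (1, 0, 0), (0, 1, 0))
hit = intersect((0.25, 0.25, 1), (0, 0, -1), *triangle)
miss = intersect((2, 2, 1), (0, 0, -1), *triangle)
```

The ray aimed at the triangle's interior hits at distance 1 with barycentric coordinates (0.25, 0.25); the ray aimed outside it returns `None`.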
Ray Tracing on a GPU

Store scene data in texture memory
– Dependent texturing is key

Multipass rendering for flow control
– Branching would eliminate this need
Scene in Texture Memory
Uniform Grid: 3D Luminance Texture
– One entry per voxel (vox0 … voxM), holding an offset into the triangle list (e.g. 0, 4, 11, 38, …, 564)
Triangle List: 1D Luminance Texture
– Per-voxel runs of triangle indices (vox0, vox2, …)
Triangles: 3x 1D RGB Textures
– v0, v1, v2 textures, each holding one xyz vertex per triangle (tri0 … triN)
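The chain of lookups (voxel, then triangle list, then vertices) can be mimicked with ordinary Python containers. The tiny scene below is invented, and ending each voxel's run with a -1 sentinel is an assumption of this sketch; the slide does not say how runs are terminated.

```python
# Grid "texture": voxel -> offset of its run in the triangle list.
grid = {(0, 0, 0): 0, (1, 0, 0): 3}

# Triangle-list "texture": runs of triangle ids, each ended by -1 (assumed).
tri_list = [0, 2, -1, 1, -1]

# Three vertex "textures", indexed by triangle id.
v0 = [(0, 0, 0), (1, 0, 0), (0, 0, 1)]
v1 = [(1, 0, 0), (2, 0, 0), (1, 0, 1)]
v2 = [(0, 1, 0), (1, 1, 0), (0, 1, 1)]


def triangles_in_voxel(voxel):
    """Yield (triangle id, vertices) for every triangle listed in a voxel."""
    pos = grid[voxel]                # 1st lookup: grid
    while tri_list[pos] != -1:
        tri_id = tri_list[pos]       # 2nd lookup: depends on the 1st
        # 3rd lookups: vertex fetches depend on the 2nd
        yield tri_id, (v0[tri_id], v1[tri_id], v2[tri_id])
        pos += 1


ids_first = [tri_id for tri_id, _ in triangles_in_voxel((0, 0, 0))]
ids_second = [tri_id for tri_id, _ in triangles_in_voxel((1, 0, 0))]
```

Each fetch uses the result of the previous one as its address, which is why dependent texturing is the key hardware feature.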
Texture As Memory

Currently limited in size - 128MB
– About 3M triangles @ 36 bytes per triangle

Uniform grid
– Maps naturally to 3D textures
– Requires 4 levels of dependent texture lookups

1D textures limited in length
– Emulate larger address space with 2D textures

Want integer addressing – not floating point
– Efficient access without interpolation

Integer arithmetic
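Emulating a long 1D address space with a 2D texture comes down to a divide and a remainder. A minimal sketch, with a nested list standing in for the texture and `to_2d` and `fetch` as made-up helper names:

```python
def to_2d(address, width):
    """Map a linear address to (column, row) in a texture `width` texels wide."""
    return address % width, address // width


def fetch(texture_rows, address, width):
    column, row = to_2d(address, width)
    return texture_rows[row][column]


# A 4-texel-wide "texture" storing the values 0..7 in address order.
texture_rows = [[0, 1, 2, 3],
                [4, 5, 6, 7]]
value = fetch(texture_rows, 6, 4)
```

On hardware with only floating-point addressing, the divide and remainder must themselves be built from float operations, which is one reason the slide asks for integer arithmetic.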
Streaming Flow Control
Application and Geometry Stages → Rasterization
Fragments (Input Stream) → Fragment Program (Kernel) → Fragment Program Output (Output Stream)
Texture (Globals) is read by the Fragment Program (Kernel)
Multiple Rendering Passes
Pass 1: Generate Eye Rays
– Draw quad, rasterize
– Run fragment program
– Save to offscreen buffer (rays)
Pass 2: Traverse
– Draw quad, rasterize
– Restore (rays), run fragment program
– Save to offscreen buffer (ray-voxel pairs)
Streaming Ray Tracer
Generate Eye Rays (reads Camera)
→ Traverse Acceleration Structure (reads Grid)
→ Intersect Triangles (reads Triangles)
→ Shade Hits and Generate Shading Rays (reads Materials)
Multipass Optimization

Reduce the number of passes
– Choose to traverse or intersect based on
work to be done for each type of pass



Connection Machine ray tracer [Delany 1988]
Intersect once 20% of active rays need intersecting
Make each pass less expensive
– Most passes involve only a few rays
– Early fragment kill based on fragment mask

Saves compute and bandwidth
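The pass-selection rule and the fragment mask can be sketched as follows. The state names and both function names are hypothetical, and shading passes are left out; only the 20% threshold comes from the slide.

```python
TRAVERSE, INTERSECT, DONE = "traverse", "intersect", "done"


def choose_pass(states, threshold=0.2):
    """Pick the kernel for the next pass, or None when every ray is done."""
    active = [s for s in states if s != DONE]
    if not active:
        return None
    waiting = sum(1 for s in active if s == INTERSECT)
    # Intersect once enough of the active rays are waiting for it.
    if waiting == len(active) or waiting >= threshold * len(active):
        return INTERSECT
    return TRAVERSE


def fragment_mask(states, chosen):
    """True where the fragment program runs; other fragments are killed early."""
    return [s == chosen for s in states]


few_waiting = choose_pass([TRAVERSE] * 9 + [INTERSECT])
many_waiting = choose_pass([TRAVERSE] * 7 + [INTERSECT] * 3)
mask = fragment_mask([TRAVERSE, INTERSECT, DONE], INTERSECT)
```

With 1 of 10 rays waiting the scheduler keeps traversing; with 3 of 10 it switches to an intersection pass, and the mask lets the hardware skip every ray that is in a different state.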
Scene Statistics
Scene (slide order)      v       t      s      P
1                    14.41    2.52   0.44   2443
2                    26.11   40.46   1.00   1198
3                    81.29   34.07   0.96   1999
4                    130.7   47.90   0.97   2835
5                    93.93   13.88   0.82   1085

v – average number of voxels a ray pierces
t – average triangles a ray intersects
s – average number of shading evaluations per ray
P – number of rendering passes
C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask
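The cost formula is easy to evaluate directly. In the sketch below the symbol meanings are an interpretation, not something the slide defines: R is read as the number of rays, Cr, Cv, Ct, and Cs as the costs of generating a ray, traversing one voxel, intersecting one triangle, and shading once, and Cmask as the per-ray cost of the fragment mask in each pass. The unit costs in the example are placeholders, not measurements.

```python
def total_cost(R, v, t, s, P, Cr, Cv, Ct, Cs, Cmask):
    """Evaluate C = R*(Cr + v*Cv + t*Ct + s*Cs) + R*P*Cmask."""
    per_ray = Cr + v * Cv + t * Ct + s * Cs
    return R * per_ray + R * P * Cmask


# First scene row (v=14.41, t=2.52, s=0.44, P=2443) with placeholder costs.
example = total_cost(R=1_000_000, v=14.41, t=2.52, s=0.44, P=2443,
                     Cr=1.0, Cv=1.0, Ct=1.0, Cs=1.0, Cmask=0.01)

# With these placeholder costs, the share of the total that the mask term
# contributes (it grows with the number of passes P).
mask_share = (1_000_000 * 2443 * 0.01) / example
```

The second term is why the pass count matters: every pass touches every ray, even when only a few rays have work to do.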
Performance Estimates

Pentium III 800 MHz CPU implementation
– 20M intersections/s [Wald et al. 2001]

Simulated performance
– 2G instructions/s and 8GB/s bandwidth
– Instruction limited: 56M intersections/s
– Nearly bandwidth limited: 222M intersections/s

Streaming ray tracing is compute limited!
Demo Analysis

Prototype Performance (ATI R300)
– 500K – 1.4M raycast/s
– 94M intersections/s
– Only three weeks of coding effort

ATI Radeon 8500 GPU (R200)
– 114M intersections/s [Carr et al. 2002]
– Fixed point operations
– Only ray-triangle intersection kernel
Summary



Programmable GPU is a stream processor
Ray tracing can be a streaming
computation
Complete ray tracer can map onto the
GPU
– Ray tracing generally thought to be
incompatible with the traditional graphics
pipeline

Streaming GPU-based ray tracer is
competitive with CPU-based ray tracer
Architectural Results

Fragment mask proposed for efficient multipass
– Stream buffer eliminates this need



Stream data should not go through standard
texture cache
Triangles cache well for primary rays, secondary
less so
Branching architecture
– More cache coherence than the multipass architecture
for scene data
– Reduces memory bandwidth for stream data
– But has its own costs…
Final Thoughts

Ray tracing maps into current GPU
architecture
– Does not require fundamentally different
hardware
– Hybrid algorithms possible

What else can the GPU do?
– Given you can do ray tracing, you can do
anything
– Fluid flow, molecular dynamics, etc.

GPU performance increase will continue
to outpace CPU performance increase