slides

advertisement
Real-time Ray Tracing on GPU
with BVH-based Packet Traversal
Stefan Popov, Johannes Günther, HansPeter Seidel, Philipp Slusallek
Stefan Popov
High Performance GPU Ray Tracing
Background
 GPUs attractive for ray tracing
 High computational power
 Shading oriented architecture
 GPU ray tracers





Carr – the ray engine
Purcell – Full ray tracing on the GPU, based on grids
Ernst – KD trees with parallel stack
Carr, Thrane & Simonsen – BVH
Foley, Horn, Popov – KD trees - stackless traversal
Stefan Popov
High Performance GPU Ray Tracing
Motivation
 So far
 Interactive RT on GPU, but
 Limited model size
 No dynamic scene support
 The G80 – new approach to the GPU
 High performance general purpose processor with
graphics extensions
 PRAM architecture
 BVH allow for
 Dynamic/deformable scenes
 Small memory footprint
 Goal: Recursive ordered traversal of BVH on the G80
Stefan Popov
High Performance GPU Ray Tracing
GPU Architecture (G80)
 Multi-threaded scalar architecture
…
 16 (multi-)cores
…
Thread 32
Thread 32
 Off-chip memory ops
 Instruction dependencies
 4 or 16 cycles to issue instr.
Thread 1
Thread 1
IP
 Threads cover latencies
Multi-Core 16
Multi-Core 1
IP
 12K HW threads
…
Chunk Pool
Chunk Pool
Thread 1
…
Thread 1
…
Thread 1
…
Thread 1
Thread 32
Thread 32
Thread 32
Thread 32
Thread 1
…
Thread 1
…
Thread 1
…
Thread 1
Thread` 32
Thread 32
Thread` 32
Thread 32
 8-wide SIMD
 128 scalar cores in total
 Cores process threads in 32 wide
SIMD chunks
Stefan Popov
…
High Performance GPU Ray Tracing
…
GPU Architecture (G80)
 Scalar register file (8K)
 Shared memory (16KB)
 On-chip, 0 cycle latency
 On-board memory (768MB)
 Large latency (~ 200 cycles)
 R/W from within thread
 Un-cached
Multi-Core 16
Thread 1
Registers
Thread 32
Registers
…
Shared
Memory
 Partitioned among running
threads
Multi-Core 1
…
L2 Cache (128KB)
On-board memory
 Read-only L2 cache (128KB)
 On chip, shared among all
threads
Stefan Popov
High Performance GPU Ray Tracing
Programming the G80
 CUDA
 C based language with parallel extensions
 GPU utilization at 100% only if
 Enough threads are present (>> 12K)
 Every thread uses less than 10 registers and 5
words (32 bit) of shared memory
 Enough computations per transferred word of data
 Bandwidth << computational power
 Adequate memory access pattern to allow read
combining
Stefan Popov
High Performance GPU Ray Tracing
Performance Bottlenecks
 Efficient per-thread stack implementation
 Shared memory too small – will limit parallelism
 On-board memory – uncached
 Need enough computations between stack ops
 Efficient memory access pattern
 Use texture caches
 However, only few words of cache / thread
 Read successive memory locations in successive
threads of a chunk
 Single roundtrip to memory (read combining)
 Cover latency with enough computations
Stefan Popov
High Performance GPU Ray Tracing
Ray Tracing on the G80
 Map each ray to one thread
 Enough threads to keep the GPU busy
 Recursive ray tracing
 Use per-thread stack stored on on-board memory
 Efficient, since enough computations are present
 But how to do the traversal ?
 Skip pointers (Thrane) – no ordered traversal
 Geometric images (Carr) – single mesh only
 Shared stack traversal
Stefan Popov
High Performance GPU Ray Tracing
SIMD Packet Traversal of BVH
 Traverse a node with the whole packet
 At an internal node:
 Intersect all rays with both children and determine
traversal order
 Push far child (if any) on a stack and descend to the
near one with the packet
 At a leaf:
 Intersect all rays with contained geometry
 Pop next node to visit from the stack
Stefan Popov
High Performance GPU Ray Tracing
PRAM Basics
 The PRAM model
false
true
 Implicitly synchronized
processors (threads)
 Shared memory
between all processors
 Basic PRAM operations
 Parallel OR in O(1)
 Parallel reduction in
O(log N)
true
false
true
12
32
11
+
44
20
9
+
11
9
11
9
+
64
Stefan Popov
false
20
High Performance GPU Ray Tracing
PRAM Packet Traversal of BVH
 The G80 – PRAM machine on chunk level
 Map packet  chunk, ray  thread
 Threads behave as in the single ray traversal
 At leaf: Intersect with geometry. Pop next node
from stack
 At node: Decide which children to visit and in what
order. Push far child
 Difference:
 How rays choose which node to visit first
 Might not be the one they want to
Stefan Popov
High Performance GPU Ray Tracing
PRAM Packet Traversal of BVH
 Choose child traversal order
 PRAM OR to determine if all rays agree on visiting
the same node first
 The result is stored in shared memory
 In case of divergence: choose child with more ray
candidates
 Use PRAM SUM on +/- 1 for each thread, -1  left node
 Look at result’s sign
 Guarantees synchronous traversal of BVH
Stefan Popov
High Performance GPU Ray Tracing
PRAM Packet Traversal of BVH
 Stack:
 Near & far child – the same for all threads => store
once
 Keep stack in shared memory. Only few bits per
thread!
 Only Thread 0 does all stack ops.
 Reading data:
 All threads work with the same node / triangle
 Sequential threads bring in sequential words
 Single load operation. Single round trip to memory
 Implementable in CUDA
Stefan Popov
High Performance GPU Ray Tracing
Results
Scene
#Tris
FPS Primary 1K2
FPS Shading 1K2
Conference
282K
16 (19)
6.1
Conference (with ropes)
282K
16.7
6.7
Soda Hall
2.1M
13.6 (16.2)
5.7
Power Plant – Outside
12.7M
6.4
2.9
Power Plant – Furnace
12.7M
–
1.9
Stefan Popov
High Performance GPU Ray Tracing
Analysis
 Coherent branch decisions / memory access
 Small footprint of the data structure
 Can trace up to 12 million triangle models
 Program becomes compute bound
 Determined by over/under-clocking the core/memory
 No frustums required
 Good for secondary rays, bad for primary
 Can use rasterization for primary rays
 Implicit SIMD – easy shader programming
 Running on a GPU – shading “for free”
Stefan Popov
High Performance GPU Ray Tracing
Dynamic Scenes
 Update parts / whole BVH and geometry on
GPU
 Use GPU for RT and CPU for BVH construction /
refitting
 Construct BVH using binning
 Similar to Wald RT07 / Popov RT06
 Bin all 3 dimensions using SIMD
 Results in > 10% better trees
 Measured as SAH quality, not FPS
 Speed loss is almost negligible
Stefan Popov
High Performance GPU Ray Tracing
Results
Scene
#Tris
Exact SAH
Binning 1D
Binning 3D
Speed
Speed
Quality
Speed
Quality
Conference
282K
0.8 s
0.15 s
92.5%
0.2 s
99.4%
Soda Hall
2.1M
8.78 s
1.28 s
103.5%
1.59 s
101.6%
Power Plant
12.7M
119 s
6.6 s
99.4%
8.1 s
100.5%
Boeing
348M
5605 s
572 s
94.8%
667 s
98.1 %
Stefan Popov
High Performance GPU Ray Tracing
Conclusions
 New recursive PRAM BVH traversal algorithm
 Very well suited for the new generation of GPUs
 No additional pre-computed data required
 First GPU ray tracer to handle large models
 Previous implementations were limited to < 300K
 Can handle dynamic scenes
 By using the CPU to update the geometry / BVH
Stefan Popov
High Performance GPU Ray Tracing
Future Work
 More features
 Shaders, adaptive anti-aliasing, …
 Global illumination
 Code optimizations
 Current implementation uses too many registers
Stefan Popov
High Performance GPU Ray Tracing
Thank you!
Stefan Popov
High Performance GPU Ray Tracing
CUDA Hello World
__global__ void addArrays(int *arr1, int *arr2)
{
unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
arr1[t] += arr2[t];
}
int main()
{
int *inArr1 = malloc(4194304), *inArr2 = malloc(4194304);
int *ta1, *ta2;
cudaMalloc((void**)&ta1, 4194304); cudaMalloc((void**)&ta2, 4194304);
for(int i = 0; i < 4194304; i++)
{ inArr1[i] = rand(); inArr2[i] = rand(); }
cudaMemcpy(ta1, inArr1, 4194304, cudaMemcpyHostToDevice);
cudaMemcpy(ta2, inArr2, 4194304, cudaMemcpyHostToDevice);
addArrays<<<dim3(4194304 / 512, 1, 1),
dim3(512, 1, 1)>>>(ta1, ta2);
cudaMemcpy(inArr1, ta1, 4194304, cudaMemcpyDeviceToHost);
for(int i = 0; i < 4194304; i++)
printf("%d ", inArr1[i]);
return 0;
}
Stefan Popov
High Performance GPU Ray Tracing
Download