A Multigrid Solver for Boundary Value Problems
Using Programmable Graphics Hardware
Nolan Goodnight Cliff Woolley Gregory Lewin
David Luebke Greg Humphreys
University of Virginia
Graphics Hardware 2003
July 26-27 – San Diego, CA
General-Purpose GPU Programming

Why do we port algorithms to the GPU?

How much faster can we expect it to be, really?

What is the challenge in porting?
Case Study
Problem: Implement a Boundary Value
Problem (BVP) solver using the GPU
Could benefit an entire class of scientific and
engineering applications, e.g.:

Heat transfer

Fluid flow
Related Work

Krüger and Westermann: Linear Algebra Operators
for GPU Implementation of Numerical Algorithms

Bolz et al.: Sparse Matrix Solvers on the GPU:
Conjugate Gradients and Multigrid

Very similar to our system

Developed concurrently

Complementary approach
Driving problem: Fluid mechanics sim
Problem domain is a warped disc:
[Figure: warped disc domain alongside a regular grid]
BVPs: Background

Boundary value problems are sometimes governed
by PDEs of the form:
L = f
L is some operator
 is the problem domain
f is a forcing function (source term)
Given L and f, solve for .
BVPs: Example
Heat Transfer

Find a steady-state temperature distribution T in a
solid of thermal conductivity k with thermal source S

This requires solving a Poisson equation of the form:
k2T = -S

This is a BVP where L is the Laplacian operator 2
All our applications require a Poisson solver.
BVPs: Solving

Most such problems cannot be solved analytically

Instead, discretize onto a grid to form a set of linear
equations, then solve:

Direct elimination

Gauss-Seidel iteration

Conjugate-gradient

Strongly implicit procedures

Multigrid method
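To make the discretization step concrete, here is a minimal sketch of one listed option, Gauss-Seidel iteration, applied to a Poisson equation ∇²u = f on a square grid. It is a CPU illustration only, not the solver from the talk. The function name, the zero boundary values, the unit square domain and the sweep count are all assumptions made for the example.

```python
def gauss_seidel_poisson(f, h, sweeps):
    """Approximate the solution of laplacian(u) = f on an n x n grid.

    Assumptions for this sketch: u = 0 on the boundary, uniform spacing h.
    """
    n = len(f)
    u = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                # 5-point stencil: (sum of 4 neighbors - 4*u_ij) / h^2 = f_ij,
                # solved for u_ij using the newest neighbor values.
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  - h * h * f[i][j])
    return u


# Heat-transfer flavored usage: k = 1 and uniform source S = 1, so f = -S.
n = 9
h = 1.0 / (n - 1)
f = [[-1.0] * n for _ in range(n)]
T = gauss_seidel_poisson(f, h, sweeps=200)
print(round(T[n // 2][n // 2], 4))  # temperature at the center of the plate
```

Each sweep updates every interior point once. Plain relaxation like this needs many sweeps on large grids, which is the cost multigrid is meant to reduce.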
Multigrid method

Iteratively corrects an approximation to the solution

Operates at multiple grid resolutions

Low-resolution grids are used to correct higher-resolution grids recursively

Very fast, especially for large grids: O(n)
Multigrid method

Use coarser grid levels to recursively correct an
approximation to the solution

Algorithm:

smooth

residual

restrict


recurse
interpolate
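Residual: Lφ_i - f
[Figure: 3x3 kernel stencils]
Laplacian operator L: center -4; north, south, east, west neighbors 1
Restriction:
1/16 1/8 1/16
1/8 1/4 1/8
1/16 1/8 1/16
Interpolation:
1/4 1/2 1/4
1/2 1 1/2
1/4 1/2 1/4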
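The five steps (smooth, residual, restrict, recurse, interpolate) can be sketched as a recursive V-cycle. To stay short, this sketch is 1D rather than 2D and solves u'' = f with zero boundary values on grids of 2^k + 1 points. All function names are hypothetical. The residual is taken as r = f - Lu (the opposite sign of Lφ - f) so that the coarse-grid correction can simply be added. The 1D weights (1/4, 1/2, 1/4) and (1/2, 1, 1/2) are the one-dimensional analogues of the usual 3x3 full-weighting and bilinear stencils.

```python
import math


def smooth(u, f, h, sweeps=2):
    """A few Gauss-Seidel sweeps for u'' = f, updating u in place."""
    for _ in range(sweeps):
        for i in range(1, len(u) - 1):
            u[i] = 0.5 * (u[i - 1] + u[i + 1] - h * h * f[i])


def residual(u, f, h):
    """r = f - L u, with L the 3-point second-difference operator."""
    r = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        r[i] = f[i] - (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h)
    return r


def restrict(r):
    """Full weighting (1/4, 1/2, 1/4) onto the next coarser grid."""
    nc = (len(r) - 1) // 2 + 1
    rc = [0.0] * nc
    for i in range(1, nc - 1):
        rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
    return rc


def interpolate(ec, n_fine):
    """Linear interpolation of a coarse-grid correction to the fine grid."""
    e = [0.0] * n_fine
    for i in range(len(ec)):
        e[2 * i] = ec[i]            # coincident points copy directly
    for i in range(1, n_fine, 2):
        e[i] = 0.5 * (e[i - 1] + e[i + 1])  # in-between points average
    return e


def v_cycle(u, f, h):
    smooth(u, f, h)
    if len(u) > 3:                  # recurse until one interior point remains
        rc = restrict(residual(u, f, h))
        ec = [0.0] * len(rc)
        v_cycle(ec, rc, 2.0 * h)    # approximately solve L e = r on the coarse grid
        e = interpolate(ec, len(u))
        for i in range(len(u)):
            u[i] += e[i]
    smooth(u, f, h)
    return u


# Usage: u'' = -pi^2 sin(pi x) has the exact solution u = sin(pi x).
n = 65
h = 1.0 / (n - 1)
xs = [i * h for i in range(n)]
f = [-math.pi ** 2 * math.sin(math.pi * x) for x in xs]
u = [0.0] * n
for _ in range(10):
    v_cycle(u, f, h)
err = max(abs(u[i] - math.sin(math.pi * xs[i])) for i in range(n))
print(err)  # remaining error is dominated by the discretization, not the solver
```

Each level does work proportional to its grid size, and the sizes shrink geometrically, so a full cycle costs O(n).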
Implementation
For each step of the algorithm:

Bind as texture maps the buffers that contain the
necessary data

Set the target buffer for rendering

Activate a fragment program that performs the
necessary kernel computation

Render a grid-sized quad with multitexturing
[Figure: source buffer (texture) -> fragment program -> render target buffer]
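The pass structure above can be emulated on the CPU to show the data flow. This is an analogy only: no graphics API is called. The names `render_pass` and `jacobi_kernel`, the dictionary standing in for bound textures, and the choice of a Jacobi smoothing kernel are all assumptions made for illustration.

```python
def render_pass(fragment_program, textures, target):
    """Emulate rendering a grid-sized quad: run the kernel once per cell."""
    for y in range(len(target)):
        for x in range(len(target[0])):
            target[y][x] = fragment_program(textures, x, y)


def jacobi_kernel(textures, x, y):
    """Hypothetical smoothing kernel; reads only from the bound 'textures'."""
    u, f = textures["solution"], textures["source"]
    n = len(u)
    if x in (0, n - 1) or y in (0, n - 1):
        return u[y][x]  # boundary cells keep their value
    return 0.25 * (u[y - 1][x] + u[y + 1][x] + u[y][x - 1] + u[y][x + 1]
                   - f[y][x])  # grid spacing folded into f for brevity


n = 5
front = [[0.0] * n for _ in range(n)]   # buffer bound as a texture
back = [[0.0] * n for _ in range(n)]    # buffer set as the render target
source = [[-1.0] * n for _ in range(n)]

for _ in range(50):
    render_pass(jacobi_kernel, {"solution": front, "source": source}, back)
    front, back = back, front           # swap roles for the next pass

print(front[2][2])
```

The kernel never writes to a buffer it reads from, which is why the sketch swaps two buffers between passes.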
Optimizing the Solver

Detect steady-state natively on GPU

Minimize shader length

Special-case whenever possible

Avoid context-switching
Optimizing the Solver: Steady-state

How to detect convergence?

L1 norm - average error

L2 norm – RMS error (common in visual sim)

L norm – max error (common in sci/eng apps)

Can use occlusion query!
[Plot: seconds to steady state vs. grid size]
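For reference, the three norms can be computed on the CPU as below. The last function mimics how a pixel-counting query could serve as an L∞ test: if cells whose error is within tolerance are discarded, a count of zero means the grid has converged. That reading of the occlusion-query bullet is an assumption, and the function names are hypothetical.

```python
import math


def residual_norms(r):
    """Return (L1, L2, Linf) for a flat list of per-cell errors."""
    n = len(r)
    l1 = sum(abs(x) for x in r) / n            # average error
    l2 = math.sqrt(sum(x * x for x in r) / n)  # RMS error
    linf = max(abs(x) for x in r)              # max error
    return l1, l2, linf


def cells_above_tolerance(r, tol):
    """Count cells whose error exceeds tol (a stand-in for a pixel count)."""
    return sum(1 for x in r if abs(x) > tol)


errors = [0.002, -0.0005, 0.0001, -0.003]
print(residual_norms(errors))
print(cells_above_tolerance(errors, 0.001) == 0)  # True only once converged
```

The count-based test never needs the error values themselves, only the number of cells that fail the threshold.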
Optimizing the Solver: Shader length

Minimize number of registers used

Vectorize as much as possible

Use the rasterizer to perform computations of
linearly-varying values

Pre-compute invariants on CPU
shader        original fp   fastpath fp   fastpath vp
smooth        79-6-1        20-4-1        12-2
residual      45-7-0        16-4-0        11-1
restrict      66-6-1        21-3-0        11-1
interpolate   93-6-1        25-3-0        13-2
Optimizing the Solver: Special-case

Fast-path vs. slow-path

write several variants of each fragment program
to handle boundary cases

eliminates conditionals in the fragment program

equivalent to avoiding CPU inner-loop branching
[Figure: fast path (no boundaries) vs. slow path (with boundaries)]
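The CPU analogy mentioned above can be sketched as follows. One version tests for the boundary at every cell. The other runs a branch-free loop over the interior and handles border cells separately. Treating out-of-domain neighbors as zero is an assumption made for the example, and the function names are hypothetical.

```python
def laplacian_branchy(u):
    """Single kernel for every cell, with a boundary test per neighbor."""
    n = len(u)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            up = u[i - 1][j] if i > 0 else 0.0
            down = u[i + 1][j] if i < n - 1 else 0.0
            left = u[i][j - 1] if j > 0 else 0.0
            right = u[i][j + 1] if j < n - 1 else 0.0
            out[i][j] = up + down + left + right - 4.0 * u[i][j]
    return out


def laplacian_split(u):
    """Fast path for interior cells, slow path only for border cells."""
    n = len(u)
    out = [[0.0] * n for _ in range(n)]

    # Fast path: no conditionals inside the inner loop.
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = (u[i - 1][j] + u[i + 1][j]
                         + u[i][j - 1] + u[i][j + 1] - 4.0 * u[i][j])

    # Slow path: bounds-checked reads, but only along the border.
    def at(i, j):
        return u[i][j] if 0 <= i < n and 0 <= j < n else 0.0

    border = [(i, j) for i in (0, n - 1) for j in range(n)]
    border += [(i, j) for i in range(1, n - 1) for j in (0, n - 1)]
    for i, j in border:
        out[i][j] = (at(i - 1, j) + at(i + 1, j)
                     + at(i, j - 1) + at(i, j + 1) - 4.0 * u[i][j])
    return out


n = 6
u = [[float(i * i + 3 * j) for j in range(n)] for i in range(n)]
print(laplacian_branchy(u) == laplacian_split(u))
```

The border holds O(n) cells out of O(n²), so nearly all of the work runs on the branch-free path.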
[Plot: seconds per V-cycle vs. grid size]
Optimizing the Solver: Context-switching

Find the best packing of data from multiple grid levels
into the pbuffer surfaces
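As an illustration of what such a packing might look like, the sketch below places the finest level at the origin and stacks each coarser level in a column beside it, similar to a mipmap atlas. This is one hypothetical layout, not necessarily the one used in the talk. It assumes power-of-two level sizes, and the function name is invented for the example.

```python
def pack_levels(n, levels):
    """Return one (x, y, size) rectangle per grid level.

    Level 0 is n x n at the origin; coarser levels halve in size and are
    stacked top to bottom in a column starting at x = n.
    """
    rects = [(0, 0, n)]
    y = 0
    size = n // 2
    for _ in range(1, levels):
        rects.append((n, y, size))
        y += size
        size //= 2
    return rects


for level, (x, y, size) in enumerate(pack_levels(256, 4)):
    print(level, x, y, size)
```

The coarser levels sum to less than n in height, so the whole hierarchy fits in a surface of n + n/2 by n texels. Every level can then be read or written without switching to another surface.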
Optimizing the Solver: Context-switching

Remove context switching

Can introduce operations with undefined results:
reading/writing same surface

Why do we need to do this?

Can we get away with it?

What about superbuffers?
Data Layout

Performance:
[Plot: seconds to steady state vs. grid size]
Data Layout

Possible additional vectorization: Stacked domain

Compute 4 values at a time

Requires source, residual, solution values to be in different buffers

Complicates boundary calculations

Adds setup and teardown overhead
Results: CPU vs. GPU

Performance:
[Plot: seconds to steady state vs. grid size]
Conclusions
What we need going forward:


Superbuffers

or: Universal support for multiple-surface pbuffers

or: Cheap context switching
Developer tools

Debugging tools

Documentation

Global accumulator

Ever increasing amounts of precision, memory

Textures bigger than 2048 on a side
Acknowledgements
Hardware

David Kirk

Matt Papakipos
General-purpose GPU

Mark Harris

Aaron Lefohn

Ian Buck
Driver Support

Nick Triantos

Pat Brown

Stephen Ehmann
Fragment Programming

James Percy

Matt Pharr
Funding

NSF Award #0092793