
A Survey of General-Purpose
Computation on Graphics
Hardware
John Owens, University of California, Davis
David Luebke, University of Virginia
with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell
Motivation: The Potential of GPGPU
• In short:
• The power and flexibility of GPUs makes them an
attractive platform for general-purpose computation
• Example applications range from in-game physics
simulation to conventional computational science
• Goal: make the inexpensive power of the GPU
available to developers as a sort of computational
coprocessor
2
Problems: Difficult To Use
• GPUs designed for & driven by video games
• Programming model unusual
• Programming idioms tied to computer graphics
• Programming environment tightly constrained
• Underlying architectures are:
• Inherently parallel
• Rapidly evolving (even in basic feature set!)
• Largely secret
• Can’t simply “port” CPU code!
3
STAR Goals
• Detailed & useful survey of general-purpose
computing on graphics hardware
• Hardware and software developments behind GPGPU
• Building blocks: techniques for mapping general-purpose computation to the GPU
• Applications: important applications of GPGPU
• A comprehensive GPGPU bibliography (current
through Summer 2005…)
4
NVIDIA GeForce 6800 3D Pipeline
[Block diagram: vertex processors → triangle setup → Z-cull → shader instruction dispatch → fragment processors (with L2 texture cache) → fragment crossbar → memory partitions → composite]
Courtesy Nick Triantos, NVIDIA
5
Programming a GPU for Graphics
• Application specifies geometry → rasterized into fragments
• Each fragment is shaded with a SIMD program
• Shading can use values from texture memory
• Image can be used as a texture on future passes
6
Programming a GPU for GP Programs
• Draw a screen-sized quad → stream of fragments
• Run a SIMD kernel over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as a texture on the next pass
7
Feedback
• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step
8
CPU-GPU Analogies
.
CPU
.
.
Grid[i][j]= x;
.
.
.
Array Write
GPU
=
Render to Texture
9
CPU-GPU Analogies
CPU: Stream / Data Array  =  GPU: Texture
CPU: Memory Read          =  GPU: Texture Sample
10
Kernels
CPU: Kernel / loop body / algorithm step  =  GPU: Fragment Program
11
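To make the analogy above concrete, here is a minimal C++ sketch (the grid size, kernel, and names are illustrative, not taken from the survey): the loop body plays the role of the fragment program, and the explicit loop is what the GPU's rasterizer supplies implicitly.

```cpp
#include <vector>
#include <cstddef>

// CPU version: the "kernel" is the loop body, applied to every grid cell.
float kernel(float left, float right, float up, float down) {
    return 0.25f * (left + right + up + down);   // e.g. a simple average
}

int main() {
    const std::size_t W = 256, H = 256;
    std::vector<float> grid(W * H, 0.0f), next(W * H, 0.0f);

    // CPU: explicit loop over the array ("stream"); each iteration is one kernel invocation.
    for (std::size_t y = 1; y + 1 < H; ++y)
        for (std::size_t x = 1; x + 1 < W; ++x)
            next[y * W + x] = kernel(grid[y * W + x - 1], grid[y * W + x + 1],
                                     grid[(y - 1) * W + x], grid[(y + 1) * W + x]);

    // GPU analogue (conceptually): bind `grid` as a texture, draw a screen-sized quad,
    // and let the fragment program run `kernel` once per output pixel, writing `next`
    // via render-to-texture. There is no explicit loop; the hardware supplies it.
    return 0;
}
```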
Scatter vs. Gather
• Grid communication
  • Grid cells share information
• Gather
  • Indirect read from memory (x = a[i])
  • Naturally maps to a texture fetch
  • Used to access data structures and data streams
• Scatter
  • Indirect write to memory (a[i] = x)
  • Difficult to emulate: usually done on the CPU (see the sketch below)
12
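The gather/scatter distinction above can be sketched in plain C++ (array names and indices are illustrative): gather is an indirect read, which a fragment program can perform as a dependent texture fetch; scatter is an indirect write, which it cannot.

```cpp
#include <vector>
#include <cstddef>

int main() {
    std::vector<float> a(16, 1.0f), out(16, 0.0f);
    std::vector<std::size_t> idx = {3, 0, 7, 2};

    // Gather: indirect READ (x = a[i]).  On the GPU this is just a dependent
    // texture fetch: the fragment program computes an address and samples it.
    for (std::size_t k = 0; k < idx.size(); ++k) {
        float x = a[idx[k]];
        out[k] = x;
    }

    // Scatter: indirect WRITE (a[i] = x).  A fragment program cannot do this,
    // because its output address is fixed to the pixel being shaded, so scatter
    // is usually reformulated as a gather or done on the CPU.
    for (std::size_t k = 0; k < idx.size(); ++k) {
        a[idx[k]] = out[k];
    }
    return 0;
}
```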
Computational Resources Inventory
• Programmable parallel processors
• Vertex & Fragment pipelines
• Rasterizer
• Mostly useful for interpolating addresses (texture
coordinates) and per-vertex constants
• Texture unit
• Read-only memory interface
• Render to texture
• Write-only memory interface
13
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
• Can change the location of current vertex
• Cannot read info from other vertices
• Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
• Random access memory for vertices
• Gather (But not from the vertex stream itself)
14
Fragment Processor
• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel
  • Paper details ways to synthesize scatter
• Typically more useful than vertex processor
  • More fragment pipelines than vertex pipelines
  • Direct output (fragment processor is at end of pipeline)
15
Building Blocks & Applications
GPGPU Building Blocks
• The STAR covers the following fundamental
techniques & computational building blocks:
• Flow control (a very fundamental building block)
• Stream operations
• Data structures
• Differential equations & linear algebra
• Data queries
• I’ll discuss each in a bit more detail
17
Flow control
• Surprising number of issues on GPUs
• Main themes:
• Avoid branching when possible
• Move branching earlier in the pipeline when possible
• Largely SIMD → coherent branching most efficient
• Mechanisms:
• Rasterized geometry
• Z-cull
• Occlusion query
18
Domain Decomposition
• Avoid branches where outcome is fixed
• One region is always true, another false
• Separate FPs for each region, no branches
• Example: boundaries (see the sketch after this slide)
19
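A CPU-side sketch of the same idea, under illustrative names: instead of one kernel with a per-cell branch, precompute which cells belong to each region and run a separate branch-free "fragment program" over each, just as the GPU version draws separate geometry for interior and boundary cells.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical per-region "fragment programs": no branch inside either one.
float interiorKernel(float v) { return v * 0.5f; }
float boundaryKernel(float v) { return 0.0f * v; }   // e.g. clamp boundary cells

int main() {
    const std::size_t W = 64, H = 64;
    std::vector<float> grid(W * H, 1.0f);

    // Precompute which cells fall in each region (done once, like drawing
    // separate geometry for interior and boundary on the GPU).
    std::vector<std::size_t> interior, boundary;
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            (x == 0 || y == 0 || x == W - 1 || y == H - 1 ? boundary : interior)
                .push_back(y * W + x);

    // Two branch-free passes instead of one pass with a per-cell branch.
    for (std::size_t i : interior) grid[i] = interiorKernel(grid[i]);
    for (std::size_t i : boundary) grid[i] = boundaryKernel(grid[i]);
    return 0;
}
```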
Flat 3D Textures
20
Flat 3D Textures
• Advantages
• One texture update per operation
• Better use of GPU parallelism
• Non-power-of-two Textures
• Quick simulation preview
• Disadvantage
• Must compute texture offsets
21
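One way the texture-offset computation might look, as a hedged C++ sketch (the tiling layout and the Flat3D struct are assumptions for illustration): each z-slice of the volume is stored as a tile inside one larger 2D texture, so a 3D cell address must be flattened to a 2D texel address.

```cpp
#include <cstdio>

// Layout assumptions (illustrative): an N*N*N volume stored as N slices,
// tiled into a tilesPerRow-wide grid inside one 2D texture.
struct Flat3D {
    int n;            // volume resolution per axis
    int tilesPerRow;  // how many z-slices sit side by side in the 2D texture
};

// Map a 3D cell (x, y, z) to its 2D texel address in the flat texture.
void flatten(const Flat3D& t, int x, int y, int z, int& u, int& v) {
    int tileX = z % t.tilesPerRow;
    int tileY = z / t.tilesPerRow;
    u = tileX * t.n + x;
    v = tileY * t.n + y;
}

int main() {
    Flat3D tex{32, 8};           // 32^3 volume, 8 slices per row -> 256x128 texture
    int u, v;
    flatten(tex, 5, 12, 20, u, v);
    std::printf("(5,12,20) -> texel (%d,%d)\n", u, v);  // offset into the flat texture
    return 0;
}
```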
Staggered Simulation
• Non-interactive application:
• Simulate as fast as possible
• Frame rate suffers
[timeline figure: 20 ms frames]
22
Staggered Simulation
• Interactive frame rate!
• Simulation still proceeds pretty fast
[timeline figure: simulation work spread across 20 ms frames]
23
Z-Cull
• In early pass, modify depth buffer
• Write depth=0 for pixels that should not be modified
by later passes
• Write depth=1 for rest
• Subsequent passes
• Enable depth test (GL_LESS)
• Draw full-screen quad at z=0.5
• Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Note: depth replace disables Z-cull
24
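A minimal OpenGL sketch of the passes described above (it assumes a GL context, render targets, and fragment programs are already set up; only the depth-test plumbing is shown):

```cpp
// Legacy (GL 1.x, immediate-mode) sketch of the Z-cull trick on this slide.
#include <GL/gl.h>

void markPass() {
    // Early pass: write depth = 0 where later passes should be skipped,
    // depth = 1 everywhere else (e.g. by drawing masking geometry).
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_ALWAYS);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  // touch only depth
    // ... draw geometry covering the "skip" pixels at z = 0,
    //     and the rest of the domain at z = 1 ...
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
}

void computePass() {
    // Subsequent passes: full-screen quad at z = 0.5 with GL_LESS, so only
    // pixels whose stored depth is 1 survive; the rest are Z-culled before
    // the (expensive) fragment program runs.
    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LESS);
    glBegin(GL_QUADS);
      glVertex3f(-1.0f, -1.0f, 0.5f);
      glVertex3f( 1.0f, -1.0f, 0.5f);
      glVertex3f( 1.0f,  1.0f, 0.5f);
      glVertex3f(-1.0f,  1.0f, 0.5f);
    glEnd();
}
```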
Pre-computation
• Pre-compute anything that will not change
every iteration!
• Example: arbitrary boundaries
• When user draws boundaries, compute texture
containing boundary info for cells
• Reuse that texture until boundaries modified
• Combine with Z-cull for higher performance!
25
Stream Operations
• Several stream operations in GPGPU toolkit:
• Map: apply a function to every element in a stream
• Reduce: use a function to reduce a stream to a
smaller stream (often 1 element)
• Scatter/gather: indirect read and write
• Filter: select a subset of elements in a stream
• Sort: order elements in a stream
• Search: find a given element, nearest neighbors, etc
26
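A CPU sketch of a few of these operations (names and values are illustrative): map is a single pass over the stream, reduce is shown as the repeated pairwise halving the GPU performs over O(log n) passes, and filter illustrates stream compaction.

```cpp
#include <vector>
#include <algorithm>
#include <iterator>
#include <cstddef>
#include <cstdio>

int main() {
    std::vector<float> stream = {3, 1, 4, 1, 5, 9, 2, 6};   // power-of-two length

    // Map: apply a function to every element (one GPU pass over a quad).
    std::transform(stream.begin(), stream.end(), stream.begin(),
                   [](float x) { return x * 2.0f; });

    // Reduce: combine a stream into a smaller one.  On the GPU this is done
    // in O(log n) passes, each pass halving the stream by summing pairs.
    std::vector<float> r = stream;
    while (r.size() > 1) {
        std::vector<float> half(r.size() / 2);
        for (std::size_t i = 0; i < half.size(); ++i)
            half[i] = r[2 * i] + r[2 * i + 1];
        r.swap(half);
    }
    std::printf("sum = %f\n", r[0]);

    // Filter: keep a subset of elements (on the GPU this needs extra work,
    // e.g. a scan to compact the surviving elements).
    std::vector<float> big;
    std::copy_if(stream.begin(), stream.end(), std::back_inserter(big),
                 [](float x) { return x > 8.0f; });
    return 0;
}
```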
Simple Fire Effect
• Blur and scroll upward
• Trails of blur emerge from bright source ‘embers’ at the bottom
[figure: blur samples neighboring texels VA, VB, VC, VD]
27
Cellular Automata
• Great for generating noise and other animated patterns to use in blending
• Game of Life in a Pixel Shader
  • Cell ‘state’ relative to the rules is computed at each texel
  • Dependent texture read
  • State accesses ‘rules’ table, which is a texture
  • Highly complex rules are easy!
The Rules
• For a space that is 'populated':
  • Each cell with one or no neighbors dies, as if by loneliness.
  • Each cell with four or more neighbors dies, as if by overpopulation.
  • Each cell with two or three neighbors survives.
• For a space that is 'empty' or 'unpopulated':
  • Each cell with three neighbors becomes populated.
28
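As an illustration of the per-texel computation, here is a hedged C++ stand-in for the Game of Life kernel (the wrap-around boundary and glider seed are illustrative choices): one call of lifeKernel corresponds to one fragment, and the double-buffered grids correspond to ping-ponged textures.

```cpp
#include <vector>
#include <cstddef>

// One "fragment program" invocation: compute the next state of a single cell
// from its 8 neighbors, following the rules above.  On the GPU the neighbor
// reads are texture fetches and the grid is a texture ping-ponged each pass.
int lifeKernel(const std::vector<int>& g, int w, int h, int x, int y) {
    int n = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            if (dx || dy)
                n += g[((y + dy + h) % h) * w + ((x + dx + w) % w)];  // toroidal wrap
    int alive = g[y * w + x];
    if (alive) return (n == 2 || n == 3) ? 1 : 0;   // survives, else dies
    return (n == 3) ? 1 : 0;                        // birth
}

int main() {
    const int W = 64, H = 64;
    std::vector<int> grid(W * H, 0), next(W * H, 0);
    grid[1 * W + 2] = grid[2 * W + 3] = grid[3 * W + 1] =
    grid[3 * W + 2] = grid[3 * W + 3] = 1;          // a glider

    // CPU stand-in for drawing a full-screen quad: run the kernel per cell.
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            next[y * W + x] = lifeKernel(grid, W, H, x, y);
    grid.swap(next);
    return 0;
}
```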
Lattice Computations
• Greg’s been talking about them
• How far can we take them?
• Anything we can describe with discrete PDEs!
• Discrete in space and time
• Also other approximations
29
Approximate Methods
• Several different approximations
• Cellular Automata (CA)
• Coupled Map Lattice (CML)
• Lattice-Boltzmann Methods (LBM)
• Greg talked about CA
• I’ll talk about CML
30
Coupled Map Lattice
• Mapping:
• Continuous state → lattice nodes
• Coupling:
• Nodes interact with each other to produce new state
according to specified rules
31
Coupled Map Lattice
• CML introduced by Kaneko (1980s)
• Used CML to study spatio-temporal chaos
• Others adapted CML to physical simulation:
• Boiling [Yanagita 1992]
• Convection [Yanagita 1993]
• Clouds [Yanagita 1997; Miyazaki 2001]
• Chemical reaction-diffusion [Kapral 1993]
• Saltation (sand ripples / dunes) [Nishimori 1993]
• And more
32
CML vs. CA
• CML extends cellular automata (CA)

          CA          CML
  SPACE   Discrete    Discrete
  TIME    Discrete    Discrete
  STATE   Discrete    Continuous
33
CML vs. CA
• Continuous state is more useful
• Discrete: physical quantities difficult
• Must filter over many nodes to get “real”
values
• Continuous: physical quantities easy
• Real physical values at each node
• Temperature, velocity, concentration, etc.
34
Rules?
• CML updated via simple, local rules
• Simple: same rule applied at every cell (SIMD)
• Local: cells updated according to some function of
their neighbors’ state
35
Example: Buoyancy
• Used in temperature-based boiling
simulation
• At each cell:
• If neighbors to left and right of cell are warmer,
raise the cell’s temperature
• If neighbors are cooler, lower its temperature
36
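A minimal C++ stand-in for this buoyancy rule (the coupling constant and wrap-around boundary are illustrative, not Yanagita's exact formulation): each cell is pulled toward the average of its left and right neighbors' temperatures, so warmer neighbors raise it and cooler neighbors lower it.

```cpp
#include <vector>

// Illustrative buoyancy step: one "fragment" = one cell update.
void buoyancyStep(const std::vector<float>& T, std::vector<float>& Tnext,
                  int w, int h, float c /* coupling strength */) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float left  = T[y * w + (x + w - 1) % w];
            float right = T[y * w + (x + 1) % w];
            float here  = T[y * w + x];
            // If both neighbors are warmer, this term is positive and the cell
            // warms; if both are cooler, it cools.
            Tnext[y * w + x] = here + c * ((left - here) + (right - here)) * 0.5f;
        }
    }
}

int main() {
    const int W = 128, H = 128;
    std::vector<float> T(W * H, 20.0f), Tnext(W * H, 0.0f);
    buoyancyStep(T, Tnext, W, H, 0.1f);
    return 0;
}
```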
CML Operations
• Implement operations as building blocks for use in
multiple simulations
• Diffusion
• Buoyancy (2 types)
• Latent Heat
• Advection
• Viscosity / Pressure
• Gray-Scott Chemical Reaction
• Boundary Conditions
• User interaction (drawing)
• Transfer function (color gradient)
37
Anatomy of a CML operation
• Neighbor Sampling
• Select and read values, v, of nearby cells
• Computation on Neighbors
• Compute f(v) for each sample (f can be arbitrary
computation)
• Combine new values (arithmetic)
• Store new values back in lattice
38
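The four steps above can be sketched generically in C++ (function names and the averaging combine rule are illustrative): sample neighbors, apply an arbitrary per-sample function f, combine the results arithmetically, and store them back, which is what one fragment-program pass does per cell.

```cpp
#include <vector>
#include <functional>

// Sketch of the four steps listed above for one generic CML operation.
// `f` is the arbitrary per-sample computation; the combine step here is a
// plain average.  Names and the averaging rule are illustrative.
void cmlOperation(const std::vector<float>& lattice, std::vector<float>& out,
                  int w, int h, const std::function<float(float)>& f) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            // 1. Neighbor sampling: read values v of nearby cells (texture fetches).
            float v[4] = {
                lattice[y * w + (x + w - 1) % w], lattice[y * w + (x + 1) % w],
                lattice[((y + h - 1) % h) * w + x], lattice[((y + 1) % h) * w + x]
            };
            // 2. Computation on neighbors: f(v) for each sample.
            // 3. Combine the new values (arithmetic).
            float acc = 0.0f;
            for (float s : v) acc += f(s);
            // 4. Store the result back in the lattice (render to texture).
            out[y * w + x] = 0.25f * acc;
        }
    }
}

int main() {
    const int W = 64, H = 64;
    std::vector<float> lattice(W * H, 1.0f), out(W * H, 0.0f);
    cmlOperation(lattice, out, W, H, [](float v) { return v; });  // diffusion-like average
    return 0;
}
```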
Graphics Hardware
• Why use it?
• Speed: up to 25x speedup in our sims
• GPU perf. grows faster than CPU perf.
• Cheap: GeForce 4 Ti 4200 < $130
• Load balancing in complex applications
• Why not use it?
• Low precision computation (not anymore!)
• Difficult to program (not anymore!)
39
Hardware Implementation (GF4)
40
Simulating the world
• Simulate a wide variety of phenomena on
GPUs
• Anything we can describe with discrete PDEs or
approximations of PDEs