CS 446: Real-Time Rendering & Game Technology
David Luebke
University of Virginia
• Today: Matthew Rodgers
• That’s it!
• New job: NVIDIA Research
• New e-mail: dave@luebke.us
• GPGPU: “General Purpose GPU Computing”
• Active, exciting area of research and development
• A personal interest of mine
• Following slides taken from a recent talk I co-presented in Dublin
– Accompanying this paper
A Survey of General-Purpose Computation on Graphics Hardware
John Owens
University of California, Davis
David Luebke
University of Virginia
with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell
Introduction
• The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor
• Programmability
• Precision
• Power
• GPGPU: an emerging field seeking to harness GPUs for general-purpose computation
Motivation: Computational Power
• GPUs are fast…
• 3.0 GHz dual-core Pentium 4: 24.6 GFLOPS
• NVIDIA GeForce 7800 GTX: 165 GFLOPS
• 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s memory bandwidth
• ATI Radeon X850 XT Platinum Edition: 37.8 GB/s memory bandwidth
• GPUs are getting faster, faster
• CPUs: 1.4× annual growth
• GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth
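To make the gap between these growth rates concrete, the following Python sketch compounds each annual factor over several years. The five-year horizon and the function name are illustrative assumptions, not taken from the slides.

```python
def compound(annual_factor, years):
    """Total speedup after `years` of steady growth at `annual_factor` per year."""
    return annual_factor ** years

YEARS = 5  # illustrative horizon (an assumption, not from the slides)

# Annual growth factors quoted on the slide.
for label, factor in [("CPU", 1.4), ("GPU (pixels)", 1.7), ("GPU (vertices)", 2.3)]:
    print(f"{label}: {compound(factor, YEARS):.1f}x after {YEARS} years")
```

Even the slower GPU rate pulls well ahead of the CPU rate once it is compounded over a few years, which is the point of "getting faster, faster".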
Motivation: Computational Power
[Figure omitted; thanks to Ian Buck]
Motivation: Computational Power
[Figure omitted; thanks to John Owens]
Motivation: Flexible and Precise
• Modern GPUs are deeply programmable
• Programmable pixel, vertex, video engines
• Solidifying high-level language support
• Modern GPUs support high precision
• 32-bit floating point throughout the pipeline
• High enough for many (not all) applications
Motivation: The Potential of GPGPU
• In short:
• The power and flexibility of GPUs make them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Problems: Difficult To Use
• GPUs designed for & driven by video games
• Programming model unusual
• Programming idioms tied to computer graphics
• Programming environment tightly constrained
• Underlying architectures are:
• Inherently parallel
• Rapidly evolving (even in basic feature set!)
• Largely secret
• Can’t simply “port” CPU code!
Problems: Not A Panacea
• GPUs are fast because they are specialized
• Poorly suited to sequential, “pointer-chasing” code
• Missing support for some basic functionality
• E.g. integers, bitwise operations, indexed write
• More on limitations & difficulties later
STAR Goals
• Detailed & useful survey of general-purpose computing on graphics hardware
• Hardware and software developments behind GPGPU
• Building blocks: techniques for mapping general-purpose computation to the GPU
• Applications: important applications of GPGPU
• A comprehensive GPGPU bibliography (current through Summer 2005…)
Architectural Considerations
The past: 1987
[Figure: 20 MIPS CPU, 1987; courtesy Anant Agarwal]
The future: 2007
[Figure: 1 Billion Transistors, 2007; courtesy Anant Agarwal]
Today’s VLSI Capability: Keys to High Performance
[Figure labels: 64-bit FPU (to scale), 0.5mm, 50pJ/FLOP, 1 clock, 90nm chip, $200, 1GHz, 12mm. Courtesy Pat Hanrahan]
For High Performance, We Must …
1. Exploit Ample Computation!
2. Require Efficient Communication!
[Figure labels: 64-bit FPU (to scale), 0.5mm, 50pJ/FLOP, 1 clock, 90nm chip, $200, 1GHz, 12mm. Courtesy Pat Hanrahan]
Stream Programming Abstraction
• Let’s think about our problem in a new way
• Goal: SW programming model that matches today’s VLSI
• Streams
• Collection of data records
• All data is expressed in streams
• Kernels
• Inputs/outputs are streams
• Perform computation on streams
• Can be chained together
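The stream and kernel abstraction can be sketched in a few lines of Python. This is a minimal CPU-side illustration rather than GPU code; the kernel names (`kernel_scale`, `kernel_offset`) and the sample data are hypothetical.

```python
def kernel_scale(stream, factor=2.0):
    """Kernel: consumes a stream and produces a stream, one output per record."""
    for record in stream:
        yield record * factor

def kernel_offset(stream, offset=1.0):
    """A second kernel with the same stream-in, stream-out shape."""
    for record in stream:
        yield record + offset

input_stream = [1.0, 2.0, 3.0]

# Chaining: the output stream of one kernel is the input stream of the next.
output_stream = list(kernel_offset(kernel_scale(input_stream)))
print(output_stream)
```

Because each kernel touches one record at a time and keeps no state between records, the records could in principle be processed in any order or in parallel.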
Why Streams?
• Ample computation by exposing parallelism
• Streams expose data parallelism
• Multiple stream elements can be processed in parallel
• Pipeline (task) parallelism
• Multiple tasks can be processed in parallel
• Kernels yield high arithmetic intensity
• Efficient communication
• Producer-consumer locality
• Predictable memory access pattern
• Optimize for throughput of all elements, not latency of one
• Processing many elements at once allows latency hiding
GPU: Special-Purpose Graphics Hardware
• Task-parallel organization
• Assign each task to processing unit
• Hardwire each unit to specific task - huge performance advantage!
• Provides ample computation resources
• Efficient communication patterns
• Dominant graphics architecture
[ATI Flipper: 51M transistors]
The Rendering Pipeline
Application
• Compute 3D geometry
• Make calls to graphics API
Geometry, on GPU
• Transform geometry from 3D to 2D (in parallel)
Rasterization, on GPU
• Generate fragments from 2D geometry (in parallel)
Composite, on GPU
• Combine fragments into image
The Programmable Rendering Pipeline
Application
• Compute 3D geometry
• Make calls to graphics API
Geometry (Vertex), on GPU
• Transform geometry from 3D to 2D; vertex programs
Rasterization (Fragment), on GPU
• Generate fragments from 2D geometry; fragment programs
Composite, on GPU
• Combine fragments into image
NVIDIA GeForce 6800 3D Pipeline
[Block diagram; labeled units: Vertex, Triangle Setup, Shader Instruction Dispatch, Z-Cull, Fragment, L2 Tex, Fragment Crossbar, Composite, four Memory Partitions]
Courtesy Nick Triantos, NVIDIA
Characteristics of Graphics Apps
• Lots of arithmetic
• Lots of parallelism
• Simple control
• Multiple stages
• Feed forward pipelines
• Latency-tolerant / deep pipelines
• What other applications have these characteristics?
[Pipeline diagram: Application → Command → Geometry → Rasterization → Fragment → Composite → Display]
Today’s Graphics Pipeline
[Pipeline diagram: Application → Command → Geometry → Rasterization → Fragment → Composite → Display]
• Graphics is well suited to:
• The stream programming model
• Stream hardware organizations
• GPUs are a commodity stream processor!
• What if we could apply these techniques to more general-purpose problems?
• GPUs should excel at tasks that require:
• Ample computation
• Regular computation
• Efficient communication
Programming a GPU for Graphics
• Application specifies geometry, which is then rasterized
• Each fragment is shaded w/ SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes
Programming a GPU for GP Programs
• Draw a screen-sized quad to generate the stream of fragments
• Run a SIMD kernel over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass
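The four steps above can be emulated on the CPU to show the shape of a single GPGPU pass. This Python sketch is an illustration under stated assumptions: `run_pass`, `fetch`, and `blur_kernel` are hypothetical names, and clamp-to-edge addressing is one choice among several that real texture units offer.

```python
def fetch(texture, x, y):
    """Texture read ("gather") with clamp-to-edge addressing."""
    height, width = len(texture), len(texture[0])
    return texture[min(max(y, 0), height - 1)][min(max(x, 0), width - 1)]

def run_pass(kernel, texture):
    """Emulate one pass: invoke `kernel` once per fragment of a screen-sized
    quad and collect the outputs in a new buffer."""
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)] for y in range(height)]

def blur_kernel(texture, x, y):
    """Each fragment gathers itself plus its left and right neighbors."""
    return (fetch(texture, x - 1, y) + fetch(texture, x, y) + fetch(texture, x + 1, y)) / 3.0

source = [[0.0, 3.0, 0.0, 0.0]]
result = run_pass(blur_kernel, source)       # first pass
second = run_pass(blur_kernel, result)       # result buffer reused as the input texture
print(result)
```

Note that the kernel may read from any address but writes only to its own output position, which matches the gather-only restriction of the fragment processor.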
Feedback
• Each algorithm step depends on the results of previous steps
• Each time step depends on the results of the previous time step
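A common way to express this feedback is to alternate between two buffers, reading from one and writing to the other (often called ping-ponging). The Python sketch below assumes a simple 1D relaxation as the time step; the update rule and the names `step` and `simulate` are illustrative choices, not from the slides.

```python
def step(src, dst):
    """One time step: write into dst using only values read from src."""
    dst[0], dst[-1] = src[0], src[-1]            # end cells stay fixed
    for i in range(1, len(src) - 1):
        dst[i] = 0.5 * (src[i - 1] + src[i + 1])

def simulate(initial, num_steps):
    """Run num_steps passes, swapping the read and write buffers each time."""
    read_buf, write_buf = list(initial), list(initial)
    for _ in range(num_steps):
        step(read_buf, write_buf)                     # "render" into the write buffer
        read_buf, write_buf = write_buf, read_buf     # output becomes the next input
    return read_buf

state = simulate([0.0, 0.0, 0.0, 0.0, 8.0], 2)
print(state)
```

Keeping the read and write buffers separate means no cell ever sees a partially updated neighbor within a step.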
CPU-GPU Analogies
CPU GPU
Grid[i][j] = x;
Array Write = Render to Texture
CPU-GPU Analogies
CPU GPU
Stream / Data Array = Texture
Memory Read = Texture Sample
Kernels
CPU GPU
Kernel / loop body / algorithm step = Fragment Program
Scatter vs. Gather
• Grid communication
• Grid cells share information
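The difference between the two communication patterns is easiest to see side by side. In this Python sketch (function names and data are illustrative), gather computes the address it reads from, while scatter computes the address it writes to.

```python
def gather(source, indices):
    """Gather: output element i READS from a computed address."""
    return [source[j] for j in indices]

def scatter(values, indices, size, fill=0):
    """Scatter: input element i WRITES to a computed address."""
    out = [fill] * size
    for value, j in zip(values, indices):
        out[j] = value
    return out

data = [10, 20, 30, 40]
perm = [2, 0, 3, 1]

gathered = gather(data, perm)               # gathered[i] = data[perm[i]]
scattered = scatter(data, perm, len(data))  # scattered[perm[i]] = data[i]
print(gathered, scattered)
```

For a permutation, a gather with the same indices undoes the scatter, which is one reason scatter can often be rewritten as gather on hardware that supports only the latter.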
Computational Resources Inventory
• Programmable parallel processors
• Vertex & Fragment pipelines
• Rasterizer
• Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants
• Texture unit
• Read-only memory interface
• Render to texture
• Write-only memory interface
Vertex Processor
• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
• Can change the location of current vertex
• Cannot read info from other vertices
• Can only read a small constant memory
• Latest GPUs: Vertex Texture Fetch
• Random access memory for vertices
• Gather (But not from the vertex stream itself)
Fragment Processor
• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
• RAM read (texture fetch), but no RAM write
• Output address fixed to a specific pixel
• Paper details ways to synthesize scatter
• Typically more useful than vertex processor
• More fragment pipelines than vertex pipelines
• Direct output (fragment processor is at end of pipeline)
Building Blocks & Applications
GPGPU Building Blocks
• The STAR covers the following fundamental techniques & computational building blocks:
• Flow control (a very fundamental building block)
• Stream operations
• Data structures
• Differential equations & linear algebra
• Data queries
• I’ll discuss each in a bit more detail
Flow control
• Surprising number of issues on GPUs
• Main themes:
• Avoid branching when possible
• Move branching earlier in the pipeline when possible
• Largely SIMD, so coherent branching is most efficient
• Mechanisms:
• Rasterized geometry
• Z-cull
• Occlusion query
Domain Decomposition
• Avoid branches where outcome is fixed
• One region is always true, another false
• Separate FPs for each region, no branches
• Example: boundaries
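A minimal Python sketch of the idea, assuming a 2D grid whose border cells are held fixed and whose interior cells take the average of their four neighbors. The two kernels stand in for separate fragment programs; all names here are hypothetical.

```python
def interior_kernel(grid, x, y):
    """Interior cells: 4-neighbor average; no bounds checks are needed."""
    return 0.25 * (grid[y][x - 1] + grid[y][x + 1] + grid[y - 1][x] + grid[y + 1][x])

def boundary_kernel(grid, x, y):
    """Boundary cells: keep their value."""
    return grid[y][x]

def run_decomposed(grid):
    """Apply each kernel only over its own region, so neither kernel branches."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    # Region 1: the border (on a GPU, drawn as lines around the domain).
    border = [(x, y) for y in (0, height - 1) for x in range(width)]
    border += [(x, y) for x in (0, width - 1) for y in range(1, height - 1)]
    for x, y in border:
        out[y][x] = boundary_kernel(grid, x, y)
    # Region 2: the interior (on a GPU, drawn as a slightly smaller quad).
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = interior_kernel(grid, x, y)
    return out

grid = [[4.0, 4.0, 4.0],
        [4.0, 0.0, 4.0],
        [4.0, 4.0, 4.0]]
result = run_decomposed(grid)
print(result[1][1])
```

The decision about which rule applies is made once, by the geometry that is drawn, instead of once per fragment inside the kernel.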
Z-Cull
• In early pass, modify depth buffer
• Write depth=0 for pixels that should not be modified by later passes
• Write depth=1 for rest
• Subsequent passes
• Enable depth test (GL_LESS)
• Draw full-screen quad at z=0.5
• Only pixels with previous depth=1 will be processed
• Can also use early stencil test
• Note: Depth replace disables ZCull
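The masking logic of the recipe above can be checked with a small CPU emulation. This Python sketch models only the depth comparison, not the hardware's coarse-grained early rejection; `masked_pass` and the sample mask are illustrative assumptions.

```python
def depth_test_less(fragment_z, stored_z):
    """GL_LESS: a fragment passes only if it is nearer than the stored depth."""
    return fragment_z < stored_z

def masked_pass(kernel, data, depth_buffer, quad_z=0.5):
    """Run `kernel` only where the quad at depth quad_z passes the depth test."""
    out = list(data)
    for i, stored_z in enumerate(depth_buffer):
        if depth_test_less(quad_z, stored_z):    # culled fragments never run the kernel
            out[i] = kernel(data[i])
    return out

data = [1.0, 2.0, 3.0, 4.0]
skip = [False, True, False, True]                # hypothetical mask from an early pass
depth = [0.0 if s else 1.0 for s in skip]        # depth=0: leave alone, depth=1: process

result = masked_pass(lambda v: v * 10.0, data, depth)
print(result)
```

With the quad at z=0.5, the comparison 0.5 < 1 passes and 0.5 < 0 fails, so only the pixels marked with depth=1 are updated.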
Pre-computation
• Pre-compute anything that will not change every iteration!
• Example: arbitrary boundaries
• When user draws boundaries, compute texture containing boundary info for cells
• Reuse that texture until boundaries modified
• Combine with Z-cull for higher performance!
Stream Operations
• Several stream operations in GPGPU toolkit:
• Map : apply a function to every element in a stream
• Reduce : use a function to reduce a stream to a smaller stream (often 1 element)
• Scatter/gather : indirect read and write
• Filter : select a subset of elements in a stream
• Sort : order elements in a stream
• Search : find a given element, nearest neighbors, etc
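For reference, each of these operations has a one-line sequential equivalent in Python. The sketch below shows what each operation computes, not how a GPU implements it (GPU versions typically rely on multi-pass techniques); the data and the search target are arbitrary.

```python
from functools import reduce

stream = [5, 3, 8, 1, 9, 2]

mapped = [v * v for v in stream]                   # map: apply a function to every element
total = reduce(lambda a, b: a + b, stream)         # reduce: collapse the stream to 1 element

indices = [5, 0, 3]
gathered = [stream[i] for i in indices]            # gather: indirect read
scattered = [0] * len(stream)
for i, v in zip(indices, gathered):                # scatter: indirect write
    scattered[i] = v

filtered = [v for v in stream if v > 4]            # filter: select a subset
ordered = sorted(stream)                           # sort: order the elements
nearest = min(stream, key=lambda v: abs(v - 7))    # search: element nearest to 7

print(mapped, total, gathered, scattered, filtered, ordered, nearest)
```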
Data Structures
• GPU memory model, iteration, virtualization
• Dense arrays (== textures)
• Sparse arrays
• Static sparsity: pack rows or bands into textures, vertex arrays
• Dynamic sparsity: multidimensional page tables
Data Structures
• Adaptive data structures
• Static: packed uniform grid, stackless k-d tree
• Dynamic: mipmap of page tables, tree-based address translators
• Non-indexable structures
• Open & important problem: stacks, priority queues, sets, linked lists, hash tables, k-d tree construction…
Differential Equations
• Early & common application of GPGPU
• Ordinary differential equations (ODEs): commonly used in particle systems
• Partial differential equations (PDEs): well suited to GPU, especially when solved on dense grids
• Closely related to…
Linear Algebra
• Major application of GPGPU
• Ubiquitous in science, engineering, visual simulation
• Several approaches explored in early GPGPU work
• Memory layout a key consideration
• Pack vectors into 2D textures
• Split matrices, e.g. by columns (dense), bands (banded), into vertex array (sparse)
Data Queries
• Relational database operations on the GPU
• Predicates, Boolean combinations, aggregations
• Join operations accelerated by sorting on join key
• Uses special-purpose depth & stencil hardware extensively
• Attributes of data record stored in (multiple) color channels of (multiple) textures
Applications
• STAR discusses the following broad GPGPU application areas:
• Physically-based simulation
• Signal & image processing
• Global illumination
• Geometric computing
• Databases & data mining
Physically-Based Simulation
• Early work (pre-fully programmable GPU):
• Cellular automata, texture & blending modes
• Now
• Finite difference & finite element methods
• Particle system simulations
• Most popular topic
• Fluids! Incompressible flow, clouds, boiling, etc.
• Also-ran: mass-spring dynamics for cloth
Signal & Image Processing
• Segmentation: identify features in 2D or 3D
• Common example: identify surface of 3D feature (tumor, blood vessel, etc.) in medical scan
• Level-set deformation methods evolve an isosurface
• Need sparse solution methods for efficient solution
• GPGPU bonus: easy to integrate in volume renderer
• Computer vision:
• Image projection, compositing, rectification
• Fast stereo depth extraction
Signal & Image Processing cont.
• Image processing
• Image registration, motion reconstruction, computed tomography (CT), tone mapping
• Core Image, Core Video
• Signal processing
• FFT: an interesting case! Memory bandwidth limited, lack of writable cache harms performance
Global Illumination
• Ray tracing
• Seminal GPGPU papers
• Close to the heart of graphics
• Earliest complex data structures, control flow
• Analysis to inform future hardware design
• Comparison to current efficient CPU implementation
• Key insights
• Ray-triangle intersection maps well to pixel hardware
• Rest of the ray tracing pipeline can also be expressed as a stream computation
Global Illumination cont.
• Photon mapping
• Even more involved data structures
• Introduced stencil routing (scatter), k-nearest neighbor search
• Radiosity
• Iterative LA solvers
• Subsurface scattering
• Lots of work on using GPUs to accelerate/approximate scattering algorithms
Geometric Computing
• Lots of image-space geometric computations
• CSG operations
• Distance fields
• Collision detection
• Sorting for transparency
• Shadow generation
• Heavy use of stencil & depth hardware
Databases & Data Mining
• GPU strengths are useful
• Memory bandwidth
• Parallel processing
• Accelerating SQL queries – 10x improvement
• Also well suited for stream mining
• Continuous queries on streaming data instead of one-time queries on static database
Close-Up: Linear Algebra
Vector representation
– Textures best we can do
• Per-pixel vs. per-vertex operations
– 6 gigapixels/second vs. 0.7 gigavertices/second
– Efficient texture memory cache
– Texture read-write access
[Figure: a vector with elements 1 … N laid out in a 2D texture]
For details, go to: http://wwwcg.in.tum.de
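One plausible way to lay a 1D vector out in a 2D texture is row-major packing into a near-square grid. The Python sketch below is an assumption about the layout, not the scheme from the referenced work; `pack`, `unpack`, and `texel_address` are hypothetical helpers, and the input is assumed to be non-empty.

```python
import math

def pack(vector):
    """Pack a non-empty 1D vector into a near-square 2D 'texture' (row-major),
    zero-padding the unused tail."""
    width = math.ceil(math.sqrt(len(vector)))
    height = math.ceil(len(vector) / width)
    padded = list(vector) + [0.0] * (width * height - len(vector))
    return [padded[r * width:(r + 1) * width] for r in range(height)]

def unpack(texture, n):
    """Recover the first n elements of the packed vector."""
    return [value for row in texture for value in row][:n]

def texel_address(i, width):
    """Map 1D index i to its (row, column) address in the texture."""
    return divmod(i, width)

vec = [1.0, 2.0, 3.0, 4.0, 5.0]
tex = pack(vec)
row, col = texel_address(4, len(tex[0]))
print(tex, tex[row][col])
```

A near-square layout keeps both texture dimensions small, which matters because texture sizes are limited per dimension.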
Dense Matrix representation
– Treat a dense matrix as a set of column vectors
– Again, store these vectors as 2D textures
[Figure: an N × N matrix split into N column vectors; column vector i is stored as the i-th of N 2D textures]
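With the matrix stored as column vectors, a matrix-vector product can be written as a sum of scaled columns, one column per pass. The slides do not spell out this step, so treat the Python sketch below as one possible use of the layout; the function name and the 2 × 2 example are illustrative.

```python
def matvec_by_columns(columns, x):
    """Compute y = A x, where columns[i] is column i of A.
    Each loop iteration corresponds to one pass: y += x[i] * column i."""
    y = [0.0] * len(columns[0])
    for column, xi in zip(columns, x):
        y = [acc + xi * c for acc, c in zip(y, column)]
    return y

# A = [[1, 2],
#      [3, 4]] stored column by column.
A_columns = [[1.0, 3.0], [2.0, 4.0]]
y = matvec_by_columns(A_columns, [1.0, 1.0])
print(y)
```

Each pass is a plain vector-vector operation, so it reuses the same texture machinery as vector addition.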
Banded Sparse Matrix representation
– Treat a banded matrix as a set of diagonal vectors
– Combine opposing vectors to save space
[Figure: the diagonals of a banded N × N matrix become vectors, each stored as a 2D texture; opposing diagonals (i and N-i) share one vector]
• Vector-Vector Operations
– Reduced to 2D texture operations
– Coded in pixel shaders
• Example: Vector 1 + Vector 2 → Vector 3
– Static quad, drawn with a pass-through vertex shader
– Vector 1 bound to TexUnit 0, Vector 2 bound to TexUnit 1
– Pixel shader: return tex0 + tex1
– Result rendered to texture (Vector 3)
Reduce operation for scalar products
[Figure: original texture → 1st pass → 2nd pass → …]
Reduce m x n region in fragment shader
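A scalar product can be built from an element-wise multiply followed by repeated reduction passes. The Python sketch below assumes a 2 x 2 reduction region and a square texture whose side is a power of two; both are simplifying assumptions, and the function names are hypothetical.

```python
def reduce_pass(texture, op):
    """One pass: each output texel combines a 2x2 region of the input."""
    half = len(texture) // 2
    return [[op(op(texture[2 * y][2 * x], texture[2 * y][2 * x + 1]),
                op(texture[2 * y + 1][2 * x], texture[2 * y + 1][2 * x + 1]))
             for x in range(half)] for y in range(half)]

def reduce_texture(texture, op):
    """Repeat passes until one texel remains (side must be a power of two)."""
    while len(texture) > 1:
        texture = reduce_pass(texture, op)
    return texture[0][0]

def dot(tex_a, tex_b):
    """Scalar product: element-wise multiply, then sum-reduce."""
    products = [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(tex_a, tex_b)]
    return reduce_texture(products, lambda p, q: p + q)

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 1.0], [1.0, 1.0]]
print(dot(a, b))
```

Each pass shrinks the texture by a factor of four, so a texture with N texels needs about log4(N) passes.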
In-depth example: Vector / Banded-Matrix Multiplication
[Figure: banded matrix A with vectors b and x]
Compute the result in 2 passes:
[Figure: matrix A with vectors b, b′ and x]
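The diagonal-by-diagonal structure of a banded matrix-vector product can be sketched in Python as follows. This is an illustration under stated assumptions: diagonals are stored in a dictionary keyed by offset, one pass is made per diagonal, and the space-saving trick of combining opposing diagonals is not modeled.

```python
def banded_matvec(diagonals, x):
    """Compute y = A x for a banded matrix.
    `diagonals` maps offset k to a list d of length N with d[i] = A[i][i + k];
    entries that fall outside the matrix are ignored."""
    n = len(x)
    y = [0.0] * n
    for k, d in diagonals.items():          # one pass per diagonal
        for i in range(n):
            j = i + k                       # x is read at a shifted address
            if 0 <= j < n:
                y[i] += d[i] * x[j]
    return y

# Tridiagonal example: A = [[ 2, -1,  0],
#                           [-1,  2, -1],
#                           [ 0, -1,  2]]
diagonals = {
    -1: [0.0, -1.0, -1.0],
    0: [2.0, 2.0, 2.0],
    1: [-1.0, -1.0, 0.0],
}
y = banded_matvec(diagonals, [1.0, 2.0, 3.0])
print(y)
```

Each pass is an element-wise multiply-add between a diagonal and a shifted copy of the vector, which is why the number of passes tracks the number of bands rather than the matrix size.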
Conclusions
Moving Forward …
• What works well now?
• What doesn’t work well now?
• What will improve in the future?
• What will continue to be difficult?
What Runs Well on GPUs?
• GPUs win when …
• Limited data reuse
                Memory BW    Cache BW
P4 3GHz         6 GB/s       44 GB/s
NV GF 6800      36 GB/s      --
• High arithmetic intensity: Defined as math operations per memory op
• Attacks the memory wall - are all mem ops necessary?
• Common error: Not comparing against optimized CPU implementation
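Arithmetic intensity, as defined above, also gives a quick upper bound on throughput when every operand must stream from memory. The Python sketch below is a back-of-the-envelope estimate, not a measurement; the function names are hypothetical, 4-byte words are assumed, and SAXPY (y = a*x + y) is used as the example kernel.

```python
def arithmetic_intensity(math_ops, memory_ops):
    """Math operations per memory operation."""
    return math_ops / memory_ops

def bandwidth_bound_gflops(intensity, bandwidth_gb_s, bytes_per_word=4):
    """Upper bound on GFLOPS if every word must come from memory."""
    gwords_per_second = bandwidth_gb_s / bytes_per_word
    return intensity * gwords_per_second

# SAXPY per element: 2 math ops (multiply, add) and
# 3 memory ops (read x, read y, write y).
saxpy = arithmetic_intensity(2, 3)

# 36 GB/s is the GeForce 6800 memory bandwidth quoted on this slide.
bound = bandwidth_bound_gflops(saxpy, 36.0)
print(f"SAXPY intensity {saxpy:.2f}, bandwidth bound {bound:.1f} GFLOPS")
```

A kernel with low intensity is limited by memory bandwidth no matter how much arithmetic the processor can do, which is why high arithmetic intensity favors the GPU.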
Arithmetic Intensity
Historical growth rates (per year):
• Compute: 71%
• DRAM bandwidth: 25%
• DRAM latency: 5%
[Chart: GFLOPS vs. GFloats/sec for R300, R360, R420, showing a 7x gap; courtesy Ian Buck]
Arithmetic Intensity
GPU wins when…
• Arithmetic intensity
[Chart: GeForce 7800 GTX vs. Pentium 4 3.0 GHz; labeled “Segment: 3.7 ops per word, 11 GFLOPS”; courtesy Ian Buck]
Memory Bandwidth
GPU wins when…
• Streaming memory bandwidth
[Chart: SAXPY and FFT on GeForce 7800 GTX vs. Pentium 4 3.0 GHz; courtesy Ian Buck]
Memory Bandwidth
[Chart: GeForce 7800 GTX vs. Pentium 4]
• Streaming Memory System
• Optimized for sequential performance
• GPU cache is limited
• Optimized for texture filtering
• Read-only
• Small
• Local storage
• CPU >> GPU
[courtesy Ian Buck]
What Will (Hopefully) Improve?
• Orthogonality
• Instruction sets
• Features
• Tools
• Stability
• Interfaces, APIs, libraries, abstractions
• Necessary as graphics and GPGPU converge!
What Won’t Change?
• Rate of progress
• Precision (64b floating point?)
• Parallelism
• Won’t sacrifice performance
• Difficulty of programming parallel hardware
• … but APIs and libraries may help
• Concentration on entertainment apps
Top Ten
GPGPU Top Ten
• The Killer App
• Programming models and tools
• GPU in tomorrow’s computer?
• Data conditionals
• Relationship to other parallel hw/sw
• Managing rapid change in hw/sw (roadmaps)
• Performance evaluation and cliffs
• Philosophy of faults and lack of precision
• Broader toolbox for computation / data structures
• Wedding graphics and GPGPU techniques