GPGPU

CS 446: Real-Time Rendering & Game Technology

David Luebke, University of Virginia


• Today: Matthew Rodgers
• That’s it!



• New job: NVIDIA Research
• New e-mail: [email protected]


Final Topic: GPGPU

• GPGPU: “General Purpose GPU Computing”
• Active, exciting area of research and development
• A personal interest of mine
• Following slides taken from a recent talk I co-presented in Dublin
  – Accompanying this paper

A Survey of General-Purpose Computation on Graphics Hardware

John Owens, University of California, Davis
David Luebke, University of Virginia
with Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, Tim Purcell


The graphics processing unit (GPU) on commodity video cards has evolved into an extremely flexible and powerful processor

• Programmability
• Precision
• Power

GPGPU: an emerging field seeking to harness GPUs for general-purpose computation


Motivation: Computational Power

GPUs are fast…

• 3.0 GHz dual-core Pentium4: 24.6 GFLOPS
• NVIDIA GeForce 7800: 165 GFLOPS
• 1066 MHz FSB Pentium Extreme Edition: 8.5 GB/s
• ATI Radeon X850 XT Platinum Edition: 37.8 GB/s

GPUs are getting faster, faster

• CPUs: 1.4× annual growth
• GPUs: 1.7× (pixels) to 2.3× (vertices) annual growth

Motivation: Computational Power

Thanks to Ian Buck


Motivation: Computational Power

Thanks to John Owens


Motivation: Flexible and Precise

Modern GPUs are deeply programmable

• Programmable pixel, vertex, video engines
• Solidifying high-level language support

Modern GPUs support high precision

• 32 bit floating point throughout the pipeline
• High enough for many (not all) applications

Motivation: The Potential of GPGPU

In short:

• The power and flexibility of GPUs make them an attractive platform for general-purpose computation
• Example applications range from in-game physics simulation to conventional computational science
• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

Problems: Difficult To Use

GPUs designed for & driven by video games

• Programming model unusual
• Programming idioms tied to computer graphics
• Programming environment tightly constrained

Underlying architectures are:

• Inherently parallel
• Rapidly evolving (even in basic feature set!)
• Largely secret

Can’t simply “port” CPU code!


Problems: Not A Panacea

GPUs are fast because they are specialized

• Poorly suited to sequential, “pointer-chasing” code

Missing support for some basic functionality

• E.g. integers, bitwise operations, indexed write

More on limitations & difficulties later


STAR Goals

Detailed & useful survey of general-purpose computing on graphics hardware

• Hardware and software developments behind GPGPU
• Building blocks: techniques for mapping general purpose computation to the GPU
• Applications: important applications of GPGPU
• A comprehensive GPGPU bibliography (current through Summer 2005…)

Architectural Considerations

The past: 1987

[Figure: 20 MIPS CPU, 1987; courtesy Anant Agarwal]

The future: 2007

[Figure: 1 billion transistors, 2007; courtesy Anant Agarwal]

Today’s VLSI Capability: Keys to High Performance


[Figure: 90nm chip with a 64-bit FPU drawn to scale; labels: $200, 1GHz, 12mm, 50pJ/FLOP, 1 clock; courtesy Pat Hanrahan]


For High Performance, We Must …


1. Exploit Ample Computation!
2. Require Efficient Communication!

[Figure: 90nm chip with a 64-bit FPU drawn to scale; labels: $200, 1GHz, 12mm, 1 clock; courtesy Pat Hanrahan]

Stream Programming Abstraction

Let’s think about our problem in a new way

• Goal: SW programming model that matches today’s VLSI

Streams

• Collection of data records
• All data is expressed in streams

Kernels

• Inputs/outputs are streams
• Perform computation on streams
• Can be chained together
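The stream and kernel abstraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_kernel`, `scale`, and `offset` are hypothetical, records are plain floats, and the loop runs sequentially where stream hardware would process records in parallel.

```python
from typing import Callable, Iterable, List

# Assumption: a record is a single float; real streams carry richer records.
Record = float


def run_kernel(kernel: Callable[[Record], Record],
               stream: Iterable[Record]) -> List[Record]:
    """Apply a kernel independently to every record of the input stream.

    The kernel sees one record at a time and shares no state between
    records, which is what lets hardware run the elements in parallel.
    """
    return [kernel(record) for record in stream]


def scale(record: Record) -> Record:
    """Hypothetical kernel: multiply each record by 2."""
    return 2.0 * record


def offset(record: Record) -> Record:
    """Hypothetical kernel: add 1 to each record."""
    return record + 1.0


# Kernels are chained: the output stream of one is the input of the next.
stream_in = [0.0, 1.0, 2.0, 3.0]
stream_mid = run_kernel(scale, stream_in)
stream_out = run_kernel(offset, stream_mid)
```

Because every input and output is a stream, the intermediate result `stream_mid` is consumed immediately by the next kernel, which is the producer-consumer pattern that stream hardware exploits.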

Why Streams?

Ample computation by exposing parallelism

• Streams expose data parallelism
  • Multiple stream elements can be processed in parallel
• Pipeline (task) parallelism
  • Multiple tasks can be processed in parallel
• Kernels yield high arithmetic intensity

Efficient communication

• Producer-consumer locality
• Predictable memory access pattern
• Optimize for throughput of all elements, not latency of one
  • Processing many elements at once allows latency hiding

GPU: Special-Purpose Graphics Hardware

• Task-parallel organization
• Assign each task to processing unit
• Hardwire each unit to specific task: huge performance advantage!
• Provides ample computation resources
• Efficient communication patterns
• Dominant graphics architecture

[Figure: ATI Flipper, 51M transistors]

The Rendering Pipeline

• Application: compute 3D geometry; make calls to graphics API
• Geometry: transform geometry from 3D to 2D (in parallel)
• Rasterization: generate fragments from 2D geometry (in parallel)
• Composite: combine fragments into image

The Programmable Rendering Pipeline

• Application: compute 3D geometry; make calls to graphics API
• Geometry (Vertex): transform geometry from 3D to 2D; vertex programs
• Rasterization (Fragment): generate fragments from 2D geometry; fragment programs
• Composite: combine fragments into image

NVIDIA GeForce 6800 3D Pipeline

[Figure: pipeline block diagram; labels: Vertex, Triangle Setup, Z-Cull, Shader Instruction Dispatch, Fragment, L2 Tex, Fragment Crossbar, Composite, Memory Partition (×4)]

Courtesy Nick Triantos, NVIDIA


Characteristics of Graphics Apps

• Lots of arithmetic
• Lots of parallelism
• Simple control
• Multiple stages
• Feed forward pipelines
• Latency-tolerant / deep pipelines

What other applications have these characteristics?

[Figure: Application → Command → Geometry → Rasterization → Fragment → Composite → Display]

Today’s Graphics Pipeline

[Figure: Application → Command → Geometry → Rasterization → Fragment → Composite → Display]

Graphics is well suited to:

• The stream programming model
• Stream hardware organizations
• GPUs are a commodity stream processor!

What if we could apply these techniques to more general purpose problems?

GPUs should excel at tasks that require:

• Ample computation
• Regular computation
• Efficient communication

Programming a GPU for Graphics

• Application specifies geometry → rasterized
• Each fragment is shaded w/ SIMD program
• Shading can use values from texture memory
• Image can be used as texture on future passes


Programming a GPU for GP Programs

• Draw a screen-sized quad → stream
• Run a SIMD kernel over each fragment
• “Gather” is permitted from texture memory
• Resulting buffer can be treated as texture on next pass



Each algorithm step depends on the results of previous steps

Each time step depends on the results of the previous time step


CPU-GPU Analogies




Grid[i][j] = x;



Array Write = Render to Texture


CPU-GPU Analogies

CPU (left) and GPU (right):

• Stream / Data Array = Texture
• Memory Read = Texture Sample
• Kernel / loop body / algorithm step = Fragment Program
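To make the analogies concrete, here is a CPU-side sketch in Python of a multi-pass computation. It is an illustration only: the 1D smoothing kernel, the names `step` and `simulate`, and the clamped addressing are assumptions, not something taken from the slides. The list `src` plays the role of the input texture, `dst` the render target, and the loop body the fragment program.

```python
def step(src):
    """One rendering pass over a 1D grid.

    The loop body stands in for a fragment program: it may read (gather)
    any cell of src, but it writes exactly one output cell, dst[i].
    """
    n = len(src)
    dst = [0.0] * n
    for i in range(n):                      # on a GPU: one fragment per cell
        left = src[max(i - 1, 0)]           # "texture sample", clamped at the edge
        right = src[min(i + 1, n - 1)]
        dst[i] = 0.25 * left + 0.5 * src[i] + 0.25 * right
    return dst


def simulate(grid, steps):
    """Feedback: each time step reads the buffer written by the previous one."""
    read_buf = list(grid)
    for _ in range(steps):
        write_buf = step(read_buf)          # array write = render to texture
        read_buf = write_buf                # result becomes the next pass's texture
    return read_buf
```

Keeping separate read and write buffers mirrors the constraint that a pass reads from one texture and renders into another.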

Scatter vs. Gather

Grid communication

• Grid cells share information
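The difference between the two access patterns can be shown with a small Python sketch. The function names are hypothetical, and the brute-force conversion in `scatter_as_gather` is chosen only for clarity; it is not claimed to be one of the techniques from the survey.

```python
def gather(data, indices):
    """Gather: each output element reads from a computed address."""
    return [data[i] for i in indices]


def scatter(values, addresses, size):
    """Scatter: each input element writes (here, accumulates) to a computed address."""
    out = [0.0] * size
    for value, address in zip(values, addresses):
        out[address] += value               # indexed write
    return out


def scatter_as_gather(values, addresses, size):
    """The same result using only indexed reads.

    Each output cell searches the inputs for values addressed to it, so
    every write goes to the cell's own fixed location. Brute force, for
    illustration only.
    """
    out = []
    for cell in range(size):                # one "fragment" per output cell
        total = 0.0
        for value, address in zip(values, addresses):
            if address == cell:
                total += value
        out.append(total)
    return out
```

Grid cells that share information with neighbours can usually be written in the gather form, where each cell reads what it needs instead of pushing data into other cells.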

Computational Resources Inventory

Programmable parallel processors

• Vertex & Fragment pipelines

Rasterizer

• Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants

Texture unit

• Read-only memory interface

Render to texture

• Write-only memory interface

Vertex Processor

• Fully programmable (SIMD / MIMD)
• Processes 4-vectors (RGBA / XYZW)
• Capable of scatter but not gather
  • Can change the location of current vertex
  • Cannot read info from other vertices
  • Can only read a small constant memory

Latest GPUs: Vertex Texture Fetch

• Random access memory for vertices
• Gather (but not from the vertex stream itself)

Fragment Processor

• Fully programmable (SIMD)
• Processes 4-component vectors (RGBA / XYZW)
• Random access memory read (textures)
• Capable of gather but not scatter
  • RAM read (texture fetch), but no RAM write
  • Output address fixed to a specific pixel

Paper details ways to synthesize scatter

Typically more useful than vertex processor

• More fragment pipelines than vertex pipelines
• Direct output (fragment processor is at end of pipeline)

Building Blocks & Applications

GPGPU Building Blocks

The STAR covers the following fundamental techniques & computational building blocks:

• Flow control (a fundamental building block)
• Stream operations
• Data structures
• Differential equations & linear algebra
• Data queries

I’ll discuss each in a bit more detail


Flow control

• Surprising number of issues on GPUs
• Main themes:
  • Avoid branching when possible
  • Move branching earlier in the pipeline when possible
  • Largely SIMD, so coherent branching is most efficient

• Rasterized geometry
• Z-cull
• Occlusion query

Domain Decomposition

Avoid branches where outcome is fixed

• One region is always true, another false
• Separate fragment programs (FPs) for each region, no branches

Example: boundaries
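For the boundary case, a minimal Python sketch contrasts a single kernel that branches per cell with two branch-free kernels, each launched only over its own region. The 1D grid, the kernel bodies, and all function names are assumptions made for illustration.

```python
def branching_update(grid):
    """One kernel for the whole grid: every cell evaluates the branch."""
    n = len(grid)
    out = [0.0] * n
    for i in range(n):
        if i == 0 or i == n - 1:            # outcome is fixed by position
            out[i] = 0.0
        else:
            out[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    return out


def boundary_kernel(grid, i):
    """Branch-free program run only on boundary cells."""
    return 0.0


def interior_kernel(grid, i):
    """Branch-free program run only on interior cells."""
    return 0.5 * (grid[i - 1] + grid[i + 1])


def decomposed_update(grid):
    """Domain decomposition: choose the program per region, not per cell."""
    n = len(grid)
    out = [0.0] * n
    for i in (0, n - 1):                    # on a GPU: draw the boundary geometry
        out[i] = boundary_kernel(grid, i)
    for i in range(1, n - 1):               # on a GPU: draw the interior quad
        out[i] = interior_kernel(grid, i)
    return out
```

Both versions compute the same grid; the decomposed one moves the decision out of the per-cell program and into the choice of which geometry to draw.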



In early pass, modify depth buffer

• Write depth=0 for pixels that should not be modified by later passes
• Write depth=1 for rest

Subsequent passes

• Enable depth test (GL_LESS)
• Draw full-screen quad at z=0.5
• Only pixels with previous depth=1 will be processed

Can also use early stencil test

Note: Depth replace disables ZCull
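The masking logic of the depth test can be simulated on the CPU. The sketch below is plain Python, not OpenGL; `masked_pass` and its arguments are hypothetical names, and the comparison `quad_z < stored_depth` models the GL_LESS test described above.

```python
def masked_pass(values, depth, kernel, quad_z=0.5):
    """Run kernel only on pixels whose stored depth passes a GL_LESS-style test.

    depth[i] == 0.0 marks pixels that later passes must not modify;
    depth[i] == 1.0 marks active pixels. A quad drawn at z = 0.5 passes
    the "less than" comparison only against a stored depth of 1.0.
    """
    out = list(values)
    for i, stored_depth in enumerate(depth):
        if quad_z < stored_depth:           # depth test: incoming < stored
            out[i] = kernel(values[i])      # the fragment program runs
        # otherwise the fragment is culled and the old value is kept
    return out


# Usage: only pixels 0 and 2 are active.
result = masked_pass([1, 2, 3, 4], [1.0, 0.0, 1.0, 0.0], lambda v: v * 10)
```

On hardware, the benefit comes from rejecting the masked fragments before the fragment program executes, so the culled pixels cost almost nothing.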



Pre-compute anything that will not change every iteration!

Example: arbitrary boundaries

• When user draws boundaries, compute texture containing boundary info for cells
• Reuse that texture until boundaries modified
• Combine with Z-cull for higher performance!


Stream Operations

Several stream operations in GPGPU toolkit:

• Map: apply a function to every element in a stream
• Reduce: use a function to reduce a stream to a smaller stream (often 1 element)
• Scatter/gather: indirect read and write
• Filter: select a subset of elements in a stream
• Sort: order elements in a stream
• Search: find a given element, nearest neighbors, etc.
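Each operation has a familiar sequential counterpart. The Python lines below show what every operation computes, using only the standard library; they say nothing about how a GPU implements them, and the sample data is made up.

```python
import bisect
from functools import reduce

stream = [5, 3, 8, 1, 9, 2]

# Map: apply a function to every element.
mapped = [x * x for x in stream]

# Reduce: combine all elements into a smaller stream (here one element).
total = reduce(lambda a, b: a + b, stream)

# Gather: indirect read, out[i] = stream[index[i]].
gathered = [stream[i] for i in [4, 0, 0]]

# Scatter: indirect write, out[address[i]] = value[i].
scattered = [0] * 4
for value, address in zip([10, 20, 30], [3, 0, 1]):
    scattered[address] = value

# Filter: keep the subset of elements that satisfy a predicate.
filtered = [x for x in stream if x > 4]

# Sort: order the elements.
ordered = sorted(stream)

# Search: locate a given element in the sorted stream.
position = bisect.bisect_left(ordered, 8)
```

Map and gather fit the fragment processor directly, while reduce, scatter, filter, and sort need multi-pass or other indirect formulations on this generation of hardware.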

Data Structures

• GPU memory model, iteration, virtualization
• Dense arrays (== textures)
• Sparse arrays
  • Static sparsity: pack rows or bands into textures, vertex arrays
  • Dynamic sparsity: multidimensional page tables

Data Structures

Adaptive data structures

• Static: packed uniform grid, stackless k-d tree
• Dynamic: mipmap of page tables, tree-based address translators

Non-indexable structures

• Open & important problem: stacks, priority queues, sets, linked lists, hash tables, k-d tree construction…

Differential Equations

Early & common application of GPGPU

• Ordinary differential equations (ODEs): commonly used in particle systems
• Partial differential equations (PDEs): well suited to GPU, especially when solved on dense grids
• Closely related to…

Linear Algebra

Major application of GPGPU

• Ubiquitous in science, engineering, visual simulation
• Several approaches explored in early GPGPU work

Memory layout a key consideration

• Pack vectors into 2D textures
• Split matrices, e.g. by columns (dense), bands (banded), into vertex array (sparse)

Data Queries

Relational database operations on the GPU

• Predicates, Boolean combinations, aggregations
• Join operations accelerated by sorting on join key

Uses special-purpose depth & stencil hardware extensively

Attributes of data record stored in (multiple) color channels of (multiple) textures



STAR discusses the following broad GPGPU application areas:

• Physically-based simulation
• Signal & image processing
• Global illumination
• Geometric computing
• Databases & data mining

Physically-Based Simulation

Early work (pre-fully programmable GPU):

• Cellular automata, texture & blending modes

• Finite difference & finite element methods
• Particle system simulations

Most popular topic

• Fluids! Incompressible flow, clouds, boiling, etc.
• Also-ran: mass-spring dynamics for cloth

Signal & Image Processing

Segmentation: identify features in 2D or 3D

• Common example: identify surface of 3D feature (tumor, blood vessel, etc.) in medical scan
• Level-set deformation methods evolve an isosurface
• Need sparse solution methods for efficient solution
• GPGPU bonus: easy to integrate in volume renderer

Computer vision:

• Image projection, compositing, rectification
• Fast stereo depth extraction

Signal & Image Processing cont.

Image processing

• Image registration, motion reconstruction, computed tomography (CT), tone mapping
• Core Image, Core Video

Signal processing

• FFT: an interesting case! Memory bandwidth limited, lack of writable cache harms performance

Global Illumination

Ray tracing

• Seminal GPGPU papers
• Close to the heart of graphics
• Earliest complex data structures, control flow
• Analysis to inform future hardware design
• Comparison to current efficient CPU implementation
• Key insights
  • Ray-triangle intersection maps well to pixel hardware
  • Rest of the ray tracing pipeline can also be expressed as a stream computation

Global Illumination cont.

Photon mapping

• Even more involved data structures
• Introduced stencil routing (scatter), k-nearest neighbor search

• Iterative LA solvers

Subsurface scattering

• Lots of work on using GPUs to accelerate/approximate scattering algorithms

Geometric Computing

Lots of image-space geometric computations

• CSG operations
• Distance fields
• Collision detection
• Sorting for transparency
• Shadow generation

Heavy use of stencil & depth hardware


Databases & Data Mining

GPU strengths are useful

• Memory bandwidth
• Parallel processing

Accelerating SQL queries: 10x improvement

Also well suited for stream mining

• Continuous queries on streaming data instead of one-time queries on static database

Close-Up: Linear Algebra

Linear Algebra Data Structures

Vector representation
– 2D textures really rock
  • Per-pixel vs. per-vertex operations
    – 6 gigapixels/second vs. 0.7 gigavertices/second
    – Efficient texture memory cache
    – Texture read-write access

[Figure: a vector with elements 1…N stored as a 2D texture]

For details go:

Representation (cont.)

Dense Matrix representation
– Treat a dense matrix as a set of column vectors
– Again, store these vectors as 2D textures

[Figure: N×N matrix → N vectors → N 2D-textures]
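The packing of vectors and matrix columns into 2D textures comes down to index arithmetic, sketched here in Python. The row-major layout, the zero padding, and the names `pack`, `texel_coord`, `unpack`, and `dense_matrix_to_textures` are assumptions for illustration; the actual layout used in the original work may differ.

```python
import math


def pack(vector, width):
    """Pack a 1D vector row by row into a 2D 'texture' (a list of rows).

    The last row is zero-padded so that every row has `width` texels.
    """
    height = math.ceil(len(vector) / width)
    padded = list(vector) + [0.0] * (height * width - len(vector))
    return [padded[r * width:(r + 1) * width] for r in range(height)]


def texel_coord(i, width):
    """Texture coordinate (row, column) that holds vector element i."""
    return divmod(i, width)


def unpack(texture, n):
    """Recover the first n vector elements from a packed texture."""
    return [texel for row in texture for texel in row][:n]


def dense_matrix_to_textures(matrix, width):
    """Store each column of a dense matrix (given as a list of rows) as its own texture."""
    n_cols = len(matrix[0])
    return [pack([row[c] for row in matrix], width) for c in range(n_cols)]
```

A roughly square texture keeps both dimensions within hardware texture size limits, which a single very long 1D texture would exceed for large N.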

Representation (cont.)

Banded Sparse Matrix representation
– Treat a banded matrix as a set of diagonal vectors
– Combine opposing vectors to save space

[Figure: matrix diagonals i and N-i → 2 vectors → 2 2D-textures]

Operations 1

• Vector-Vector Operations
  – Reduced to 2D texture operations
  – Coded in pixel shaders
• Example: Vector1 + Vector2 → Vector3

[Figure: Vector 1 and Vector 2 bound to TexUnit 0 and TexUnit 1; static quad; pass-through vertex shader; pixel shader: return tex0 + tex1; render texture: Vector 3]

Operations 2 (reduce)

Reduce operation for scalar products

• Reduce m x n region in fragment shader

[Figure: original texture → 1st pass → 2nd pass → ...]
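A multi-pass reduction can be sketched in Python as follows. The sketch assumes a square texture whose side is a power of two and a fixed 2 x 2 reduction region per pass; the slide allows a general m x n region, and the names `reduce_pass` and `dot` are hypothetical.

```python
def reduce_pass(tex):
    """One pass: each output texel sums a 2 x 2 block of the input texture."""
    height, width = len(tex), len(tex[0])
    return [
        [
            tex[2 * r][2 * c] + tex[2 * r][2 * c + 1]
            + tex[2 * r + 1][2 * c] + tex[2 * r + 1][2 * c + 1]
            for c in range(width // 2)
        ]
        for r in range(height // 2)
    ]


def dot(x_tex, y_tex):
    """Scalar product of two vectors stored as equally sized 2D textures.

    Returns (value, number_of_reduction_passes).
    """
    # Map pass: texel-wise product of the two textures.
    prod = [[a * b for a, b in zip(rx, ry)] for rx, ry in zip(x_tex, y_tex)]
    passes = 0
    while len(prod) > 1:                    # each pass halves both dimensions
        prod = reduce_pass(prod)
        passes += 1
    return prod[0][0], passes
```

With a 2 x 2 region, a texture of side s needs log2(s) passes; a larger region per pass trades fewer passes for more texture reads per fragment.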


In depth example:

Vector / Banded-Matrix Multiplication

[Figure: matrix A, vectors b and x]

Example (cont.)

Vector / Banded-Matrix Multiplication

[Figure: matrix A, vectors b and x]

Example (cont.)

Compute the result in 2 passes

[Figure: matrix A, vectors b, b' and x]
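A product of a banded matrix and a vector can be formed one diagonal at a time, which a short Python sketch makes concrete. The storage convention (a dictionary from diagonal offset to a full-length vector) and the name `banded_matvec` are assumptions; the sketch does not combine opposing diagonals and does not attempt to reproduce the exact pass structure shown in the figures.

```python
def banded_matvec(diagonals, vec):
    """Multiply a banded matrix by a vector, one stored diagonal at a time.

    diagonals maps an offset k to a length-n list d with d[i] = A[i][i + k];
    entries whose column i + k falls outside the matrix are ignored.
    """
    n = len(vec)
    result = [0.0] * n
    for k, d in diagonals.items():          # one rendering pass per diagonal
        for i in range(n):                  # one fragment per output element
            j = i + k                       # shifted read into the input vector
            if 0 <= j < n:
                result[i] += d[i] * vec[j]
    return result


# Usage: the tridiagonal matrix [[2, -1, 0], [-1, 2, -1], [0, -1, 2]].
tridiagonal = {
    0: [2.0, 2.0, 2.0],
    1: [-1.0, -1.0, 0.0],
    -1: [0.0, -1.0, -1.0],
}
product = banded_matvec(tridiagonal, [1.0, 2.0, 3.0])
```

Each pass is a texel-wise multiply of a diagonal texture with a shifted copy of the input vector, accumulated into the result, so the number of passes grows with the number of stored diagonals rather than with N.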


Moving Forward …

• What works well now?
• What doesn’t work well now?
• What will improve in the future?
• What will continue to be difficult?


What Runs Well on GPUs?

GPUs win when …

Limited data reuse

             Memory BW   Cache BW
P4 3GHz      6 GB/s      44 GB/s
NV GF 6800   36 GB/s     -

• High arithmetic intensity: defined as math operations per memory op
• Attacks the memory wall: are all mem ops necessary?

Common error: Not comparing against optimized CPU implementation
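Arithmetic intensity, as defined above, is a simple ratio, and a rough bandwidth-bound throughput estimate follows from it. The Python sketch below is a back-of-the-envelope model under stated assumptions: 4-byte words, a kernel limited purely by memory bandwidth, and SAXPY (y[i] = a * x[i] + y[i]) as a made-up example that is not taken from the slides.

```python
def arithmetic_intensity(math_ops, memory_ops):
    """Math operations per memory operation (word read or written)."""
    return math_ops / memory_ops


def bandwidth_bound_gflops(intensity, bandwidth_gb_s, bytes_per_word=4):
    """Upper bound on GFLOPS for a kernel limited only by memory bandwidth.

    Assumption: every memory op moves one word of bytes_per_word bytes.
    """
    gwords_per_s = bandwidth_gb_s / bytes_per_word
    return intensity * gwords_per_s


# SAXPY per element: 2 math ops (multiply, add), 3 memory ops (2 reads, 1 write).
saxpy_intensity = arithmetic_intensity(2, 3)

# With the 36 GB/s memory bandwidth quoted for the GeForce 6800:
saxpy_bound = bandwidth_bound_gflops(saxpy_intensity, 36.0)
```

Under these assumptions SAXPY is capped at about 6 GFLOPS regardless of peak compute rate, which is why kernels with low arithmetic intensity gain little from a faster arithmetic unit.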

Arithmetic Intensity

Historical growth rates (per year):

• Compute: 71% • DRAM bandwidth: 25% • DRAM latency: 5%

[Figure: GFLOPS vs. GFloats/sec for R300, R360, R420, showing a 7x gap; courtesy Ian Buck]

Arithmetic Intensity

GPU wins when…

Arithmetic intensity

• Segment: 3.7 ops per word, 11 GFLOPS

[Figure: GeForce 7800 GTX vs. Pentium 4 3.0 GHz; courtesy Ian Buck]

Memory Bandwidth

GPU wins when…

Streaming memory bandwidth


[Figure: GeForce 7800 GTX vs. Pentium 4 3.0 GHz; courtesy Ian Buck]

Memory Bandwidth

Streaming Memory System

• Optimized for sequential performance

GPU cache is limited

• Optimized for texture filtering
• Read-only
• Small

Local storage

• CPU >> GPU

[Figure: GeForce 7800 GTX vs. Pentium 4; courtesy Ian Buck]

What Will (Hopefully) Improve?


• Instruction sets
• Features
• Tools
• Stability
• Interfaces, APIs, libraries, abstractions
  • Necessary as graphics and GPGPU converge!


What Won’t Change?

• Rate of progress
• Precision (64b floating point?)
• Parallelism
  • Won’t sacrifice performance
• Difficulty of programming parallel hardware
  • … but APIs and libraries may help
• Concentration on entertainment apps


Top Ten


• The Killer App
• Programming models and tools
• GPU in tomorrow’s computer?
• Data conditionals
• Relationship to other parallel hw/sw
• Managing rapid change in hw/sw (roadmaps)
• Performance evaluation and cliffs
• Philosophy of faults and lack of precision
• Broader toolbox for computation / data structures
• Wedding graphics and GPGPU techniques