GPGPU CS 446: Real-Time Rendering & Game Technology David Luebke

advertisement

GPGPU

CS 446: Real-Time Rendering

& Game Technology

David Luebke

University of Virginia

Demo

• Today: Matthew Rodgers

• That’s it!

2 David Luebke

News

• New job: NVIDIA Research

• New e-mail: dave@luebke.us

3 David Luebke

Final Topic: GPGPU

• GPGPU: “General Purpose GPU Computing”

• Active, exciting area of research and development

• A personal interest of mine

• Following slides taken from a recent talk I co-presented in Dublin

– Accompanying this paper

4 David Luebke

A Survey of General-Purpose

Computation on Graphics

Hardware

John Owens

University of California, Davis

David Luebke

University of Virginia with Naga Govindaraju, Mark Harris, Jens Kr ü ger, Aaron Lefohn, Tim Purcell

Introduction

• The graphics processing unit ( GPU ) on commodity video cards has evolved into an extremely flexible and powerful processor

• Programmability

• Precision

• Power

• GPGPU : an emerging field seeking to harness

GPUs for general-purpose computation

6

Motivation: Computational Power

• GPUs are fast…

• 3.0 GHz dual-core Pentium4: 24.6 GFLOPS

• NVIDIA GeForceFX 7800: 165 GFLOPs

• 1066 MHz FSB Pentium Extreme Edition : 8.5 GB/s

• ATI Radeon X850 XT Platinum Edition: 37.8 GB/s

• GPUs are getting faster, faster

• CPUs: 1.4

× annual growth

• GPUs: 1.7

× (pixels) to 2.3

× (vertices) annual growth

7

Motivation: Computational Power

Thanks to Ian Buck 8

Motivation: Computational Power

Thanks to John Owens 9

Motivation: Flexible and Precise

• Modern GPUs are deeply programmable

• Programmable pixel, vertex, video engines

• Solidifying high-level language support

• Modern GPUs support high precision

• 32 bit floating point throughout the pipeline

• High enough for many (not all) applications

10

Motivation: The Potential of GPGPU

• In short:

• The power and flexibility of GPUs makes them an attractive platform for general-purpose computation

• Example applications range from in-game physics simulation to conventional computational science

• Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor

11

Problems: Difficult To Use

• GPUs designed for & driven by video games

• Programming model unusual

• Programming idioms tied to computer graphics

• Programming environment tightly constrained

• Underlying architectures are:

• Inherently parallel

• Rapidly evolving (even in basic feature set!)

• Largely secret

• Can’t simply “port” CPU code!

12

Problems: Not A Panacea

• GPUs are fast because they are specialized

• Poorly suited to sequential, “pointer-chasing” code

• Missing support for some basic functionality

• E.g. integers, bitwise operations, indexed write

• More on limitations & difficulties later

13

STAR Goals

• Detailed & useful survey of general-purpose computing on graphics hardware

• Hardware and software developments behind GPGPU

• Building blocks : techniques for mapping generalpurpose computation to the GPU

• Applications : important applications of GPGPU

• A comprehensive GPGPU bibliography (current through Summer 2005…)

14

Architectural Considerations

The past: 1987

20 MIPS CPU

1987

[courtesy Anant Agarwal]

16

The future: 2007

1 Billion Transistors

2007

[courtesy Anant Agarwal]

17

Today’s VLSI Capability: Keys to High Performance

0.5mm

64-bit FPU

(to scale)

50pJ/FLOP

1 clock

90nm Chip

$200

1GHz

12mm

[courtesy Pat Hanrahan]

18

For High Performance, We Must …

0.5mm

64-bit FPU

(to scale)

1. Exploit

50pJ/FLOP

Ample

Computation!

1 clock

90nm Chip

$200

1GHz

12mm

[courtesy Pat Hanrahan]

19

For High Performance, We Must …

0.5mm

64-bit FPU

(to scale)

1. Exploit

50pJ/FLOP

Ample

Computation!

1 clock

90nm Chip

$200

1GHz

2. Require

Efficient

Communication!

12mm

[courtesy Pat Hanrahan]

20

Stream Programming Abstraction

• Let’s think about our problem in a new way

• Goal: SW programming model that matches today’s VLSI

• Streams

• Collection of data records stream kernel

• All data is expressed in streams stream • Kernels

• Inputs/outputs are streams

• Perform computation on streams

• Can be chained together

21

Why Streams?

• Ample computation by exposing parallelism

• Streams expose data parallelism

• Multiple stream elements can be processed in parallel

• Pipeline (task) parallelism

• Multiple tasks can be processed in parallel

• Kernels yield high arithmetic intensity

• Efficient communication

• Producer-consumer locality

• Predictable memory access pattern

• Optimize for throughput of all elements, not latency of one

• Processing many elements at once allows latency hiding

22

GPU: Special-Purpose Graphics Hardware

• Task-parallel organization

• Assign each task to processing unit

• Hardwire each unit to specific task - huge performance advantage!

• Provides ample computation resources

• Efficient communication patterns

• Dominant graphics architecture

[ATI Flipper – 51M T]

23

The Rendering Pipeline

Application

Compute 3D geometry

Make calls to graphics API

Geometry

Transform geometry from 3D to

2D (in parallel)

Rasterization

Composite

GPU

Generate fragments from 2D geometry (in parallel)

Combine fragments into image

24

The Programmable Rendering Pipeline

Application

Compute 3D geometry

Make calls to graphics API

Geometry

(Vertex)

Transform geometry from 3D to

2D; vertex programs

Rasterization

(Fragment)

Composite

GPU

Generate fragments from 2D geometry; fragment programs

Combine fragments into image

25

NVIDIA GeForce 6800 3D Pipeline

Vertex

Triangle Setup

Shader Instruction Dispatch Z-Cull

Fragment

L2 Tex

Fragment Crossbar

Composite

Memory

Partition

Memory

Partition

Memory

Partition

Memory

Partition

Courtesy Nick Triantos, NVIDIA

26

Characteristics of Graphics Apps

• Lots of arithmetic

• Lots of parallelism

• Simple control

• Multiple stages

• Feed forward pipelines

• Latency-tolerant / deep pipelines

Application

Command

Geometry

Rasterization

Fragment

• What other applications have these characteristics?

Composite

Display

27

Today’s Graphics Pipeline

Application

Command

Geometry

Rasterization

Fragment

Composite

Display

• Graphics is well suited to:

• The stream programming model

• Stream hardware organizations

• GPUs are a commodity stream processor!

• What if we could apply these techniques to more generalpurpose problems?

• GPUs should excel at tasks that require:

• Ample computation

• Regular computation

• Efficient communication

28

Programming a GPU for Graphics

• Application specifies geometry  rasterized

• Each fragment is shaded w/

SIMD program

• Shading can use values from texture memory

• Image can be used as texture on future passes

29

Programming a GPU for GP Programs

• Draw a screen-sized quad  stream

• Run a SIMD kernel over each fragment

• “Gather” is permitted from texture memory

• Resulting buffer can be treated as texture on next pass

30

Feedback

• Each algorithm step depend on the results of previous steps

• Each time step depends on the results of the previous time step

31

CPU-GPU Analogies

GPU

.

.

Grid[i][j]= x;

.

.

.

Array Write = Render to Texture

32

CPU-GPU Analogies

CPU GPU

Stream / Data Array = Texture

Memory Read = Texture Sample

33

Kernels

CPU GPU

Kernel / loop body / algorithm step = Fragment Program

34

Scatter vs. Gather

• Grid communication

• Grid cells share information

35

Computational Resources Inventory

• Programmable parallel processors

• Vertex & Fragment pipelines

• Rasterizer

• Mostly useful for interpolating addresses (texture coordinates) and per-vertex constants

• Texture unit

• Read-only memory interface

• Render to texture

• Write-only memory interface

36

Vertex Processor

• Fully programmable (SIMD / MIMD)

• Processes 4-vectors (RGBA / XYZW)

• Capable of scatter but not gather

• Can change the location of current vertex

• Cannot read info from other vertices

• Can only read a small constant memory

• Latest GPUs: Vertex Texture Fetch

• Random access memory for vertices

•  Gather (But not from the vertex stream itself)

37

Fragment Processor

• Fully programmable (SIMD)

• Processes 4-component vectors (RGBA / XYZW)

• Random access memory read (textures)

• Capable of gather but not scatter

• RAM read (texture fetch), but no RAM write

• Output address fixed to a specific pixel

• Paper details ways to synthesize scatter

• Typically more useful than vertex processor

• More fragment pipelines than vertex pipelines

• Direct output (fragment processor is at end of pipeline)

38

Building Blocks & Applications

GPGPU Building Blocks

• The STAR covers the following fundamental techniques & computational building blocks:

• Flow control (a very fundamental building block)

• Stream operations

• Data structures

• Differential equations & linear algebra

• Data queries

• I’ll discuss each in a bit more detail

40

Flow control

• Surprising number of issues on GPUs

• Main themes:

• Avoid branching when possible

• Move branching earlier in the pipeline when possible

• Largely SIMD  coherent branching most efficient

• Mechanisms:

• Rasterized geometry

• Z-cull

• Occlusion query

41

Domain Decomposition

• Avoid branches where outcome is fixed

• One region is always true, another false

• Separate FPs for each region, no branches

• Example: boundaries

42

Z-Cull

• In early pass, modify depth buffer

• Write depth=0 for pixels that should not be modified by later passes

• Write depth=1 for rest

• Subsequent passes

• Enable depth test (GL_LESS)

• Draw full-screen quad at z=0.5

• Only pixels with previous depth=1 will be processed

• Can also use early stencil test

• Note: Depth replace disables ZCull

43

Pre-computation

• Pre-compute anything that will not change every iteration!

• Example: arbitrary boundaries

• When user draws boundaries, compute texture containing boundary info for cells

• Reuse that texture until boundaries modified

• Combine with Z-cull for higher performance!

44

Stream Operations

• Several stream operations in GPGPU toolkit:

• Map : apply a function to every element in a stream

• Reduce : use a function to reduce a stream to a smaller stream (often 1 element)

• Scatter/gather : indirect read and write

• Filter : select a subset of elements in a stream

• Sort : order elements in a stream

• Search : find a given element, nearest neighbors, etc

45

Data Structures

• GPU memory model, iteration, virtualization

• Dense arrays (== textures)

• Sparse arrays

• Static sparsity: pack rows or bands into textures, vertex arrays

• Dynamic sparsity: multidimensional page tables

46

Data Structures

• Adaptive data structures

• Static: packed uniform grid, stackless k-d tree

• Dynamic: mipmap of page tables, tree-based address translators

• Non-indexable structures

• Open & important problem: stacks, priority queues, sets, linked lists, hash tables, k-d tree construction…

47

Differential Equations

• Early & common application of GPGPU

• Ordinary differential equations (ODEs): commonly used in particle systems

• Partial differential equations (PDEs): well suited to

GPU, especially when solved on dense grids

• Closely related to…

48

Linear Algebra

• Major application of GPGPU

• Ubiquitous in science, engineering, visual simulation

• Several approaches explored in early GPGPU work

• Memory layout a key consideration

• Pack vectors into 2D textures

• Split matrices, e.g. by columns (dense), bands (banded), into vertex array (sparse)

49

Data Queries

• Relational database operations on the GPU

• Predicates, Boolean combinations, aggregations

• Join operations accelerated by sorting on join key

• Uses special-purpose depth & stencil hardware extensively

• Attributes of data record stored in (multiple) color channels of (multiple) textures

50

Applications

• STAR discusses the following broad GPGPU application areas:

• Physically-based simulation

• Signal & image processing

• Global illumination

• Geometric computing

• Databases & data mining

51

Physically-Based Simulation

• Early work (pre-fully programmable GPU):

• Cellular automata, texture & blending modes

• Now

• Finite difference & finite element methods

• Particle system simulations

• Most popular topic

• Fluids! Incompressible flow, clouds, boiling, etc.

• Also-ran: mass-spring dynamics for cloth

52

Signal & Image Processing

• Segmentation: identify features in 2D or 3D

• Common example: identify surface of 3D feature

(tumor, blood vessel, etc) in medical scan

• Level-set deformation methods evolve an isosurface

• Need sparse solution methods for efficient solution

• GPGPU bonus: easy to integrate in volume renderer

• Computer vision:

• Image projection, compositing, rectification

• Fast stereo depth extraction

53

Signal & Image Processing cont.

• Image processing

• Image registration, motion reconstruction, computed tomography (CT), tone mapping

• Core Image, Core Video

• Signal processing

• FFT: an interesting case! Memory bandwidth limited, lack of writable cache harms performance

54

Global Illumination

• Ray tracing

• Seminal GPGPU papers

• Close to the heart of graphics

• Earliest complex data structures, control flow

• Analysis to inform future hardware design

• Comparison to current efficient CPU implementation

• Key insights

• Ray-triangle intersection maps well to pixel hardware

• Rest of the ray tracing pipeline can also be expressed as a stream computation

55

Global Illumination cont.

• Photon mapping

• Even more involved data structures

• Introduced stencil routing (scatter), k-nearest neighbor search

• Radiosity

• Iterative LA solvers

• Subsurface scattering

• Lots of work on using GPUs to accelerate/ approximate scattering algorithms

56

Geometric Computing

• Lots of image-space geometric computations

• CSG operations

• Distance fields

• Collision detection

• Sorting for transparency

• Shadow generation

• Heavy use of stencil & depth hardware

57

Databases & Data Mining

• GPU strengths are useful

• Memory bandwidth

• Parallel processing

• Accelerating SQL queries – 10x improvement

• Also well suited for stream mining

• Continuous queries on streaming data instead of one-time queries on static database

58

Close-Up: Linear Algebra

Linear Algebra Data Structures

Vector representation

– Textures best we can do

• Per-pixel vs. per-vertex operations

– 6 gigapixels/second vs. 0.7 gigavertices/second

– Efficient texture memory cache

– Texture read-write access

1

1 N

N

60

For details go: http://wwwcg.in.tum.de

Representation (cont.)

Dense Matrix representation

– Treat a dense matrix as a set of column vectors

– Again, store these vectors as 2D textures

Matrix N Vectors i

N

N

...

...

1

N 2D-Textures

...

i

...

N

1 i N N

61

For details go: http://wwwcg.in.tum.de

Representation (cont.)

Banded Sparse Matrix representation

– Treat a banded matrix as a set of diagonal vectors

– Combine opposing vectors to save space

2 Vectors i

Matrix

1

2 2D-Textures

2

N

N-i

N

1 2

N

62

For details go: http://wwwcg.in.tum.de

Operations 1

• Vector-Vector Operations

– Reduced to 2D texture operations

– Coded in pixel shaders

• Example: Vector1 + Vector2

Vector3

Vector 3 Vector 1 Vector 2

Static quad

Pass through

Vertex Shader

TexUnit 0 return tex0 + tex1

TexUnit 1

Render Texture

Pixel Shader

63

For details go: http://wwwcg.in.tum.de

Operations 2 (reduce)

Reduce operation for scalar products original Texture st

1 pass

...

nd

2 pass

...

...

...

Reduce m x n region in fragment shader

64

For details go: http://wwwcg.in.tum.de

Operations

In depth example: Vector / Banded-Matrix Multiplication

A b x

=

65

For details go: http://wwwcg.in.tum.de

Example (cont.)

Vector / Banded-Matrix Multiplication

A

A b x b

=

66

For details go: http://wwwcg.in.tum.de

Example (cont.)

Compute the result in 2 Passes:

A b b‘ x

=

67

For details go: http://wwwcg.in.tum.de

Conclusions

Moving Forward …

• What works well now?

• What doesn’t work well now?

• What will improve in the future?

• What will continue to be difficult?

69

What Runs Well on GPUs?

• GPUs win when …

• Limited data reuse

P4 3GHz

Memory BW

6 GB/s

Cache BW

44 GB/s

NV GF 6800 36 GB/s --

High arithmetic intensity: Defined as math operations per memory op

• Attacks the memory wall - are all mem ops necessary?

Common error: Not comparing against optimized

CPU implementation

70

Arithmetic Intensity

Historical growth rates (per year):

• Compute: 71%

• DRAM bandwidth: 25%

• DRAM latency: 5%

R300 R360

GFLOPS

7x Gap

GFloats/sec

R420

[courtesy Ian Buck] 71

Arithmetic Intensity

GeForce 7800 GTX

Pentium 4 3.0 GHz

GPU wins when…

• Arithmetic intensity

 Segment

3.7 ops per word

11 GFLOPS

[courtesy Ian Buck] 72

Memory Bandwidth

GPU wins when…

• Streaming memory bandwidth

 SAXPY

 FFT

GeForce 7800 GTX

Pentium 4 3.0 GHz

[courtesy Ian Buck] 73

Memory Bandwidth

GeForce 7800 GTX Pentium 4

• Streaming Memory System

• Optimized for sequential performance

• GPU cache is limited

• Optimized for texture filtering

• Read-only

• Small

• Local storage

• CPU >> GPU

[courtesy Ian Buck] 74

What Will (Hopefully) Improve?

• Orthogonality

• Instruction sets

• Features

• Tools

• Stability

• Interfaces, APIs, libraries, abstractions

• Necessary as graphics and GPGPU converge!

75

What Won’t Change?

• Rate of progress

• Precision (64b floating point?)

• Parallelism

• Won’t sacrifice performance

• Difficulty of programming parallel hardware

• … but APIs and libraries may help

• Concentration on entertainment apps

76

Top Ten

GPGPU Top Ten

• The Killer App

• Programming models and tools

• GPU in tomorrow’s computer?

• Data conditionals

• Relationship to other parallel hw/sw

• Managing rapid change in hw/sw (roadmaps)

• Performance evaluation and cliffs

• Philosophy of faults and lack of precision

• Broader toolbox for computation / data structures

• Wedding graphics and

GPGPU techniques

78

Download