Programmable Graphics Hardware CS 446: Real-Time Rendering & Game Technology

David Luebke

University of Virginia

Recap: Advanced Texturing

• Billboards

– Screen-aligned, world-aligned

• Point sprites

• Imposters

– Trees, buildings, portal textures, billboard clouds

– Dynamic imposters for “caching” rendering results

• Depth textures

• Multitexturing

– Low-res light maps, hi-res decals, etc

Real-Time Rendering 2 David Luebke

Textures: Other Important Stuff

• Render to texture – framebuffer objects (FBOs)

– Multiple render targets

• Environment maps

– Sphere map, cube maps (hardware supported)

• Shadow maps

– A depth texture rendered from light source (more later)

• Relief textures

– Demo now, details later

Textures: Still More Stuff

• Normal maps – especially for bump mapping

– Gloss maps, reflectance maps, etc

• Generally:

– Think of textures as global memory for fragment programs, with built-in filtering

– Just starting to be able to access textures in vertex programs too (NVIDIA hardware only, today)

• Deferred shading

• Projective texture mapping

Next topic: Cg

• Many of the techniques we discuss in this class do not depend on programmable graphics hardware

– But even those are often easier to implement on programmable hardware!

• And programmable graphics opens up an endless number of tricks and techniques that could not have been efficiently implemented before

• So, the next topic is a brief intro to Cg

– My apologies to those of you who’ve seen this

– My apologies to those of you who haven’t

Acknowledgement & Aside

• Much of this lecture comes from Bill Mark’s SIGGRAPH 2002 course talk on NVIDIA’s programmable graphics technology

• For this reason, and because the lab is outfitted with NVIDIA cards, we will focus on NVIDIA tech

• I try to mention similarities and differences with ATI, the other main GPU vendor, in lecture and slides

• Note: many/most images are from NVIDIA as well

The Graphics Pipeline

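[Figure: a simplified graphics pipeline. Application (CPU) → Transform & Light → Assemble Primitives → Rasterize → Shade → Video Memory (Textures), with the stages after Application on the GPU. Graphics State and Render-to-texture are also shown.]

• A simplified graphics pipeline

– Note that pipe widths vary

– Many caches, FIFOs, and so on not shown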

GPU Pipeline: Transform

• Transform & light (a.k.a. vertex processor)

– Transform from “world space” to “image space”

– Compute per-vertex lighting
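
The two jobs of this stage can be sketched on the CPU. This is a minimal Python illustration, not GPU code: the function names, the row-major matrix layout, and the single directional light are assumptions made for the example, and the normal is assumed to already be in the same space as the light direction.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def transform_and_light(position, normal, matrix, light_dir):
    """Return (transformed homogeneous position, diffuse intensity) for one vertex."""
    out_pos = mat_vec(matrix, position + [1.0])        # transform
    n, l = normalize(normal), normalize(light_dir)
    diffuse = max(0.0, sum(a * b for a, b in zip(n, l)))  # Lambertian N.L term
    return out_pos, diffuse

# Hypothetical input: a translation by (2, 0, 0) stands in for the full transform.
translate_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
out_pos, diffuse = transform_and_light([1.0, 0.0, 0.0], [0.0, 0.0, 1.0],
                                       translate_x, [0.0, 0.0, 2.0])
```

The same function runs once per vertex with no dependence on other vertices, which is what lets the hardware process many vertices in parallel.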

Courtesy Mark Harris

GPU Pipeline: Rasterize

• Rasterizer

– Convert geometric rep. (vertex) to image rep. (fragment)

• Fragment = image fragment

– Pixel + associated data: color, depth, stencil, etc.

– Interpolate per-vertex quantities across pixels
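
A rough Python sketch of what "interpolate per-vertex quantities across pixels" means. It uses barycentric weights from edge functions, which is one common formulation and an assumption here; real rasterizers also apply perspective correction and fill rules that are left out.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p) in 2D."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, c0, c1, c2, width, height):
    """Return a list of (x, y, color) fragments for covered pixel centers."""
    area = edge(v0, v1, v2)
    fragments = []
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)                 # sample at the pixel center
            w0 = edge(v1, v2, p) / area            # barycentric weights
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:    # inside (or on an edge)
                color = tuple(w0 * a + w1 * b + w2 * c
                              for a, b, c in zip(c0, c1, c2))
                fragments.append((x, y, color))
    return fragments

# Hypothetical triangle with red, green, and blue vertex colors.
fragments = rasterize((0, 0), (4, 0), (0, 4),
                      (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 4, 4)
```

Each fragment carries its pixel position plus the interpolated data, matching the "pixel + associated data" description above.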

Courtesy Mark Harris

GPU Pipeline: Shade

• Fragment processors (multiple in parallel)

– Compute a color for each pixel

– Optionally read colors from textures (images)

The Modern Graphics Pipeline

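[Figure: the same pipeline with programmable stages. Application (CPU) → Vertex Processor → Assemble Primitives → Rasterize → Fragment Processor → Video Memory (Textures) on the GPU; Graphics State and Render-to-texture are also shown.]

• Programmable vertex processor!

• Programmable pixel processor!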

The Coming Soon Graphics Pipeline

[Figure: Application (CPU) → Vertex Processor → Rasterize → Fragment Processor → Video Memory (Textures) on the GPU; Graphics State and Render-to-texture are also shown.]

• Programmable primitive assembly!

• More flexible memory access!

Precision

• 32-bit IEEE floating-point throughout pipeline

– Framebuffer

– Textures

– Fragment processor

– Vertex processor

– Interpolants

Multiple data types in hardware

• Can support 32-bit IEEE floating point throughout pipeline

– Vertices, interpolants, framebuffer, textures, computations

• Fragment processor also supports:

– 16-bit “half” floating point, 12-bit fixed point

– These may be faster than 32-bit

• Framebuffer/textures also support:

– Large variety of fixed-point formats

• E.g., classical 8-bit per component RGBA, BGRA, etc.

– These formats use less memory bandwidth than FP32
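
To make the precision and bandwidth trade-off concrete, here is a small Python sketch. It relies on the `struct` module's IEEE half (`'e'`) and single (`'f'`) formats; the helper names are invented for the example, and the 12-bit fixed-point format is not modeled.

```python
import struct

def to_half(x):
    """Round a value to 16-bit half precision (FP16) and back."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_single(x):
    """Round a value to 32-bit single precision (FP32) and back."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Rounding error when storing 0.1 in each format.
half_error = abs(to_half(0.1) - 0.1)
single_error = abs(to_single(0.1) - 0.1)

# Bytes per RGBA texel for three storage formats.
BYTES_RGBA8 = 4 * 1
BYTES_RGBA_FP16 = 4 * 2
BYTES_RGBA_FP32 = 4 * 4

def texture_bytes(width, height, bytes_per_texel):
    """Memory footprint of one texture level."""
    return width * height * bytes_per_texel
```

A 1024 x 1024 RGBA texture takes 4 MB at 8 bits per component and 16 MB at FP32, so every full read or write moves four times as much data.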

Vertex processor capabilities

• 4-vector FP32 operations

• Condition codes + true data-dependent control flow

– Conditional branches, subroutine calls, jump table

– Useful for avoiding extra work, e.g.:

• Don’t do animation, skinning if vertex will be clipped

• Do displacement mapping only for vertices near silhouette

• Transcendental arithmetic instructions (e.g. COS)

• User clip-plane support

• Texture reads (up to 4 textures, unlimited lookups)

Vertex processor limitations

• No arbitrary memory write

• No “vertex kill”

– Can put vertex off-screen

– Can make degenerate primitives

• Only 32-bit texture formats supported

NV40-G70 vertex processor resources

• 65535 instructions per program

• Other statistics (NV30, not sure about NV40-G70):

– 16 temporary 4-vector registers

– 256 “uniform” parameter registers

– 2 address registers (4-vector)

– 6 clip-distance outputs

Fragment processor: texture mapping

• Texture reads are just another instruction

• Allows computed texture coordinates, nested to arbitrary depth

– This is a big difference between NVIDIA and ATI right now

• Allows multiple uses of a single texture unit

• Optional LOD control – can specify filter extent

• Think of it as a memory-read instruction, with optional user-controlled filtering
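
A computed (dependent) texture read can be sketched in Python as one lookup whose result supplies the coordinates for the next. The tiny textures, their contents, and the nearest-neighbor sampler are all made up for this example.

```python
def tex2d_nearest(texture, u, v):
    """Nearest-neighbor lookup with clamp-to-edge; u and v are in [0, 1]."""
    height, width = len(texture), len(texture[0])
    x = min(width - 1, max(0, int(u * width)))
    y = min(height - 1, max(0, int(v * height)))
    return texture[y][x]

# Hypothetical 2x2 textures: one stores coordinates, the other stores colors.
offset_map = [[(0.75, 0.25), (0.25, 0.75)],
              [(0.25, 0.25), (0.75, 0.75)]]
color_map = [["red", "green"],
             ["blue", "white"]]

def dependent_lookup(u, v):
    """Read coordinates from one texture, then use them to read another."""
    du, dv = tex2d_nearest(offset_map, u, v)   # first read
    return tex2d_nearest(color_map, du, dv)    # second read uses the computed coords
```

Nesting to greater depth just repeats the pattern: each read is an ordinary instruction whose result can feed the next one.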

Fragment processor capabilities

• Dynamic branching

• Conditional fragment-kill instruction

• Read access to window-space position

• Read/write access to fragment Z (but not stencil)

• Multiple render targets

• Built-in derivative instructions

– Partial derivatives w.r.t. screen-space x or y

– Useful for anti-aliasing shaders

• FP32, FP16, and fixed-point data
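
The derivative instructions can be pictured as finite differences between neighboring pixels. The Python sketch below assumes differencing within a 2x2 pixel quad, which is one common hardware approach; the exact scheme is an assumption, and `u_coord` is a made-up quantity.

```python
def quad_derivatives(f, x, y):
    """Approximate (df/dx, df/dy) at pixel (x, y) by differencing inside its 2x2 quad."""
    x0 = x - (x % 2)                    # left column of the quad containing x
    y0 = y - (y % 2)                    # top row of the quad containing y
    ddx = f(x0 + 1, y) - f(x0, y)       # horizontal neighbor difference
    ddy = f(x, y0 + 1) - f(x, y0)       # vertical neighbor difference
    return ddx, ddy

def u_coord(x, y):
    """Hypothetical texture coordinate that varies linearly over the screen."""
    return 0.25 * x + 0.5 * y

ddx, ddy = quad_derivatives(u_coord, 3, 2)
# A shader could use the larger rate of change as a filter width for anti-aliasing.
filter_width = max(abs(ddx), abs(ddy))
```

A procedural shader can widen its filter or fade out detail when `filter_width` approaches the feature size, which is the anti-aliasing use mentioned above.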

Fragment processor limitations

• Dynamic branching is less efficient than in the vertex processor

– Especially for non-coherent branching (coherent regions smaller than roughly 30x30 pixels)

– Can do a lot with condition codes

• No indexed reads from registers

– I.e., no indexed arrays

– Must use texture reads instead

• No arbitrary memory write
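
The "use texture reads instead" workaround can be sketched as follows: the table lives in a 1D texture and the index is converted to a texel-center coordinate. This is a Python emulation with invented names, not shader code.

```python
# A table that a CPU program would simply index as weights[i].
weights = [0.1, 0.2, 0.4, 0.2, 0.1]

def tex1d_nearest(texture, u):
    """Nearest-neighbor 1D lookup with clamp-to-edge; u is in [0, 1]."""
    n = len(texture)
    return texture[min(n - 1, max(0, int(u * n)))]

def table_lookup(i):
    """Emulate weights[i] with a texture read at the center of texel i."""
    u = (i + 0.5) / len(weights)
    return tex1d_nearest(weights, u)
```

Sampling at the texel center (the `+ 0.5`) keeps the lookup from landing on a boundary between two entries.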

Fragment processor resources

• 65535+ instructions

• Nearly unlimited constants

– Each constant counts as one instruction

• 16 texture units (NV30, still?), reuse as often as desired

• 10 FP32 x 4 perspective-correct inputs (e.g. tex coords)

• Up to 4 128-bit framebuffer “color” outputs

– Can pack as 4 x FP32, 8 x FP16, etc.

• Can also set the depth output

– 24 or 32 bits, depending on stencil

– Changing depth in fragment program may disable Z-optimizations
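
The packing arithmetic for one 128-bit color output can be checked with a short Python sketch. The byte order and helper names are assumptions for illustration; the point is only that both layouts fill the same 16 bytes.

```python
import struct

def pack_4xfp32(values):
    """Pack four values as 32-bit floats: 4 x 32 = 128 bits."""
    if len(values) != 4:
        raise ValueError("expected 4 values")
    return struct.pack('<4f', *values)

def pack_8xfp16(values):
    """Pack eight values as 16-bit half floats: 8 x 16 = 128 bits."""
    if len(values) != 8:
        raise ValueError("expected 8 values")
    return struct.pack('<8e', *values)

wide = pack_4xfp32([1.0, 0.5, 0.25, 0.0])
narrow = pack_8xfp16([0.5] * 8)
```

With four such outputs, a single fragment can write up to 64 bytes, traded between fewer high-precision values and more low-precision ones.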

GPU vendor differences

• Note: this slide will be dated almost instantly

• NVIDIA: as described in previous slides

• ATI hardware today (X1900 XT is the current high-end part):

– No vertex texture fetch (but good render-to-vertex-array)

– Far fewer levels of computed texture coordinates

– Better at fine-grained (less coherent) dynamic branching

• ATI Xenos (Xbox 360 chip):

– Unified shader model: vertex proc == pixel proc

– Scatter support: shaders can write to arbitrary memory locations

Cg – “C for Graphics”

• Cg is a high-level GPU programming language

• Designed by NVIDIA and Microsoft

• Competes with the (quite similar) GL Shading Language, a.k.a. GLslang

Programming in assembly is painful

Assembly

FRC R2.y, C11.w;

ADD R3.x, C11.w, -R2.y;

MOV H4.y, R2.y;

ADD H4.x, -H4.y, C4.w;

MUL R3.xy, R3.xyww, C11.xyww;

ADD R3.xy, R3.xyww, C11.z;

TEX H5, R3, TEX2, 2D;

ADD R3.x, R3.x, C11.x;

TEX H6, R3, TEX2, 2D;

Cg

L2weight = timeval - floor(timeval);
L1weight = 1.0 - L2weight;
ocoord1 = floor(timeval)/64.0 + 1.0/128.0;
ocoord2 = ocoord1 + 1.0/64.0;
L1offset = f2tex2D(tex2, float2(ocoord1, 1.0/128.0));
L2offset = f2tex2D(tex2, float2(ocoord2, 1.0/128.0));

• Easier to read and modify

• Cross-platform

• Combine pieces

• etc.
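
The arithmetic in the Cg version is easy to check by hand, which is part of the argument for a high-level language. Below is a Python transcription of its first four statements; the function name is invented, and reading the constants as texel centers in a 64-texel-wide table is an interpretation, not something the slide states.

```python
import math

def blend_weights_and_coords(timeval):
    """Return (L1weight, L2weight, ocoord1, ocoord2) as computed in the Cg snippet."""
    l2_weight = timeval - math.floor(timeval)             # fractional part of time
    l1_weight = 1.0 - l2_weight                           # complementary blend weight
    ocoord1 = math.floor(timeval) / 64.0 + 1.0 / 128.0    # plausibly the center of a texel
    ocoord2 = ocoord1 + 1.0 / 64.0                        # the next texel over
    return l1_weight, l2_weight, ocoord1, ocoord2

l1_weight, l2_weight, ocoord1, ocoord2 = blend_weights_and_coords(5.25)
```

For `timeval = 5.25` the two weights are 0.75 and 0.25, and the two coordinates are exactly 1/64 apart.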

Some points in the design space

• CPU languages

– C – close to the hardware; general purpose

– C++, Java, Lisp – require memory management

– RenderMan – specialized for shading

• Real-time shading languages

– Stanford shading language

– Creative Labs shading language

Design strategy

• Start with C (and a bit of C++)

– Minimizes number of decisions

– Gives you known mistakes instead of unknown ones

• Allow subsetting of the language

• Add features desired for GPUs

– To support GPU programming model

– To enable high performance

• Tweak to make it fit together well

How are GPUs different from CPUs?

1. GPU is a stream processor

– Multiple programmable processing units

– Connected by data flows

[Figure: Application → Vertex Processor → Fragment Processor, with Textures]

Cg separates vertex & fragment programs

[Figure: Application → Vertex Processor → Fragment Processor, with Textures; a separate Program box is attached to each processor]

Cg programs have two kinds of inputs

• Varying inputs (streaming data)

– e.g. normal vector – comes with each vertex

– This is the default kind of input

• Uniform inputs (a.k.a. graphics state)

– e.g. modelview matrix

• Note: Outputs are always varying

vout MyVertexProgram(float4 normal, uniform float4x4 modelview) {
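
The varying/uniform split can be mimicked on the CPU: the program is called once per stream element, and the uniforms are passed unchanged to every call. This Python sketch is only an analogy. `run_vertex_program` and the body of `my_vertex_program` are invented for illustration and are not the contents of the Cg program above.

```python
def run_vertex_program(program, varying_stream, **uniforms):
    """Call `program` once per vertex; the uniforms are identical for every call."""
    return [program(varying, **uniforms) for varying in varying_stream]

def my_vertex_program(normal, modelview):
    """Hypothetical body: multiply a 4-vector by a 4x4 row-major matrix."""
    return tuple(sum(modelview[r][c] * normal[c] for c in range(4))
                 for r in range(4))

# Varying input: one normal per vertex.
normals = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
# Uniform input: one matrix shared by all vertices (a made-up uniform scale).
uniform_scale = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 1]]

outputs = run_vertex_program(my_vertex_program, normals, modelview=uniform_scale)
```

There is one output per input vertex, which is why outputs are always varying.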

Binding VP outputs to FP inputs

a) Let compiler do it

– Define a single structure

– Use it for vertex-program output

– Use it for fragment-program input

struct vout {
    float4 color;
    float4 texcoord;
};

Binding VP outputs to FP inputs

b) Do it yourself

– Specify register bindings for VP outputs

– Specify register bindings for FP inputs

– May introduce HW dependence

– Necessary for mixing Cg with assembly

struct vout {
    float4 color : TEX3;
    float4 texcoord : TEX5;
};

Some inputs and outputs are special

• E.g. the position output from vert prog

– This output drives the rasterizer

– It must be marked

struct vout {
    float4 color;
    float4 texcoord;
    float4 position : HPOS;
};
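
Why the position output is special: the fixed-function hardware after the vertex program needs it to decide which pixels a primitive covers. The Python sketch below shows the usual perspective divide and viewport mapping under OpenGL-style conventions (normalized coordinates in [-1, 1], depth mapped to [0, 1]); those conventions and the function name are assumptions for the example.

```python
def to_window(hpos, width, height):
    """Map a homogeneous clip-space position to window-space (x, y, depth)."""
    x, y, z, w = hpos
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w    # perspective divide
    win_x = (ndc_x * 0.5 + 0.5) * width          # [-1, 1] -> [0, width]
    win_y = (ndc_y * 0.5 + 0.5) * height         # [-1, 1] -> [0, height]
    depth = ndc_z * 0.5 + 0.5                    # [-1, 1] -> [0, 1]
    return win_x, win_y, depth

center = to_window((0.0, 0.0, 0.0, 1.0), 640, 480)
corner = to_window((2.0, -2.0, 0.0, 2.0), 640, 480)
```

Other outputs such as color and texture coordinates are simply interpolated; only the position determines where the fragments land.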
