Graphics Hardware
Kurt Akeley
CS248 Lecture 14
8 November 2007
http://graphics.stanford.edu/courses/cs248-07/
Implementation = abstraction (from lecture 2)
[Figure: block diagram of the NVIDIA GeForce 8800 (Data Assembler; vertex, primitive, and fragment thread issue units; Setup/Rstr/ZCull; an array of streaming processors (SP) with texture fetch (TF) units and L1 caches; a Thread Processor; L2 caches and framebuffer (FB) partitions), shown side by side with the OpenGL pipeline: application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer. Source: NVIDIA]
Correspondence (by color)
[Figure: the same two diagrams, color-coded to show the correspondence: the application-programmable parallel processor (the SP array and Thread Processor) implements the vertex, primitive, and fragment operation stages; the fixed-function assembly processors (Data Assembler, thread issue units, Setup/Rstr/ZCull) implement vertex assembly, primitive assembly, and rasterization (fragment assembly); the fixed-function L2/FB partitions implement framebuffer operations. An annotation ("this was missing") marks a correspondence absent from the original diagram. Source: NVIDIA]
Why does graphics hardware exist?
Special-purpose hardware tends to disappear over time:
- Lisp machines and CAD workstations of the 80s
- CISC CPUs

[Images: Intel iAPX432 (circa 1982), www.dvorak.org/blog/; Symbolics Lisp Machines (circa 1984), www.abstractscience.freeserve.co.uk/symbolics/photos/]
Why does graphics hardware exist?
Graphics acceleration has been around for 40 years.
Why do GPUs remain? Confluence of four things:
- Performance differentiation
  - GPUs are much faster than CPUs at 3-D rendering tasks
- Work-load sufficiency
  - The accelerated 3-D rendering tasks make up a significant portion of the overall processing (thus Amdahl's law doesn't limit the resulting performance increase)
- Strong market demand
  - Customer demand for 3-D graphics performance is strong
  - Driven by the games market
- Ubiquity
  - With the help of standardized APIs/architectures (OpenGL and Direct3D), GPUs have achieved ubiquity in the PC market
  - Inertia now works in favor of continued graphics hardware
NVIDIA 8800 Ultra
Stream processors                  128
Peak floating-point performance    400+ GFLOPS
Memory                             768 MB
Memory bandwidth                   103.7 GB/sec
Triangle rate (vertex rate)        306 million/sec (est)
Texture fill rate (fragment rate)  39.2 billion/sec
NVIDIA performance trends
Year   Product             Tri rate   CAGR   Tex rate   CAGR
1998   Riva ZX             3m         -      100m       -
1999   Riva TNT2           9m         3.0    350m       3.5
2000   GeForce2 GTS        25m        2.8    664m       1.9
2001   GeForce3            30m        1.2    800m       1.2
2002   GeForce Ti 4600     60m        2.0    1200m      1.5
2003   GeForce FX          167m       2.8    2000m      1.7
2004   GeForce 6800 Ultra  170m       1.0    6800m      2.7
2005   GeForce 7800 GTX    215m       1.2    6800m      1.0
2006   GeForce 7900 GTX    260m       1.3    15600m     2.3
2007   GeForce 8800 Ultra  306m       1.2    39200m     2.5
Overall CAGR:                         1.7               1.9

Yearly growth is well above 1.5 (Moore's Law)
SGI performance trends (depth buffered)
Year   Product          ZTri rate   CAGR   Zbuf rate   CAGR
1984   Iris 2000        1k          -      100k        -
1988   GTX              135k        3.6    40m         4.5
1992   RealityEngine    2m          2.0    380m        1.8
1996   InfiniteReality  12m         1.6    1000m       1.3
Overall CAGR:                       2.2                2.2

Yearly growth well above 1.5 (Moore's Law)
CPU performance CAGR has been slowing
[Figure: CPU performance growth over time, with the compound annual growth rate slowing in recent years. Source: Hennessy and Patterson]
The situation could change …
CPUs are becoming much more parallel
- CPU performance increase (1.2x to 1.5x per year) is low compared with the GPU increase (1.7x to 2x per year)
- This could change now with CPU parallelism (many-core)
The vertex pipeline architecture is getting old
- Approaches such as ray tracing offer many advantages, but the vertex pipeline is poorly optimized for them
- The work-load argument is somewhat circular, because the brute-force algorithms employed by GPUs inflate their own performance demands
GPUs have evolved and will continue to evolve
- But a revolution is always possible
Outline
The rest of this lecture is organized around the four ideas that most informed the design of modern GPUs (as enumerated by David Blythe in this lecture's reading assignment):
- Parallelism
- Coherence
- Latency
- Programmability
I'll continue to use the NVIDIA 8800 as a specific example.
Parallelism
Graphics is “embarrassingly parallel”
Many separate tasks (the types I keep talking about), with no "horizontal" dependencies and few "vertical" ones (in-order execution). Each stage of the pipeline (application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer, display) operates on its own work-element type:

struct {
    float x,y,z,w;
    float r,g,b,a;
} vertex;

struct {
    vertex v0,v1,v2;
} triangle;

struct {
    short int x,y;
    float depth;
    float r,g,b,a;
} fragment;

struct {
    int depth;
    byte r,g,b,a;
} pixel;
Data and task parallelism
Data parallelism
- Simultaneously doing the same thing to similar data
- E.g., transforming vertexes (see the sketch below)
- Some variance in "same thing" is possible
Task parallelism
- Simultaneously doing different things
- E.g., the tasks (stages) of the vertex pipeline

[Figure: the OpenGL pipeline, with data parallelism within a stage (e.g., vertex operations) and task parallelism across the stages]
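To make the contrast concrete, here is a minimal C sketch of the data-parallel case (the type and function names are mine, not the lecture's). Task parallelism, by contrast, is the pipeline itself: stages like those above running concurrently on different work elements.

#include <stddef.h>

/* Hypothetical example: the same operation applied independently to many
 * similar elements. The vertex type echoes the structs shown earlier. */
typedef struct { float x, y, z, w; float r, g, b, a; } vertex;

static void scale_vertices(vertex *v, size_t n, float s)
{
    /* No iteration depends on another (no "horizontal" dependencies),
     * so all n elements could be transformed simultaneously. */
    for (size_t i = 0; i < n; i++) {
        v[i].x *= s;
        v[i].y *= s;
        v[i].z *= s;
    }
}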
Trend from pipeline to data parallelism
[Figure: three generations of geometry processing, trending from task to data parallelism. Clark "Geometry Engine" (1983): a task-parallel pipeline of stages (coordinate transform, 6-plane frustum clipping, divide by w, viewport). SGI 4D/GTX (1988): a pipeline of command processor, coord/normal transform, lighting, clip testing, clipping state, divide by w (clipping), primitive assembly, backface cull, and viewport. SGI RealityEngine (1992): round-robin distribution of work across identical data-parallel processors, followed by aggregation.]
Load balancing
Easy for data parallelism
Challenging for task parallelism
- Static balance is difficult to achieve
- And is insufficient even so:
  - Mode changes affect execution time (e.g., complex lighting)
  - Worse, data can affect execution time (e.g., clipping)
Unified architectures ease pipeline balance
- Pipeline is virtual, processors assigned as required
- The 8800 is unified
Unified pipeline architecture
[Figure: the GeForce 8800 diagram and OpenGL pipeline once more: the unified pool of streaming processors, scheduled by the Thread Processor, implements all three programmable stages (vertex, primitive, and fragment operations), while the fixed-function units handle assembly, rasterization, and the framebuffer. Source: NVIDIA]
Queueing
FIFO buffering (first-in, first-out) is provided between task stages
- Accommodates variation in execution time
- Provides elasticity to allow unified load balancing to work
FIFOs can also be unified (see the sketch below)
- Share a single large memory with multiple head-tail pairs
- Allocate as required

[Figure: pipeline stages (application, vertex assembly, vertex operations, primitive assembly, ...) separated by FIFOs]
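A minimal sketch of one such FIFO, assuming a single producer stage and a single consumer stage (the names and the power-of-two capacity are hypothetical). A unified design would carve several head-tail pairs like this out of one shared memory:

#include <stddef.h>

#define FIFO_SIZE 1024                /* must divide 2^32 for the % trick */

typedef struct {
    unsigned head, tail;              /* monotonically increasing counters */
    void *slot[FIFO_SIZE];
} fifo;

static int fifo_put(fifo *f, void *item)      /* producer side */
{
    if (f->head - f->tail == FIFO_SIZE)
        return 0;                     /* full: the upstream stage stalls */
    f->slot[f->head++ % FIFO_SIZE] = item;
    return 1;
}

static void *fifo_get(fifo *f)                /* consumer side */
{
    if (f->head == f->tail)
        return NULL;                  /* empty: the downstream stage idles */
    return f->slot[f->tail++ % FIFO_SIZE];
}

The full and empty cases are the elasticity at work: variation in stage execution time is absorbed by the buffer rather than immediately stalling the whole pipeline.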
In-order execution
Work elements must be sequence stamped
Can use FIFOs as reorder buffers as well
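A sketch of how stamping and reordering might fit together (all names hypothetical): each element receives a stamp at issue, may complete out of order on the parallel processors, and a stamp-indexed FIFO doubles as the reorder buffer at the output:

#define ROB_SIZE 256                  /* capacity check at issue omitted */

typedef struct {
    int done;                         /* set when a processor finishes it */
    /* ... result payload ... */
} slot;

static slot rob[ROB_SIZE];
static unsigned issue_seq;            /* next stamp to hand out */
static unsigned retire_seq;           /* next stamp allowed to leave */

static unsigned issue(void)           /* stamp a new work element */
{
    return issue_seq++;
}

static void complete(unsigned seq)    /* called on out-of-order completion */
{
    rob[seq % ROB_SIZE].done = 1;
}

static int retire(void)               /* drain results strictly in order */
{
    slot *s = &rob[retire_seq % ROB_SIZE];
    if (!s->done)
        return 0;                     /* head-of-line element not finished */
    s->done = 0;                      /* forward payload, free the slot */
    retire_seq++;
    return 1;
}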
Coherence
Two aspects of coherence
Data locality
- The data required for computation are "nearby"
Computational coherence
- Similar sequences of operations are being performed
Data locality
Prior to texture mapping:
- The vertex pipeline was a stream processor
- Each work element (vertex, primitive, fragment) carried all the state it needed
- Modal state was local to the pipeline stage
- Assembly stages operated on adjacent work elements
- Data locality was inherent in this model
Post texture mapping:
- All application-programmable stages have memory access (and use it)
- So the vertex pipeline is no longer a stream processor
- Data locality must be fought for …
Post-texture mapping data locality (simplified)
Modern memory (DRAM) operates in large blocks
- Memory is a 2-D array
- Access is to an entire row
To make efficient use of memory bandwidth, all the data in a block must be used
Two things can be done:
- Aggregate read and write requests (a toy sketch follows)
  - Memory controller and cache
  - Complex part of GPU design
- Organize memory contents coherently (blocking)
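A toy sketch of request aggregation under this DRAM model (every name here is hypothetical, and a real GPU memory controller is far more sophisticated). Batching the pending requests that fall in one row means the row is opened once and fully used:

#define ROW_BITS 10                          /* assume 1 KB rows */
#define MAX_PENDING 64

extern void open_row(unsigned row);          /* hypothetical DRAM ops */
extern void read_column(unsigned addr);
extern void close_row(unsigned row);

static unsigned pending[MAX_PENDING];        /* queued read addresses */
static int npending;

/* Service every pending request that falls in one row with a single row
 * activation (compaction of the pending queue omitted). */
static void service_row(unsigned row)
{
    open_row(row);
    for (int i = 0; i < npending; i++)
        if ((pending[i] >> ROW_BITS) == row)
            read_column(pending[i]);
    close_row(row);
}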
Texture Blocking
[Figure: 6D texture organization. Cache size: 4x4 blocks; cache line size: 4x4 texels. A texel address is formed from a base address and three (s,t) bit-field pairs: (s3,t3) select a block, (s2,t2) a cache line within the block, and (s1,t1) a texel within the line. Source: Pat Hanrahan]
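A sketch of the address computation such a blocking scheme implies, assuming 4x4-texel lines grouped into 4x4-line blocks (2 bits per field) and one byte per texel; the field order and sizes here are assumptions, not the figure's specification:

/* Split s and t into texel-, line-, and block-level bit fields, then
 * concatenate so that texels close in 2-D are close in memory. */
unsigned blocked_address(unsigned base, unsigned s, unsigned t,
                         unsigned width_in_blocks)
{
    unsigned s1 = s & 3, s2 = (s >> 2) & 3, s3 = s >> 4;
    unsigned t1 = t & 3, t2 = (t >> 2) & 3, t3 = t >> 4;
    unsigned block = t3 * width_in_blocks + s3;  /* which 16x16 block */
    unsigned line  = (t2 << 2) | s2;             /* which 4x4 line in it */
    unsigned texel = (t1 << 2) | s1;             /* which texel in the line */
    return base + ((block << 8) | (line << 4) | texel);
}

A whole 4x4 cache line is fetched for any texel in it, so a bilinear footprint that stays inside one line costs a single memory transaction.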
Computational coherence
Data parallelism is computationally coherent
- Simultaneously doing the same thing to similar data
- Can share a single instruction sequencer with multiple data paths:

[Figure: one instruction fetch-and-execute unit driving multiple identical vertex data paths]

SIMD – Single Instruction Multiple Data
SIMD processing
[Figure: the GeForce 8800 diagram with one of its eight 16-wide SIMD processors highlighted. Why not use one 128-wide processor?]
SIMD conditional control flow
The "shader" abstraction operates on each data element independently
But a SIMD implementation shares a single execution unit across multiple data elements
If data elements in the same SIMD unit branch differently, the execution unit must follow both paths (sequentially)
The solution is predication (sketched below):
- Both paths are executed
- Data paths are enabled only during their selected path
- Can be nested
- Performance is obviously lost!
SIMD width is a compromise:
- Too wide → too much performance loss due to predication
- Too narrow → inefficient hardware implementation
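A scalar model of predication, assuming a 4-wide SIMD unit running the shader fragment "if (x > 0) y = a; else y = b;" (names are hypothetical, and real hardware uses per-lane enable masks rather than branches):

#define WIDTH 4

void predicated_select(const float *x, float *y, float a, float b)
{
    int mask[WIDTH];
    for (int lane = 0; lane < WIDTH; lane++)   /* evaluate the condition */
        mask[lane] = x[lane] > 0.0f;
    for (int lane = 0; lane < WIDTH; lane++)   /* "then" path, masked */
        if (mask[lane]) y[lane] = a;
    for (int lane = 0; lane < WIDTH; lane++)   /* "else" path, inverted mask */
        if (!mask[lane]) y[lane] = b;
    /* Both paths always execute; divergent lanes therefore pay the sum of
     * the two path lengths, which is the performance loss noted above. */
}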
Latency
Again two issues
Overall rendering latency
- Typically measured in frames
- Of concern to application programmers
- Short on modern GPUs (more from Dave Oldcorn on this)
- But GPUs with longer rendering latencies have been designed
- Fun to talk about in a graphics architecture course
Memory access latency
- Typically measured in clock cycles (and reaching thousands of those)
- Of direct concern to GPU architects and implementors
- But useful for application programmers to understand too!
Multi-threading
Another kind of processor virtualization
- Unified GPUs share a single execution engine among multiple pipeline (task) stages
  - Equivalent to CPU multi-tasking
- Multi-threading shares a single execution engine among multiple data-parallel work elements
  - Similar to CPU hyper-threading
The 8800 Ultra multi-threading mechanism is used to support both multi-tasking and data-parallel multi-threading
A thread is a data structure (more live registers mean more memory usage):

struct {
    int pc;        // program counter
    float reg[n];  // live register state
    enum ctxt;     // context information
    …
} thread;
Multi-threading
[Figure: the GeForce 8800 diagram once more; the Thread Processor schedules threads across the streaming processors. Source: NVIDIA]
Multi-threading hides latency
[Figure: threads cycle between a "ready to run" pool and a "blocked" pool. A memory reference (or the resulting data dependency) moves a thread to the blocked pool; when the memory data become available (dependency resolved), the thread returns to the ready pool and the instruction fetch-and-execute unit can issue it again.]
The processor stalls if no threads are ready to run, a possible result of large thread context (too many live registers).
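A sketch of the mechanism in the figure, with hypothetical names and structure (this is not the 8800's actual scheduler):

typedef struct thread {
    int pc;                       /* program counter */
    float reg[32];                /* live register state (size assumed) */
    struct thread *next;
} thread;

static thread *ready;             /* threads that can issue now */
static thread *blocked;           /* threads waiting on memory */

extern int run_until_memref(thread *t);   /* hypothetical: 1 = memory ref */

static void issue_cycle(void)
{
    if (!ready)
        return;                   /* processor stalls: nothing ready to run */
    thread *t = ready;
    ready = t->next;
    if (run_until_memref(t)) {    /* memory reference: park the thread */
        t->next = blocked;
        blocked = t;
    } else {                      /* otherwise return it to the ready pool */
        t->next = ready;
        ready = t;
    }
}

static void data_arrived(thread *t)   /* dependency resolved: unblock */
{                                     /* (removal from blocked list elided) */
    t->next = ready;
    ready = t;
}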
Cache and thread store
CPU
- Uses cache to hide memory latency
- Caches are huge (many MBs)
GPU
- Uses cache to aggregate memory requests and maximize effective bandwidth
- Caches are relatively small
- Uses multithreading to hide memory latency
- Thread store is large
Total memory usage on CPU and GPU chips is becoming similar …
Programmability
Programmability trade-offs
Fixed-function:
- Efficient in die area and power dissipation
- Rigid in functionality
- Simple
Programmable:
- Wasteful of die area and power
- Flexible and adaptable
- Able to manage complexity
Programmability is not new
The Silicon Graphics VGX (1990) supported programmable vertex, primitive, and fragment operations.
- These operations are complex and require flexibility and adaptability
- The assembly operations are relatively simple and have few options
- Texture fetch and filter are also simple and benefit from fixed-function implementation
What is new is allowing application developers to write vertex, primitive, and fragment shaders.
Questions
Why insist on in-order processing?
Even Direct3D 10 does
- Testability (repeatability)
- Invariance for multi-pass rendering (repeatability)
- Utility of painter's algorithm
- State assignment!
Why can’t fragment shaders access the framebuffer?
Equivalent to: why do other people's block diagrams distinguish between fragment operations and framebuffer operations?
Simple answer: cache consistency
[Figure: the OpenGL pipeline, with fragment operations and framebuffer shown as separate stages]
Why hasn’t tiled rendering caught on?
It seems very attractive:
- Small framebuffer (that can be on-die in some cases)
- Deep framebuffer state (e.g., for transparency sorting)
- High performance
Problems:
- May increase rendering latency
- Has difficulty with multi-pass algorithms
- Doesn't match the OpenGL/Direct3D abstraction
Summary
Parallelism
- Graphics is inherently highly data and task parallel
- Challenges include in-order execution and load balancing
Coherence
- Streaming is inherently data and instruction coherent
- But texture fetch breaks the streaming model and its data coherence
- Reference aggregation and memory layout restore data coherence
Latency
- Modern GPU implementations have minimal rendering latency
- Multithreading (not caching) hides (the large) memory latency
Programmability
- "Operation" stages are (and have long been) programmable
- Assembly stages, texture filtering, and ROPs typically are not
- Application programmability is new
Assignments
Next lecture: Performance Tuning and Debugging (guest lecturer Dave Oldcorn, AMD)
Reading assignment for Tuesday's class:
- Sections 2.8 (vertex arrays) and 2.9 (buffer objects) of the OpenGL 2.1 specification
Short office hours today
End