NVIDIA GeForce Ryan Hendrixson Ryan Schubert Allison Walthall

What Does a GPU Actually Do?

Historically, the GPU's role has grown:
– From acting simply as a frame buffer
– To doing vertex transformations and pixel color calculations
– To now even being programmable

In the simplest sense, a modern GPU implements a 3D rendering pipeline
3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image:
3D Geometric Primitives
– Modeling Transformation: transform into 3D world coordinate system
– Lighting: illuminate according to lighting and reflectance
– Viewing Transformation: transform into 3D camera coordinate system
– Projection Transformation: transform into 2D screen coordinate system
– Clipping: clip primitives outside camera’s view
– Scan Conversion: draw pixels
Image
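As a rough illustration of the transformation stages, the sketch below pushes a single vertex through modeling, viewing, and projection in plain Python. The matrices, the `focal` parameter, and all function names are assumptions made for this example; lighting, clipping, and scan conversion are left out.

```python
# Minimal sketch: one vertex through modeling, viewing, and projection.
# All names and matrix values are illustrative assumptions.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def project(v, focal=1.0):
    """Simple perspective projection onto the plane z = -focal.
    Assumes the camera looks down the negative z axis."""
    x, y, z, _ = v
    return (focal * x / -z, focal * y / -z)

# Object-space vertex in homogeneous coordinates.
vertex = [1.0, 1.0, 0.0, 1.0]

# Modeling transformation: place the object in the world at z = -4.
world = mat_vec(translation(0.0, 0.0, -4.0), vertex)

# Viewing transformation: the camera sits at (0, 0, 1), so shift by -1 in z.
camera = mat_vec(translation(0.0, 0.0, -1.0), world)

# Projection transformation: 3D camera space to 2D screen coordinates.
screen = project(camera)
print(world, camera, screen)
```

On real hardware these three transforms are usually multiplied into one matrix and applied per vertex; they are kept separate here only to mirror the stage names.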
Modern OpenGL Pipeline
[Diagram: Application (CPU) → Vertices (3D) → Vertex Processor → Xformed, Lit Vertices (2D) → Assembly & Rasterization → Fragments (pre-pixels) → Pixel Processor → Final pixels (Color, Depth) → Video Memory (Textures), with a Render-to-texture path. Other labels: Graphics State, GPU.]
Programmable Vertex Processor
Programmable Fragment (Pixel) Processor
OpenGL vs. DirectX
OpenGL:
• Just graphics
• Standard C interfaces
• State machine
• Multiple platforms
• Academic use
DirectX:
• Graphics, multimedia, etc.
• C++ interfaces
• Object oriented
• Windows
• PC games
Possible GPU Performance
Bottlenecks

CPU/Bus Bound
– Simply not able to send enough vertices to the card
to keep it busy

Vertex Bound
– Vertex processing engine is fully loaded, while the
fragment engine is just waiting and grabbing data as
soon as it’s ready

Pixel Bound
– The fragment engine is fully loaded, causing the
vertex engine to have to wait before sending more
data
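One way to reason about these three cases is that the pipeline runs only as fast as its slowest stage. The sketch below labels the limiting stage; the stage rates are made-up numbers and `find_bottleneck` is a hypothetical helper, not part of any graphics API.

```python
# Sketch: a pipeline's throughput is limited by its slowest stage.
# Stage names mirror the three cases above; the rates are made-up numbers.

def find_bottleneck(rates):
    """Return (stage, rate) for the stage with the lowest sustainable rate.

    `rates` maps a stage name to the frames per second that stage could
    deliver if it were never starved or blocked by its neighbours.
    """
    stage = min(rates, key=rates.get)
    return stage, rates[stage]

example = {
    "cpu_bus": 240.0,  # frames/s the CPU and bus can feed
    "vertex": 180.0,   # frames/s the vertex engine can transform
    "pixel": 75.0,     # frames/s the fragment engine can shade
}

stage, fps = find_bottleneck(example)
print(f"{stage} bound at about {fps:.0f} frames/s")
```

In practice the limiting stage is usually found by experiment, for example by shrinking the render resolution and seeing whether the frame rate changes.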
Early History
• NVIDIA founded in 1993
• 1997: RIVA
• 1998: RIVA TNT
• 1999: GeForce 256 (NV10)

GeForce 256 (NV10)
• Lighting and transformation
• DDR and SDR
• HDTV compliant
• Hardware alpha-blending
• 4 pixel pipelines at 120 MHz
• Fill Rate: 480 Megapixels/second
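The fill rate follows from the two figures before it: number of pipelines times clock rate. A quick check, assuming each pipeline emits one pixel per clock (the function name is just for illustration):

```python
def fill_rate_mpixels(pipelines, clock_mhz, pixels_per_pipe_per_clock=1):
    """Peak fill rate in megapixels per second."""
    return pipelines * clock_mhz * pixels_per_pipe_per_clock

geforce256 = fill_rate_mpixels(4, 120)
print(geforce256)  # 480
```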

GeForce2

2000: GeForce 2 GTS:
– Doubled the pixel fill rate
– Quadrupled the texel fill rate
– Increased clock speed
– Multi-texturing
– S3TC, MPEG-2, FSAA
Anti-Aliasing
[Figure: the same image without and with anti-aliasing]
GeForce2
• 2000: GeForce 2 MX
– Halved the number of pixel pipelines, making it cost effective
– TwinView
– Compatible with Macs
GeForce2
• Jan 2001: Apple selected GeForce2 MX as default high-end graphics solution for Power Mac G4
• August 2000: GeForce2 Ultra
• November 2000: GeForce2 Go
• December 2000: NVIDIA buys 3dfx

GeForce3
2001: GeForce3 (NV20)
– 240 MHz Core/500 MHz Memory
– 57 million transistors
– 46-76 Gigaflops
– Vertex shader technology
– Pixel shader technology
– LightSpeed Memory architecture
LightSpeed Memory Architecture
GeForce4
2002: GeForce4 Ti (NV25) and MX (NV17)
– Ti:
• 4200, 4400, 4600, and 4800 versions
• 63 million transistors
• Chip clock 225-300 MHz
• Memory clock 500-650 MHz
• 75-100 million vertices/second
GeForce FX
November 2002: GeForce FX (NV30)
– 16 variations for different price ranges
– 125 million transistors
– 8 pixels/clock
– 1 TMU/pipe (16 textures/unit)
– 128-bit memory interface
– 128 MB/256 MB memory size support
GeForce 6 series
GeForce 6 series (NV40)
– 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
– Core clock speed 450 MHz
– Memory clock speed 600 MHz
– Vertex shader units: 6 4-wide fp32 vector MADDs per clock cycle
– Pixel shader units: 16 4-wide fp32 vector MADDs per clock cycle
GeForce 6 series
• Superscalar 16-pipe architecture
• CineFX 3.0 engine
• All operations done in FP32 precision per component
• 200 Gigaflops (compare this to the Itanium’s 6.4 Gigaflops)
General Diagram (6800/NV40)
TurboCache
• Uses PCI-Express bandwidth to render directly to system memory
• Card needs less memory
• Performance boost while lowering cost
• TurboCache Manager dynamically allocates from main memory
• Local memory used to cache data and to deliver peak performance when needed

NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components) and one scalar FP32 operation, and make one texture access, per clock cycle.
NV40 Fragment Processors
Early termination from mini z buffer and z
buffer checks; resulting sets of 4 pixels
(quads) passed on to fragment units
Programmable 2D and Video
Processor

Can be used for video decoding and
encoding (IDCT, deinterlacing, color model
transformations, etc.)
Why NV40 series was better
• Massive parallelism
• Scalability
– Lower end products have fewer pixel pipes and fewer vertex shader units
• Computation Power
– 222 million transistors
– First to comply with Microsoft’s DirectX 9.0c (Shader Model 3.0) spec
• Dynamic Branching in pixel shaders
Dynamic Branching
• Helps detect if pixel needs shading
• Instruction flow handled in groups of pixels
• Specify branch granularity (the number of consecutive pixels that take the same branch)
• Better distribution of blocks of pixels between the different quad engines
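To see why branch granularity matters, the sketch below models a row of pixels processed in fixed-size groups: a group pays only for the branch its pixels take, unless they disagree, in which case it pays for both sides. The cost values, the group size, and the `branch_cost` function are illustrative assumptions, not measured NV40 behaviour.

```python
# Sketch: cost of a dynamic branch when pixels are processed in groups.
# Costs and group size are illustrative assumptions, not measured values.

def branch_cost(needs_shading, group_size, cost_taken=10, cost_skipped=1):
    """Total cost over a row of pixels processed in fixed-size groups."""
    total = 0
    for start in range(0, len(needs_shading), group_size):
        group = needs_shading[start:start + group_size]
        if all(group):
            total += cost_taken * len(group)
        elif not any(group):
            total += cost_skipped * len(group)
        else:
            # Divergent group: every pixel steps through both paths.
            total += (cost_taken + cost_skipped) * len(group)
    return total

# 16 pixels: the first 8 need shading, the last 8 do not.
coherent = [True] * 8 + [False] * 8
# Same number of shaded pixels, but alternating, so every group diverges.
scattered = [True, False] * 8

print(branch_cost(coherent, group_size=4))   # 88
print(branch_cost(scattered, group_size=4))  # 176
```

With the same amount of useful work, the scattered pattern costs twice as much in this model, which is why coherent blocks of pixels branch more cheaply.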

GeForce 7 series
7800 GT
• $449
• 7 vertex units
• 20 pixel pipelines
• Clock speed 400 MHz
• Memory clock speed 500 MHz
7800 GTX
• $600
• 8 vertex units
• 24 pixel pipelines
• Clock speed 430 MHz
• Memory clock speed 600 MHz
GeForce 7800
• 302 million transistors
• 200 Gigaflops of multiply/add calculations
• 128-bit floating point precision through the entire rendering pipeline
• Fill Rate: 10.3 Gigatexels/second
• 860 million vertices/sec
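The fill rate and vertex rate quoted here are consistent with the 7800 GTX figures given earlier (24 pixel pipelines, 8 vertex units, 430 MHz). The check below assumes one texel per pipeline per clock and, as a further assumption that reproduces the quoted number, four clocks per vertex in each vertex unit.

```python
PIXEL_PIPELINES = 24
VERTEX_UNITS = 8
CORE_CLOCK_MHZ = 430

# Assumption: one texel per pixel pipeline per clock.
texel_rate_g = PIXEL_PIPELINES * CORE_CLOCK_MHZ / 1000

# Assumption: each vertex unit needs 4 clocks per vertex.
CLOCKS_PER_VERTEX = 4
vertex_rate_m = VERTEX_UNITS * CORE_CLOCK_MHZ / CLOCKS_PER_VERTEX

print(texel_rate_g, vertex_rate_m)  # 10.32 860.0
```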
ALU Units in Pixel Processor

Sub-unit 1:
– NV40: fetches texture data and can issue a MUL vector instruction or use its mini-ALU to issue a non-vector instruction
– G70: same, but can also issue a multiply/add

Sub-unit 2:
– NV40: can issue a multiply/add vector instruction or use its own mini-ALU to issue a non-vector instruction
– G70: same
GeForce 6 vs. GeForce 7
• ALU Units
– G70: 24 ALU Units
– NV40: 16 ALU Units
• Register file: same size
• Texture samplers the same but when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory

GeForce 6 vs. GeForce 7 (speculative)
• Increased L2 texture cache (to around 12KB)
• Better cache re-use with larger textures, decompressing those larger textures into L1 faster
• Possibly offering more granularity in cache access by the GPU, to reduce texture bandwidth, speeding up rendering.

GeForce 6 vs. GeForce 7
• 33% more vertex units, each with more performance
• Improved vertex fetch unit (unconfirmed by NVIDIA)
• Triangle setup and rasteriser optimized via the use of a new raster pattern (again unconfirmed by NVIDIA)

General Diagram (7800/G70)
32-bit IEEE floating-point throughout pipeline (NV40)
• Framebuffer
• Textures
• Fragment processor
• Vertex processor
• Interpolants
• GeForce 7800 (G70) supports 128-bit through the entire pipeline!

Hardware supports several other
data types

Fragment processor also supports:
– 16-bit “half” floating point
– 12-bit fixed point
– These may be faster than 32-bit on some HW

Framebuffer/textures also support:
– Large variety of fixed-point formats
– E.g., classical 8-bit per component
– These formats use less memory bandwidth than FP32
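The trade-off between these formats can be sketched with Python's `struct` module, which supports IEEE half precision through the `e` format code. The example compares bytes per four-component pixel and shows the rounding error of a half-float round trip; the 12-bit fixed-point format has no `struct` equivalent and is not covered. The names `FORMATS` and `to_half` are made up for this illustration.

```python
import struct

# Bytes needed for one RGBA pixel (4 components) in each format.
FORMATS = {
    "fp32": "4f",   # 32-bit float per component
    "fp16": "4e",   # 16-bit "half" float per component
    "8-bit": "4B",  # classical 8-bit fixed point per component
}
bytes_per_pixel = {name: struct.calcsize("<" + fmt) for name, fmt in FORMATS.items()}
print(bytes_per_pixel)  # {'fp32': 16, 'fp16': 8, '8-bit': 4}

def to_half(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Precision lost by storing 0.1 as a half float.
error = abs(to_half(0.1) - 0.1)
print(error)
```

An FP32 framebuffer moves four times as many bytes per pixel as the classical 8-bit format, which is the bandwidth cost the last bullet refers to.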
How are current GPUs different
from CPUs?
GPU is a stream processor
• Multiple programmable processing units
• Connected by data flows
[Diagram: Application, Vertex Processor, Assembly & Rasterization, Fragment Processor, Framebuffer Operations, Framebuffer, and Textures, connected by data flows]
How are current GPUs different
from CPUs?
Optimized for 4-vector arithmetic
– Useful for graphics: colors, vectors, texcoords
– Easy way to get high performance/cost
– SIMD/MIMD
GPU Memory Model vs CPU’s

Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
• Registers
– Read/write
• Local memory
– Does not exist
• Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed address)
• Disk access
– Does not exist
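The restrictions above amount to a "gather only" programming style: a kernel runs once per output element, may read any input location, and produces exactly one value that lands at its own pre-computed address. The Python sketch below imitates that style on the CPU; `run_kernel` and `blur_kernel` are hypothetical names, not part of any GPU API.

```python
# Sketch of the restricted memory model: read-only inputs, one output
# value per kernel invocation, written to a fixed output address.

def run_kernel(kernel, inputs, size):
    """Apply `kernel` at every output index; inputs are read-only."""
    frozen = tuple(inputs)  # global memory: read-only during the pass
    # Each result is written once, at the end, to output address i.
    return [kernel(i, frozen) for i in range(size)]

def blur_kernel(i, src):
    """Average each element with its neighbours, clamping at the edges."""
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, len(src) - 1)]
    return (left + src[i] + right) / 3.0

data = [0.0, 3.0, 6.0, 9.0]
print(run_kernel(blur_kernel, data, len(data)))  # [1.0, 3.0, 6.0, 8.0]
```

A kernel that wants to write to an address it computes itself (a scatter) does not fit this model and has to be reformulated as a gather.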
GPU Memory Model

Where is GPU Data Stored?
– Vertex buffer
– Frame buffer
– Texture
VS 3.0 GPUs
[Diagram: Texture, Vertex Buffer, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s)]
GPGPU and Motivation

GPUs are fast…
– Itanium: 6.4 GFLOPS
– GeForce 7800: 200 GFLOPS
– GPUs are getting faster, faster
– CPUs: annual growth ≈ 1.5× → decade growth ≈ 60×
– GPUs: annual growth > 2.0× → decade growth > 1000×
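The decade figures are just the annual factors compounded over ten years, which a short calculation confirms:

```python
# Compound the annual growth factors over ten years.
cpu_decade = 1.5 ** 10
gpu_decade = 2.0 ** 10
print(round(cpu_decade, 1), gpu_decade)  # 57.7 1024.0
```

So 1.5× per year gives roughly 60× per decade, and anything above 2.0× per year gives more than 1000×.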
Motivation: Computational Power
[Chart: computational power of GPU vs. CPU. Courtesy Naga Govindaraju]
GPGPU
• Good for inherently parallel applications
• Rapidly evolving ISA and HW architecture
– Largely secret
• Can’t simply “port” code written for the CPU!
Programs are Shaders

Bound by the specific hardware profile:
– E.g. different cards have different supported
hardware, OpenGL has different restrictions than
DirectX, etc

Hardware profiles change relatively drastically as
new GPUs are developed
– But typically new profiles only add features, so there
is generally still backwards compatibility (but not
always)
Vertex processor
• 256 instructions per program originally (effectively higher with branching)
– Now up to 65535 instructions
• Executes on all vertices
• Outputs new vertices or texture coordinates, etc.

Fragment Processor Flow Chart
Fragment processor has flexible texture mapping
• Memory is accessible through texture reads
• Texture reads are just another instruction
• Allows computed texture coordinates, nested to arbitrary depth
• Allows multiple uses of a single texture unit

Additional fragment processor capabilities
• Read access to window-space position
• Read/write access to fragment Z
• Built-in derivative instructions
– Partial derivatives w.r.t. screen-space x or y
– Useful for anti-aliasing
• Conditional fragment-kill instruction
• Multiple FP formats supported
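One common way to picture the derivative instructions is as finite differences between neighbouring fragments in a 2x2 quad. The sketch below assumes that model; the `quad` layout and the `ddx`/`ddy` names are chosen for illustration and do not describe how any particular chip implements them.

```python
# Sketch: screen-space derivatives by finite differences inside a 2x2 quad.
# quad[y][x] holds a per-fragment value (for example a texture coordinate)
# at the four pixels of the quad.

def ddx(quad):
    """Change of the value per pixel step in screen-space x."""
    return quad[0][1] - quad[0][0]

def ddy(quad):
    """Change of the value per pixel step in screen-space y."""
    return quad[1][0] - quad[0][0]

# A texture coordinate that advances 0.25 per pixel in x and 0.5 in y.
quad = [[0.0, 0.25],
        [0.5, 0.75]]
print(ddx(quad), ddy(quad))  # 0.25 0.5

# A large derivative means the texture is minified on screen, which is
# the cue for sampling a blurrier version to avoid aliasing.
footprint = max(abs(ddx(quad)), abs(ddy(quad)))
```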

Fragment processor limitations

Originally no branching
– Now supports dynamic branching (but it’s still costly)

No indexed reads from registers
– Use texture reads instead

No memory writes
Branching Instruction Costs
(GeForce 6800)
Fragment shaders
• Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
• New cards support dynamic branching (but it still incurs some performance penalty)
• Now have the ability to output to multiple render targets

CineFX 4.0 Engine
• A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
• A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
• An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.

Vertex Shaders
• The 7800 has 8 vertex shaders
• The Triangle Setup stage turns the vertex points into a triangle
• It also determines mathematically the rasterization for each triangle
• Accelerating triangle setup increases the total throughput of the 3D pipeline
Theoretical Rasterization Pattern of a
Triangle
New Pixel Shader – MADD
• Multiply and accumulate are commonly used math functions in 3D graphics
• MADD stands for Multiply-ADD operation
• The 7800 can do twice as many MADD operations as previous GPUs could
• This allows developers to create much more complex visual effects
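For readers unfamiliar with the operation, a MADD computes a * b + c in one step, and many graphics computations reduce to chains of it. The sketch below shows a 4-wide MADD and a dot product built from repeated multiply-adds; the function names and sample values are illustrative only.

```python
# Sketch: a 4-wide multiply-add (MADD), d = a * b + c per component,
# and a dot product built from repeated scalar multiply-adds.

def madd4(a, b, c):
    """Component-wise a * b + c on 4-component vectors."""
    return tuple(x * y + z for x, y, z in zip(a, b, c))

def dot4(a, b):
    """Dot product as a chain of multiply-adds into an accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = x * y + acc  # one multiply-add per component
    return acc

# Scale a color by a light intensity and add an ambient term.
color = (0.5, 0.25, 1.0, 1.0)
light = (2.0, 2.0, 2.0, 1.0)
ambient = (0.1, 0.1, 0.1, 0.0)
print(madd4(color, light, ambient))

print(dot4((1.0, 2.0, 3.0, 4.0), (4.0, 3.0, 2.0, 1.0)))  # 20.0
```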

Transparency Adaptive Supersampling
• Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
• Pixels inside of a polygon are usually not touched by anti-aliasing methods
• With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.

Transparency Adaptive Multisampling
• Higher levels of performance, because it uses one texel to determine other subpixel values
• Not as high quality

Supporting the Future

The 7800 is already set up to support the
new Microsoft Longhorn OS with some of
the following advancements
– Video post-processing
– Real-time desktop compositing
– Seamless multiple 3D applications
– Accelerated antialiased text rendering
– Special effects and animation
Accelerated Graphics Port (AGP)
• AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
• Uses sideband addressing
• PCI must load a texture from the hard drive into the system’s RAM, then from the RAM into the GPU framebuffer
• AGP can read textures directly from system RAM by “tricking” the CPU into believing the textures are in the framebuffer, when they are really in system memory

PCI Express
• Based on the PCI system, allowing for backwards compatibility
• Uses 1-bit, bi-directional lanes (PCI used a bus)
• Each lane can support 250 MB/s in each direction (4 GB/s total for 16 lanes)
– AGP is only 2 GB/s
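The 4 GB/s total is the per-lane rate multiplied by the 16 lanes of a graphics slot. A small helper (hypothetical name, taking 1 GB as 1000 MB) makes the scaling explicit:

```python
LANE_MB_PER_S = 250  # per lane, per direction (figure from the slide)

def pcie_bandwidth_gb(lanes, mb_per_lane=LANE_MB_PER_S):
    """One-direction bandwidth in GB/s, taking 1 GB = 1000 MB."""
    return lanes * mb_per_lane / 1000

for lanes in (1, 4, 8, 16):
    print(lanes, pcie_bandwidth_gb(lanes))
```

An x16 link comes out at 4.0 GB/s in each direction, twice the 2 GB/s quoted for AGP.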
Scalable Link Interface (SLI)
• Takes advantage of the PCI Express bus, which allows more than one discrete graphics device on the same PCI host
• Allows two of the same GeForce GPUs to run on one machine, thus “sharing” load.
• There are two modes for this
– Split-frame Rendering (SFR)
– Alternate-frame Rendering (AFR)
Split-frame Rendering
• Has each GPU render a portion of the screen, split horizontally
• No extra latency
• Not necessarily evenly split
– SFR is load shared, so it splits up the frame by the amount of work, not the size
• A large amount of overhead is involved, causing a max speed up of around 1.8 times
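Splitting by work rather than by size can be sketched as choosing the scanline that balances estimated cost between the two GPUs. The code below is an illustration only: `row_cost` is a hypothetical per-scanline cost estimate, and nothing here is taken from NVIDIA's actual driver logic.

```python
# Sketch: pick the horizontal split for split-frame rendering so that both
# GPUs get about the same amount of work rather than the same number of rows.

def balanced_split(row_cost):
    """Return the row index where the frame is split.

    Rows [0, split) go to GPU 0 and rows [split, len) go to GPU 1.
    The split minimises the difference in summed cost between the halves.
    """
    total = sum(row_cost)
    best_split, best_diff = 0, total
    top = 0
    for split in range(1, len(row_cost)):
        top += row_cost[split - 1]
        diff = abs(top - (total - top))
        if diff < best_diff:
            best_split, best_diff = split, diff
    return best_split

# Sky at the top is cheap; detailed ground at the bottom is expensive.
costs = [1, 1, 1, 1, 1, 1, 4, 4, 4, 4]
split = balanced_split(costs)
print(split, sum(costs[:split]), sum(costs[split:]))  # 7 10 12
```

Here GPU 0 gets seven of the ten rows but slightly less work than GPU 1, which handles the three most expensive rows.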
Alternate-frame Rendering
• Avoids all the overhead problems of SFR
• Many buffer swaps
• Reliant on the speed of the processor
• Can cause latency issues
• Recommended mode by NVIDIA
GeForce Go 7800 GTX
• The mobile version of the 7800 GTX
• Everything from the desktop release has been carried over to this
• Can switch between x1 and x16 lanes of PCI Express
• Uses PowerMizer 6.0, which allows this chip to operate in the same power envelope as its predecessor, the 6800
GeForce Go 7800 – Power Issues
• Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
• Dynamic clock scaling can run as slow as 16 MHz
– This is true for the engine, memory, and pixel clocks
• Heavier use of clock gating than the desktop version
• Runs at voltages lower than any other mobile performance part
• Regardless, you won’t get much battery-based runtime for a 3D game
Questions?