NVIDIA GeForce
Ryan Hendrixson, Ryan Schubert, Allison Walthall

What Does a GPU Actually Do?
Historically, from:
– Acting simply as a frame buffer
– Doing vertex transformations and pixel color calculations
– Now even programmable
In the simplest sense, a modern GPU implements a 3D rendering pipeline.

3D Rendering Pipeline (direct illumination)
This is a pipelined sequence of operations to draw a 3D primitive into a 2D image:
– 3D Geometric Primitives
– Modeling Transformation: transform into 3D world coordinate system
– Lighting: illuminate according to lighting and reflectance
– Viewing Transformation: transform into 3D camera coordinate system
– Projection Transformation: transform into 2D screen coordinate system
– Clipping: clip primitives outside camera's view
– Scan Conversion: draw pixels
– Image

Modern OpenGL Pipeline
[Diagram: Application (CPU) → Vertices (3D) → Vertex Processor → Xformed, Lit Vertices (2D) → Assembly & Rasterization → Fragments (pre-pixels) → Pixel Processor → Final pixels (Color, Depth) → Video Memory (Textures), with render-to-texture feeding back into the pipeline and graphics state supplied by the application. On the GPU, the vertex processor and the fragment (pixel) processor are programmable.]

OpenGL vs. DirectX
OpenGL:
– Just graphics
– Standard C interfaces
– State machine
– Multiple platforms
– Academic use
DirectX:
– Graphics, multimedia, etc.
– C++ interfaces
– Object oriented
– Windows
– PC games

Possible GPU Performance Bottlenecks
– CPU/bus bound: simply not able to send enough vertices to the card to keep it busy
– Vertex bound: the vertex processing engine is fully loaded, while the fragment engine is just waiting and grabbing data as soon as it is ready
– Pixel bound: the fragment engine is fully loaded, causing the vertex engine to wait before sending more data

Early History
– NVIDIA founded in 1993
– 1997: RIVA
– 1998: RIVA TNT
– 1999: GeForce 256 (NV10)

GeForce 256 (NV10)
– Lighting and transformation
– DDR and SDR
– HDTV compliant
– Hardware alpha-blending
– 4 pixel pipelines at 120 MHz
– Fill rate: 480 megapixels/second

GeForce2
2000: GeForce2 GTS
– Doubled the pixel fill rate
– Quadrupled the texel fill rate
– Increased clock speed
– Multi-texturing
– S3TC, MPEG-2, FSAA

Anti-Aliasing
[Figure: the same scene without anti-aliasing and with anti-aliasing]

GeForce2
2000: GeForce2 MX
– Halved the number of pixel pipelines, making it cost effective
– TwinView
– Compatible with Macs

GeForce2
– August 2000: GeForce2 Ultra
– November 2000: GeForce2 Go
– December 2000: NVIDIA buys 3dfx
– January 2001: Apple selected the GeForce2 MX as the default high-end graphics solution for the Power Mac G4

GeForce3
2001: GeForce3 (NV20)
– 240 MHz core / 500 MHz memory
– 57 million transistors
– 46-76 gigaflops
– Vertex shader technology
– Pixel shader technology
– LightSpeed Memory Architecture

[Figure: LightSpeed Memory Architecture]

GeForce4
2002: GeForce4 Ti (NV25) and MX (NV17)
– Ti: 4200, 4400, 4600, and 4800 versions
– 63 million transistors
– Chip clock 225-300 MHz
– Memory clock 500-650 MHz
– 75-100 million vertices/second

GeForce FX
November 2002: GeForce FX (NV30)
– 16 variations for different price ranges
– 125 million transistors
– 8 pixels/clock
– 1 TMU/pipe (16 textures/unit)
– 128-bit memory interface
– 128 MB/256 MB memory size support

GeForce 6 series
GeForce 6 series (NV40)
– 6200; 6600 GT and Ultra; 6800 GT, Ultra, and Ultra Extreme
– Core clock speed 450 MHz
– Memory clock speed 600 MHz
– Vertex shader units: 6 4-wide fp32 vector MADDs per clock cycle
– Pixel shader units: 16 4-wide fp32 vector MADDs per clock cycle

GeForce 6 series
– Superscalar 16-pipe architecture
– CineFX 3.0 engine
– All operations done in FP32 precision per component
– 200 gigaflops (compare this to the Itanium's 6.4 gigaflops)

[Figure: General Diagram (6800/NV40)]

TurboCache
– Uses PCI Express bandwidth to render directly to system memory
– Card needs less memory
– Performance boost while lowering cost
– TurboCache Manager dynamically allocates from main memory
– Local memory is used to cache data and to deliver peak performance when needed

[Figure: TurboCache]

NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.

NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.

Programmable 2D and Video Processor
Can be used for video decoding and encoding (IDCT, deinterlacing, color model transformations, etc.)
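
The throughput figures quoted on the GeForce 256 and GeForce 6 series slides follow from simple per-clock arithmetic. The sketch below is illustrative only: the function names are hypothetical, and counting one MADD as two floating-point operations is an assumed convention that the slides do not state. It reproduces the 480 megapixels/second fill rate and estimates the MADD-only peak of the NV40 shader units. It does not attempt to reproduce the 200 gigaflops headline figure, which presumably counts more than these MADD units.

```python
def fill_rate(pixel_pipelines, clock_hz):
    """Peak pixels per second, assuming one pixel per pipeline per clock."""
    return pixel_pipelines * clock_hz


def madd_gflops(units, lanes, clock_hz, flops_per_madd=2):
    """Peak GFLOPS from vector MADD units (assumes multiply + add = 2 flops)."""
    return units * lanes * flops_per_madd * clock_hz / 1e9


# GeForce 256: 4 pixel pipelines at 120 MHz
geforce256_fill = fill_rate(4, 120e6)

# NV40: 16 pixel shader units and 6 vertex shader units, 4-wide, 450 MHz core
nv40_pixel_gflops = madd_gflops(16, 4, 450e6)
nv40_vertex_gflops = madd_gflops(6, 4, 450e6)

print(f"GeForce 256 fill rate: {geforce256_fill / 1e6:.0f} Mpixels/s")
print(f"NV40 pixel MADD peak:  {nv40_pixel_gflops:.1f} GFLOPS")
print(f"NV40 vertex MADD peak: {nv40_vertex_gflops:.1f} GFLOPS")
```

The same two functions can be applied to the other parts in this deck by substituting their unit counts and clock speeds.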
Why NV40 series was better
– Massive parallelism
– Scalability
  – Computation power
  – Lower-end products have fewer pixel pipes and fewer vertex shader units
– 222 million transistors
– First to comply with Microsoft's DirectX 9.0c spec (Shader Model 3.0)
– Dynamic branching in pixel shaders

Dynamic Branching
– Helps detect whether a pixel needs shading
– Instruction flow is handled in groups of pixels
– Specify branch granularity (the number of consecutive pixels that take the same branch)
– Better distribution of blocks of pixels between the different quad engines

[Figure: Dynamic Branching]

GeForce 7 series
7800 GT ($449):
– 7 vertex units
– 20 pixel pipelines
– Clock speed 400 MHz
– Memory clock speed 500 MHz
7800 GTX ($600):
– 8 vertex units
– 24 pixel pipelines
– Clock speed 430 MHz
– Memory clock speed 600 MHz

GeForce 7800
– 302 million transistors
– 200 gigaflops of multiply/add calculations
– 128-bit floating point precision (four 32-bit components) through the entire rendering pipeline
– Fill rate: 10.3 gigatexels/second
– 860 million vertices/second

[Figure: GeForce 7800]

ALU Units in Pixel Processor
Sub-unit 1:
– NV40: handles texture data and can issue a MUL vector instruction, or use its mini-ALU to issue a non-vector instruction
– G70: same, but can also issue a multiply/add
Sub-unit 2:
– NV40: can issue a multiply/add vector instruction, or use its own mini-ALU to issue a non-vector instruction
– G70: same

GeForce 6 vs. GeForce 7
– ALU units: G70 has 24, NV40 has 16
– Register file: same size
– Texture samplers are the same, but when fetching large textures in preparation for filtering, G70's samplers have less latency pulling those textures out of memory

GeForce 6 vs. GeForce 7 (speculative)
– Increased L2 texture cache (to around 12 KB)
– Better cache reuse with larger textures, decompressing those larger textures into L1 faster
– Possibly more granularity in cache access by the GPU, to reduce texture bandwidth and speed up rendering

GeForce 6 vs. GeForce 7
– 33% more vertex units, each with more performance
– Improved vertex fetch unit (unconfirmed by NVIDIA)
– Triangle setup and rasterizer optimized via a new raster pattern (again unconfirmed by NVIDIA)

[Figure: General Diagram (7800/G70)]

32-bit IEEE floating point throughout the pipeline (NV40)
– Framebuffer
– Textures
– Fragment processor
– Vertex processor
– Interpolants
GeForce 7800 (G70) supports 128 bit through the entire pipeline!

Hardware supports several other data types
Fragment processor also supports:
– 16-bit "half" floating point
– 12-bit fixed point
– These may be faster than 32-bit on some hardware
Framebuffer/textures also support:
– Large variety of fixed-point formats
– E.g., classical 8 bits per component
– These formats use less memory bandwidth than FP32

How are current GPUs different from CPUs?
GPU is a stream processor
– Multiple programmable processing units
– Connected by data flows
[Diagram: Application → Vertex Processor → Assembly & Rasterization → Fragment Processor → Framebuffer Operations → Framebuffer, with Textures feeding the processors]

How are current GPUs different from CPUs?
Optimized for 4-vector arithmetic
– Useful for graphics: colors, vectors, texcoords
– Easy way to get high performance/cost
– SIMD/MIMD

GPU Memory Model vs. CPU's
Much more restricted memory access
– Allocate/free memory only before computation
– Limited memory access during computation (kernel)
Registers
– Read/write
Local memory
– Does not exist
Global memory
– Read-only during computation
– Write-only at end of computation (pre-computed address)
Disk access
– Does not exist

GPU Memory Model
Where is GPU data stored?
– Vertex buffer
– Frame buffer
– Texture
[Diagram (VS 3.0 GPUs): Texture, Vertex Buffer, Vertex Processor, Rasterizer, Fragment Processor, Frame Buffer(s)]

GPGPU and Motivation
GPUs are fast…
– Itanium: 6.4 GFLOPS
– GeForce 7800: 200 GFLOPS
GPUs are getting faster, faster
– CPUs: annual growth 1.5×, decade growth 60×
– GPUs: annual growth > 2.0×, decade growth > 1000×

Motivation: Computational Power
[Chart comparing GPU and CPU computational power. Courtesy Naga Govindaraju]

GPGPU
– Good for inherently parallel applications
– Rapidly evolving ISA and hardware architecture (largely secret)
– Can't simply "port" code written for the CPU!

Programs are Shaders
Bound by the specific hardware profile:
– E.g., different cards support different hardware features, OpenGL has different restrictions than DirectX, etc.
Hardware profiles change relatively drastically as new GPUs are developed
– But new profiles typically only add features, so backwards compatibility is generally (but not always) preserved

Vertex processor
– 256 instructions per program originally (effectively higher with branching); now up to 65535 instructions
– Executes on all vertices
– Outputs new vertices or texture coordinates, etc.

[Figure: Fragment Processor Flow Chart]

Fragment processor has flexible texture mapping
– Memory is accessible through texture reads
– Texture reads are just another instruction
– Allows computed texture coordinates, nested to arbitrary depth
– Allows multiple uses of a single texture unit

Additional fragment processor capabilities
– Read access to window-space position
– Read/write access to fragment Z
– Built-in derivative instructions
  – Partial derivatives w.r.t. screen-space x or y
  – Useful for anti-aliasing
– Conditional fragment-kill instruction
– Multiple FP formats supported

Fragment processor limitations
– Originally no branching; now supports dynamic branching (but it is still costly)
– No indexed reads from registers; use texture reads instead
– No memory writes

[Table: Branching Instruction Costs (GeForce 6800)]

Fragment shaders
– Originally very limited in size (only 96 instructions), now expanded to 65535+ instructions
– New cards support dynamic branching (but it still incurs some performance penalty)
– Now able to output to multiple render targets

CineFX 4.0 Engine
– A redesigned vertex shader unit reduces the time to set up and perform geometry processing.
– A new pixel shader unit design can carry out twice as many floating-point operations and greatly accelerates other mathematical operations to increase throughput.
– An advanced texture unit incorporates new hardware algorithms and better caching to speed filtering and blending operations.

Vertex Shaders
– The 7800 has 8 vertex shaders
– The triangle setup stage turns the vertex points into a triangle
– It also determines mathematically the rasterization for each triangle
– Accelerating triangle setup increases the total throughput of the 3D pipeline

[Figure: Theoretical Rasterization Pattern of a Triangle]

New Pixel Shader – MADD
– Multiply and accumulate are commonly used math functions in 3D graphics
– MADD stands for multiply-add
– The 7800 can do twice as many MADD operations as previous GPUs could
– This allows developers to create much more complex visual effects

Transparency Adaptive Supersampling
– Takes extra passes over thin-lined objects such as chain-link fences or trees to enhance quality
– Pixels inside a polygon are usually not touched by anti-aliasing methods
– With this, a key set is devised, and those pixels are anti-aliased, creating a smoother image.
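
To see why extra samples help with thin alpha-tested geometry such as a fence wire, here is a minimal sketch that compares one sample per pixel with a 4×4 grid of samples. It is not NVIDIA's algorithm: the `wire_alpha` texture, the wire position, and the regular sample grid are all assumptions made for illustration.

```python
def wire_alpha(x, y):
    """Hypothetical alpha-tested texture: opaque only on a thin vertical wire."""
    return 1.0 if 0.1 <= x <= 0.3 else 0.0


def pixel_coverage(alpha_fn, n):
    """Average alpha over an n-by-n grid of sample points inside one unit pixel."""
    total = 0.0
    for j in range(n):
        for i in range(n):
            # Sample at the centre of each grid cell
            x = (i + 0.5) / n
            y = (j + 0.5) / n
            total += alpha_fn(x, y)
    return total / (n * n)


single_sample = pixel_coverage(wire_alpha, 1)  # pixel centre only
supersampled = pixel_coverage(wire_alpha, 4)   # 16 samples per pixel

print("1 sample:  ", single_sample)
print("16 samples:", supersampled)
```

With a single sample at the pixel centre the wire is missed entirely, so it would flicker in and out as it moves. With 16 samples the pixel receives a fractional coverage value, which is what produces the smoother edge.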
[Figure: Transparency Adaptive Supersampling]

Transparency Adaptive Multisampling
– Higher levels of performance, because it uses one texel to determine other subpixel values
– Not as high quality

Supporting the Future
The 7800 is already set up to support the new Microsoft Longhorn OS, with some of the following advancements:
– Video post-processing
– Real-time desktop compositing
– Seamless multiple 3D applications
– Accelerated antialiased text rendering
– Special effects and animation

Accelerated Graphics Port (AGP)
– AGP is superior to PCI because it provides a dedicated pathway between the slot and the processor
– Uses sideband addressing
– PCI must load a texture from the hard drive into the system's RAM, then from RAM into the GPU framebuffer
– AGP can read textures directly from system RAM by "tricking" the CPU into believing the textures are in the framebuffer, when they are really in memory

PCI Express
– Based on the PCI system, allowing for backwards compatibility
– Uses 1-bit, bidirectional lanes (PCI used a bus)
– Each lane supports 250 MB/s in each direction (4 GB/s total for 16 lanes); AGP is only 2 GB/s

Scalable Link Interface (SLI)
– Takes advantage of the PCI Express bus, which allows more than one discrete graphics device on the same PCI host
– Allows two of the same GeForce GPUs to run in one machine, thus "sharing" the load.
There are two modes for this:
– Split-frame Rendering (SFR)
– Alternate-frame Rendering (AFR)

Split-frame Rendering
– Has each GPU render a portion of the screen, split horizontally
– No extra latency
– Not necessarily evenly split: SFR is load shared, so it splits up the frame by the amount of work, not the size
– A large amount of overhead is involved, giving a maximum speedup of around 1.8 times

Alternate-frame Rendering
– Avoids all the overhead problems of SFR
– Many buffer swaps
– Reliant on the speed of the processor
– Can cause latency issues
– The mode recommended by NVIDIA

GeForce Go 7800 GTX
– The mobile version of the 7800 GTX
– Everything from the desktop release has been carried over
– Can switch between x1 and x16 lanes of PCI Express
– Uses PowerMizer 6.0, which allows this chip to operate in the same envelope as its predecessor, the 6800

GeForce Go 7800 – Power Issues
– Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
– Dynamic clock scaling can run as slow as 16 MHz
  – This is true for the engine, memory, and pixel clocks
– Heavier use of clock gating than the desktop version
– Runs at voltages lower than any other mobile performance part
– Regardless, you won't get much battery-based runtime for a 3D game

Questions?
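
As a follow-up to the split-frame rendering slide, the idea of splitting a frame by amount of work rather than by size can be sketched as follows. This is only an illustration under stated assumptions: the per-row cost estimates and the `balanced_split` helper are hypothetical, and the slides do not describe how the driver actually chooses the split line.

```python
def balanced_split(row_costs):
    """Return the row index k that best balances the work above and below it.

    The top GPU renders rows [0, k) and the bottom GPU renders rows [k, end).
    row_costs is a hypothetical per-row estimate of rendering work.
    """
    total = sum(row_costs)
    best_k, best_gap = 0, total
    running = 0  # work assigned to the top GPU so far
    for k in range(len(row_costs) + 1):
        gap = abs(total - 2 * running)  # |bottom work - top work|
        if gap < best_gap:
            best_k, best_gap = k, gap
        if k < len(row_costs):
            running += row_costs[k]
    return best_k


# Assumed example: the lower rows of the frame are more expensive to render
costs = [1, 1, 1, 1, 2, 2, 4, 4]
split = balanced_split(costs)
print("top GPU rows:", split, "bottom GPU rows:", len(costs) - split)
```

In this example the top GPU takes six cheap rows and the bottom GPU takes two expensive ones, so both end up with the same estimated work even though the screen areas differ.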