Graphics Hardware
Kurt Akeley
CS248 Lecture 14
8 November 2007
http://graphics.stanford.edu/courses/cs248-07/


Implementation = abstraction (from lecture 2)

[Figure: the NVIDIA GeForce 8800 block diagram (data assembler; vertex, primitive, and fragment thread issue; setup/raster/zcull; eight clusters of stream processors (SP) with texture fetch/filter (TF) units and L1 caches; a thread processor; six L2 cache / framebuffer (FB) partitions), shown beside the OpenGL pipeline: application, vertex assembly, vertex operations, primitive assembly, primitive operations, rasterization, fragment operations, framebuffer. Source: NVIDIA]


Correspondence (by color)

The 8800's units correspond to the stages of the OpenGL pipeline:
- Application <-> application
- Fixed-function assembly processors <-> vertex assembly, primitive assembly, and rasterization (fragment assembly)
- Application-programmable parallel processor <-> vertex operations, primitive operations, and fragment operations
- Fixed-function framebuffer operations <-> framebuffer

[Figure: the two diagrams above, color-coded by correspondence; an annotation ("this was missing") marks the data assembler.]


Why does graphics hardware exist?

Special-purpose hardware tends to disappear over time:
- Lisp machines and CAD workstations of the 80s
- CISC CPUs

[Images: Intel iAPX 432 (circa 1982), www.dvorak.org/blog/; Symbolics Lisp machines (circa 1984), www.abstractscience.freeserve.co.uk/symbolics/photos/]


Why does graphics hardware exist?

Graphics acceleration has been around for 40 years. Why do GPUs remain? A confluence of four things:
- Performance differentiation: GPUs are much faster than CPUs at 3-D rendering tasks.
- Work-load sufficiency: the accelerated 3-D rendering tasks make up a significant portion of the overall processing, so Amdahl's law doesn't limit the resulting performance increase (see the worked example after this list).
- Strong market demand: customer demand for 3-D graphics performance is strong, driven by the games market.
- Ubiquity: with the help of standardized APIs/architectures (OpenGL and Direct3D), GPUs have achieved ubiquity in the PC market. Inertia now works in favor of continued graphics hardware.
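
To make the Amdahl's-law point concrete, here is a small worked example (not from the slides; the fraction p of frame time that is accelerated rendering work and the acceleration factor s are made-up values). The overall speedup is 1 / ((1 - p) + p / s):

    #include <stdio.h>

    /* Amdahl's law: overall speedup when fraction p of the work is
     * accelerated by a factor s and the remaining (1 - p) is not. */
    double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        printf("p = 0.5, s = 100: %.2fx overall\n", amdahl(0.5, 100.0)); /* ~1.98x */
        printf("p = 0.9, s = 100: %.2fx overall\n", amdahl(0.9, 100.0)); /* ~9.17x */
        return 0;
    }

At p = 0.5 even an infinitely fast GPU yields less than a 2x overall speedup; only because rendering dominates frame time (large p) does the GPU's performance differentiation pay off.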
NVIDIA 8800 Ultra

    Stream processors                    128
    Peak floating-point performance      400+ GFLOPS
    Memory                               768 MB
    Memory bandwidth                     103.7 GB/sec
    Triangle rate (vertex rate)          306 million/sec (est.)
    Texture fill rate (fragment rate)    39.2 billion/sec


NVIDIA performance trends

    Year  Product             Tri rate  CAGR   Tex rate  CAGR
    1998  Riva ZX                 3m     -        100m    -
    1999  Riva TNT2               9m    3.0       350m   3.5
    2000  GeForce2 GTS           25m    2.8       664m   1.9
    2001  GeForce3               30m    1.2       800m   1.2
    2002  GeForce Ti 4600        60m    2.0      1200m   1.5
    2003  GeForce FX            167m    2.8      2000m   1.7
    2004  GeForce 6800 Ultra    170m    1.0      6800m   2.7
    2005  GeForce 7800 GTX      215m    1.2      6800m   1.0
    2006  GeForce 7900 GTX      260m    1.3     15600m   2.3
    2007  GeForce 8800 Ultra    306m    1.2     39200m   2.5
    Overall CAGR                        1.7             1.9

Yearly growth is well above 1.5x (Moore's law).


SGI performance trends (depth buffered)

    Year  Product           ZTri rate  CAGR   Zbuf rate  CAGR
    1984  Iris 2000             1k      -        100k     -
    1988  GTX                 135k     3.6        40m    4.5
    1992  RealityEngine         2m     2.0       380m    1.8
    1996  InfiniteReality      12m     1.6      1000m    1.3
    Overall CAGR                       2.2               2.2

Yearly growth well above 1.5x (Moore's law).


CPU performance CAGR has been slowing

[Figure: CPU performance versus year, showing the slowing growth rate. Source: Hennessy and Patterson]


The situation could change …

- CPUs are becoming much more parallel. The CPU performance increase (1.2x to 1.5x per year) is low compared with the GPU increase (1.7x to 2x per year), but this could change now with CPU parallelism (many-core).
- The vertex pipeline architecture is getting old. Approaches such as ray tracing offer many advantages, but the vertex pipeline is poorly optimized for them.
- The work-load argument is somewhat circular, because the brute-force algorithms employed by GPUs inflate their own performance demands.

GPUs have evolved and will continue to evolve, but a revolution is always possible.


Outline

The rest of this lecture is organized around the four ideas that most informed the design of modern GPUs (as enumerated by David Blythe in this lecture's reading assignment):
- Parallelism
- Coherence
- Latency
- Programmability

I'll continue to use the NVIDIA 8800 as a specific example.


Parallelism


Graphics is "embarrassingly parallel"

Many separate tasks (the types I keep talking about):

    struct { float x, y, z, w;
             float r, g, b, a; }   vertex;

    struct { vertex v0, v1, v2; }  triangle;

    struct { short int x, y;
             float depth;
             float r, g, b, a; }   fragment;

    struct { int depth;
             byte r, g, b, a; }    pixel;

There are no "horizontal" dependencies between work elements, and few "vertical" ones (in-order execution).


Data and task parallelism

- Data parallelism: simultaneously doing the same thing to similar data, e.g. transforming vertexes; some variance in "same thing" is possible (see the sketch after this list).
- Task parallelism: simultaneously doing different things, e.g. the tasks (stages) of the vertex pipeline.
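
A minimal sketch of the data-parallel case, using the vertex type from the slides (the typedef, the function name, and the column-vector matrix convention are illustrative). No iteration reads or writes any other vertex (the "no horizontal dependencies" property), so the iterations may run in any order, or simultaneously on parallel data paths:

    /* Apply the same 4x4 transform to every vertex: data parallelism. */
    typedef struct { float x, y, z, w;
                     float r, g, b, a; } vertex;

    void transform_vertices(vertex *v, int n, const float m[4][4])
    {
        for (int i = 0; i < n; i++) {   /* each iteration is independent */
            float x = v[i].x, y = v[i].y, z = v[i].z, w = v[i].w;
            v[i].x = m[0][0]*x + m[0][1]*y + m[0][2]*z + m[0][3]*w;
            v[i].y = m[1][0]*x + m[1][1]*y + m[1][2]*z + m[1][3]*w;
            v[i].z = m[2][0]*x + m[2][1]*y + m[2][2]*z + m[2][3]*w;
            v[i].w = m[3][0]*x + m[3][1]*y + m[3][2]*z + m[3][3]*w;
        }
    }

Task parallelism, by contrast, would run different functions (transform, clip, rasterize) concurrently on different work elements.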
Trend from pipeline to data parallelism

[Figure: three generations of geometry processing: the Clark "Geometry Engine" (1983), the SGI 4D/GTX (1988), and the SGI RealityEngine (1992). Stage labels include coordinate/normal transform, lighting, 6-plane frustum clip testing and clipping, divide by w, primitive assembly, backface cull, and viewport; the earlier designs chain these stages as a task pipeline (the GTX fed by a command processor), while the RealityEngine distributes work round-robin across parallel engines and aggregates the results.]


Load balancing

- Easy for data parallelism.
- Challenging for task parallelism:
  - Static balance is difficult to achieve, and would be insufficient anyway:
    - Mode changes affect execution time (e.g., complex lighting).
    - Worse, data can affect execution time (e.g., clipping).
- Unified architectures ease pipeline balance: the pipeline is virtual, and processors are assigned to stages as required. The 8800 is unified.


Unified pipeline architecture

[Figure: the GeForce 8800 / OpenGL pipeline correspondence diagram repeated; vertex, primitive, and fragment operations all run on the single application-programmable parallel processor.]


Queueing

FIFO buffering (first-in, first-out) is provided between task stages:
- Accommodates variation in execution time.
- Provides the elasticity needed for unified load balancing to work.

FIFOs can also be unified: multiple head-tail pairs share a single large memory, with space allocated as required.


In-order execution

- Work elements must be sequence-stamped.
- The FIFOs can then serve as reorder buffers as well.


Coherence


Two aspects of coherence

- Data locality: the data required for a computation are "near by".
- Computational coherence: similar sequences of operations are being performed.


Data locality

Prior to texture mapping, the vertex pipeline was a stream processor:
- Each work element (vertex, primitive, fragment) carried all the state it needed.
- Modal state was local to the pipeline stage.
- Assembly stages operated on adjacent work elements.
- Data locality was inherent in this model.

Post texture mapping:
- All application-programmable stages have memory access (and use it), so the vertex pipeline is no longer a stream processor.
- Data locality must be fought for …


Post-texture mapping data locality (simplified)

Modern memory (DRAM) operates in large blocks:
- Memory is a 2-D array; access is to an entire row.
- To make efficient use of memory bandwidth, all the data in a block must be used.

Two things can be done:
- Aggregate read and write requests (the memory controller and cache, a complex part of GPU design).
- Organize memory contents coherently (blocking).


Texture blocking

[Figure: 6-D texture organization. Cache size: 4x4 blocks; cache line size: 4x4 texels. A texel address is formed as base | s1 t1 | s2 t2 | s3 t3, so that (s1,t1) select a block, (s2,t2) a cache line within the block, and (s3,t3) a texel within the line. Source: Pat Hanrahan]
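
A sketch of the blocked address computation in the figure, under illustrative assumptions: power-of-two texture dimensions, 4x4-texel cache lines, 4x4-line blocks, and row-major packing within each field (the exact interleave on the slide may differ). The point is that texels near by in (s,t) land near by in memory, so a DRAM row access is well used:

    /* Blocked ("6-D") texel addressing: (s1,t1) pick a block, (s2,t2)
     * a 4x4-texel cache line within it, (s3,t3) a texel within the line. */
    unsigned blocked_address(unsigned base, unsigned s, unsigned t,
                             unsigned blocks_per_row)
    {
        unsigned s3 = s & 3,        t3 = t & 3;        /* texel in line  */
        unsigned s2 = (s >> 2) & 3, t2 = (t >> 2) & 3; /* line in block  */
        unsigned s1 = s >> 4,       t1 = t >> 4;       /* block in image */

        unsigned block = t1 * blocks_per_row + s1;
        unsigned line  = t2 * 4 + s2;
        unsigned texel = t3 * 4 + s3;

        /* 16 lines per block, 16 texels per line. */
        return base + (block * 16 + line) * 16 + texel;
    }

With a linear scanline layout, the texels under a small screen-space neighborhood span many DRAM rows; with this layout they usually fall within one or a few cache lines.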
Computational coherence

Data parallelism is computationally coherent: because the same thing is being done simultaneously to similar data, a single instruction fetch-and-execute sequencer can be shared by multiple data paths, each processing its own work element (vertex, primitive, or fragment). This is SIMD – Single Instruction, Multiple Data.


SIMD processing

[Figure: the GeForce 8800 block diagram with one of its eight 16-wide SIMD processors outlined.]

Why not use one 128-wide processor?


SIMD conditional control flow

- The "shader" abstraction operates on each data element independently, but a SIMD implementation shares a single execution unit across multiple data elements.
- If data elements in the same SIMD unit branch differently, the execution unit must follow both paths (sequentially).
- The solution is predication:
  - Both paths are executed; each data path is enabled only during its selected path.
  - Predication can be nested.
  - Performance is obviously lost!
- SIMD width is a compromise:
  - Too wide: too much performance is lost to predication.
  - Too narrow: the hardware implementation is inefficient.

(This answers the question above: a single 128-wide processor would lose far more performance to predication than eight 16-wide processors do.)


Latency


Again, two issues

- Overall rendering latency:
  - Typically measured in frames.
  - Of concern to application programmers.
  - Short on modern GPUs (more from Dave Oldcorn on this), but GPUs with longer rendering latencies have been designed; they are fun to talk about in a graphics architecture course.
- Memory access latency:
  - Typically measured in clock cycles (and reaching thousands of those).
  - Of direct concern to GPU architects and implementors.
  - But useful for application programmers to understand too!


Multi-threading

Another kind of processor virtualization:
- Unified GPUs share a single execution engine among multiple pipeline (task) stages, equivalent to CPU multi-tasking.
- Multi-threading shares a single execution engine among multiple data-parallel work elements, similar to CPU hyper-threading.
- The 8800 Ultra multi-threading mechanism is used to support both multi-tasking and data-parallel multi-threading.

A thread is a data structure:

    struct {
        int   pc;      // program counter
        float reg[n];  // live register state
        enum  ctxt;    // context information
        …
    } thread;

More live registers mean more memory usage.

[Figure: the GeForce 8800 block diagram repeated, with the thread processor.]


Multi-threading hides latency

- A memory reference (or the resulting data dependency) moves a thread from the ready-to-run pool to the blocked pool.
- When the memory data become available (the dependency is resolved), the thread returns to the ready-to-run pool.
- The processor stalls if no threads are ready to run, a possible result of large thread contexts (too many live registers): fewer such threads fit on chip.
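
A toy simulation of this mechanism (not the 8800's actual scheduler; the thread count, the 200-cycle memory latency, and the every-fourth-instruction-misses workload are made-up parameters). Each memory reference blocks the issuing thread; every cycle the processor runs any ready thread, and it stalls only when all threads are blocked:

    #include <stdio.h>

    enum state { READY, BLOCKED };

    struct thread {
        int pc;              /* program counter                   */
        enum state st;       /* ready, or blocked on memory       */
        int wake_cycle;      /* cycle at which the memory returns */
    };

    #define NTHREADS    8
    #define MEM_LATENCY 200
    #define CYCLES      10000

    int main(void)
    {
        struct thread t[NTHREADS] = {{0}};
        int stalls = 0;

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            int issued = 0;
            for (int i = 0; i < NTHREADS; i++)      /* resolve dependencies */
                if (t[i].st == BLOCKED && cycle >= t[i].wake_cycle)
                    t[i].st = READY;
            for (int i = 0; i < NTHREADS && !issued; i++) {
                if (t[i].st == READY) {
                    t[i].pc++;                      /* execute one instruction */
                    if (t[i].pc % 4 == 0) {         /* every 4th op references memory */
                        t[i].st = BLOCKED;
                        t[i].wake_cycle = cycle + MEM_LATENCY;
                    }
                    issued = 1;
                }
            }
            if (!issued)
                stalls++;                           /* no thread ready to run */
        }
        printf("stalled on %d of %d cycles\n", stalls, CYCLES);
        return 0;
    }

With these numbers, 8 threads cannot cover a 200-cycle latency and the processor stalls most of the time; raising NTHREADS (or shrinking each thread's context so more threads fit in the thread store) drives the stall count toward zero. That is the trade-off behind "more live registers mean more memory usage."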
Cache and thread store

CPU:
- Uses cache to hide memory latency.
- Caches are huge (many MBs).

GPU:
- Uses cache to aggregate memory requests and maximize effective bandwidth; caches are relatively small.
- Uses multithreading to hide memory latency; the thread store is large.

Total memory usage on CPU and GPU chips is becoming similar …


Programmability


Programmability trade-offs

Fixed-function:
- Efficient in die area and power dissipation
- Rigid in functionality
- Simple

Programmable:
- Wasteful of die area and power
- Flexible and adaptable
- Able to manage complexity


Programmability is not new

The Silicon Graphics VGX (1990) supported programmable vertex, primitive, and fragment operations:
- These operations are complex and require flexibility and adaptability.
- The assembly operations, by contrast, are relatively simple and have few options.
- Texture fetch and filter are also simple, and benefit from fixed-function implementation.

What is new is allowing application developers to write vertex, primitive, and fragment shaders.


Questions


Why insist on in-order processing? (Even Direct3D 10 does.)

- Testability (repeatability)
- Invariance for multi-pass rendering (repeatability)
- Utility of the painter's algorithm
- State assignment!


Why can't fragment shaders access the framebuffer?

Equivalent to asking: why do other people's block diagrams distinguish between fragment operations and framebuffer operations?

Simple answer: cache consistency.


Why hasn't tiled rendering caught on?

It seems very attractive:
- Small framebuffer (that can be on-die in some cases)
- Deep framebuffer state (e.g., for transparency sorting)
- High performance

Problems:
- May increase rendering latency
- Has difficulty with multi-pass algorithms
- Doesn't match the OpenGL/Direct3D abstraction


Summary

- Parallelism: graphics is inherently highly data- and task-parallel. Challenges include in-order execution and load balancing.
- Coherence: streaming is inherently data- and instruction-coherent, but texture fetch breaks the streaming model and its data coherence. Reference aggregation and memory layout restore data coherence.
- Latency: modern GPU implementations have minimal rendering latency. Multithreading (not caching) hides the (large) memory latency.
- Programmability: the "operation" stages are (and have long been) programmable; the assembly stages, texture filtering, and ROPs typically are not. Application programmability is new.


Assignments

- Next lecture: Performance Tuning and Debugging (guest lecturer Dave Oldcorn, AMD).
- Reading assignment for Tuesday's class: sections 2.8 (vertex arrays) and 2.9 (buffer objects) of the OpenGL 2.1 specification. (A minimal preview sketch follows.)
- Short office hours today.
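
As a preview of Tuesday's reading, a minimal vertex-array-plus-buffer-object sketch in OpenGL 2.1 style (the triangle data and variable names are illustrative; a current GL context, headers exposing the GL 1.5+ buffer-object entry points, and error checking are all assumed):

    #include <GL/gl.h>

    static const GLfloat verts[] = {    /* x, y for one triangle */
        -0.5f, -0.5f,
         0.5f, -0.5f,
         0.0f,  0.5f,
    };

    void draw_triangle(void)
    {
        GLuint vbo;

        glGenBuffers(1, &vbo);                       /* create a buffer object (sec. 2.9) */
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, sizeof(verts), verts, GL_STATIC_DRAW);

        glEnableClientState(GL_VERTEX_ARRAY);        /* vertex arrays (sec. 2.8) */
        glVertexPointer(2, GL_FLOAT, 0, (void *)0);  /* offset into the bound buffer */

        glDrawArrays(GL_TRIANGLES, 0, 3);

        glDisableClientState(GL_VERTEX_ARRAY);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        glDeleteBuffers(1, &vbo);
    }

Sourcing vertex data from a buffer object lets the driver keep it in GPU-accessible memory across frames, rather than re-reading it from the application's address space on every draw.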