History of GPUs

Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

Make great images:
- Intricate shapes
- Complex optical effects
- Seamless motion

Make them fast:
- Invent clever techniques
- Use every trick imaginable
- Build monster hardware

[Image credit: Eugene d’Eon, David Luebke, Eric Enderton, in Proc. EGSR 2007 and GPU Gems 3]

History of GPUs – Slide 2

[Figure: the classic graphics pipeline: Vertex Transform & Lighting → Triangle Setup & Rasterization → Texturing & Pixel Shading → Depth Test & Blending → Framebuffer]

History of GPUs – Slide 3

[Figure: the same pipeline diagram]

History of GPUs – Slide 4

[Figure: the pipeline diagram, Vertex Transform & Lighting stage highlighted]
- Transform from “world space” to “image space”
- Compute per-vertex lighting

History of GPUs – Slide 5

[Figure: the pipeline diagram, Triangle Setup & Rasterization stage highlighted]
- Convert geometric representation (vertex) to image representation (fragment)
- Interpolate per-vertex quantities across pixels

History of GPUs – Slide 6

[Figure: the same pipeline diagram]

History of GPUs – Slide 7

[Figure: simplified pipeline: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]
- Key abstraction of real-time graphics
- Hardware used to look like this
- One chip/board per stage
- Fixed data flow through the pipeline

History of GPUs – Slide 8

- Everything fixed function, with a certain number of modes
- Number of modes for each stage grew over time
- Hard to optimize hardware
- Developers always wanted more flexibility

History of GPUs – Slide 9

- Remains a key abstraction
- Hardware used to look like this
- Vertex and pixel processing became programmable; new stages added
- GPU architecture increasingly centers around shader execution

History of GPUs – Slide 10

- Exposing an (at first limited) instruction set for some stages
- Limited instructions and instruction types, and no control flow at first
- Expanded to a full ISA

History of GPUs – Slide 11

Workload and programming model provide lots of parallelism:
- Applications provide large groups of vertices at once
  - Vertices can be processed in parallel
  - Apply the same transform to all vertices
- Triangles contain many pixels
  - Pixels from a triangle can be processed in parallel
  - Apply the same shader to all pixels
- Very efficient hardware to hide serialization bottlenecks

History of GPUs – Slide 12

[Figure: vertices (Vrtx 0–2) and pixels (Pixel 0–3) flowing through parallel Vertex → Raster → Pixel → Blend pipelines]

History of GPUs – Slide 13

Note that we do the same thing for lots of pixels/vertices.

[Figure: one control unit driving many ALUs, i.e., SIMD execution]
- A warp = 32 threads launched together
- Usually execute together as well
  (see the CUDA sketches following slide 23 below)

History of GPUs – Slide 14

All this performance attracted developers. To use GPUs, they re-expressed their algorithms as general-purpose computations, using GPUs and the graphics API in applications other than 3-D graphics:
- Pretend to be graphics: disguise data as textures or geometry, disguise the algorithm as render passes
- Fool the graphics pipeline into doing computation, exploiting the massive parallelism of the GPU
- GPU accelerates the critical path of the application

History of GPUs – Slide 15

Data parallel algorithms leverage GPU attributes:
- Large data arrays, streaming throughput
- Fine-grain SIMD parallelism
- Low-latency floating point (FP) computation

Applications (see http://GPGPU.org):
- Game effects (FX), physics, image processing
- Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

History of GPUs – Slide 16

Dealing with the graphics API:
- Working with the corner cases of the graphics API
Addressing modes:
- Limited texture size/dimension
Shader capabilities:
- Limited outputs
Instruction sets:
- Lack of integer & bit ops
Communication limited:
- Between pixels
- Scatter: a[i] = p

[Figure: the fragment-program register model: Input Registers, Texture, Constants, and Temp Registers feed a Fragment Program, which writes Output Registers and FB Memory; one instance per thread, per shader, per context]

History of GPUs – Slide 17

- To use GPUs, algorithms had to be re-expressed as graphics computations
- Very tedious, limited usability
- Still produced some very nice results
- This was the lead-up to CUDA

History of GPUs – Slide 18

CUDA:
- General-purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data parallel co-processor
- Targeted software stack
  - Compute-oriented drivers, language, and tools

History of GPUs – Slide 19

Driver for loading computation programs into the GPU:
- Standalone driver, optimized for computation
- Interface designed for compute: graphics-free API
- Data sharing with OpenGL buffer objects
- Guaranteed maximum download & readback speeds
- Explicit GPU memory management

History of GPUs – Slide 20

[Figure: the CPU (host) connected to a GPU with local DRAM (the device)]

History of GPUs – Slide 21

- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
- Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- Programming model scales transparently

[Images: GeForce 8800, Tesla D870]

History of GPUs – Slide 22

- Programmable in C with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism

[Image: Tesla S870]

History of GPUs – Slide 23
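Before summarizing, a few CUDA sketches make the preceding ideas concrete. First, slide 12’s “apply the same transform to all vertices” as a kernel with one thread per vertex. This is a minimal sketch; the names (transformVertices, m, in, out) are illustrative, not from the original slides.

    // Minimal sketch: the per-vertex work of slide 5 expressed as a CUDA
    // kernel, one thread per vertex, every thread applying the same 4x4
    // transform (slide 12). All names are illustrative.
    __global__ void transformVertices(const float *m,   // 4x4 row-major matrix
                                      const float4 *in, // world-space vertices
                                      float4 *out,      // image-space results
                                      int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float4 v = in[i];
            out[i] = make_float4(
                m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
                m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
                m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
                m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w);
        }
    }

This is exactly the fixed-function work of slide 5 re-expressed as software.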
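Second, the warp grouping of slide 14 is visible from inside a kernel via the CUDA built-in warpSize (32 on the hardware discussed here). Again a minimal sketch; the kernel name and output arrays are hypothetical.

    // Each block's threads are divided into 32-thread warps (slide 14);
    // threads in the same warp normally execute each instruction together.
    __global__ void warpIds(int *warpOf, int *laneOf, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            warpOf[i] = threadIdx.x / warpSize;  // which warp within the block
            laneOf[i] = threadIdx.x % warpSize;  // lane within the warp (0..31)
        }
    }

Branches that diverge within a warp are serialized, which is why the slides stress doing the same thing for lots of pixels/vertices.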
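Third, pulling slides 17, 19, and 20 together: a complete minimal program that explicitly manages GPU memory, kicks off a batch of threads, and performs the scatter a[i] = p that the graphics API could not express. A sketch only: names are illustrative and error checking is omitted.

    #include <stdio.h>
    #include <cuda_runtime.h>

    // Scatter (slide 17): each thread writes to a computed location.
    __global__ void scatter(int *a, const int *idx, const int *p, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) a[idx[i]] = p[i];
    }

    int main(void)
    {
        const int n = 1024;
        const size_t bytes = n * sizeof(int);
        int h_a[n], h_idx[n], h_p[n];
        for (int i = 0; i < n; ++i) { h_idx[i] = n - 1 - i; h_p[i] = i; }

        // Explicit GPU memory management (slide 20): allocate device DRAM,
        // download inputs, read results back when done.
        int *d_a, *d_idx, *d_p;
        cudaMalloc(&d_a, bytes);
        cudaMalloc(&d_idx, bytes);
        cudaMalloc(&d_p, bytes);
        cudaMemcpy(d_idx, h_idx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_p,   h_p,   bytes, cudaMemcpyHostToDevice);

        // Kick off a batch of threads on the GPU (slide 19): 4 blocks of 256.
        scatter<<<(n + 255) / 256, 256>>>(d_a, d_idx, d_p, n);

        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
        printf("a[0] = %d (expected %d)\n", h_a[0], n - 1);  // a is p reversed

        cudaFree(d_a); cudaFree(d_idx); cudaFree(d_p);
        return 0;
    }

Compared with slide 17, there is no disguising data as textures or algorithms as render passes; the scatter is a single assignment.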
Summary:
- GPUs evolve as hardware and software evolve
- The five-stage graphics pipeline
- An example of GPGPU
- Intro to CUDA

History of GPUs – Slide 24

Reading: Chapter 2, “Programming Massively Parallel Processors” by Kirk and Hwu.

Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- The University of Minnesota: Weijun Xiao
- Stanford University: Jared Hoberock, David Tarjan

Revision history: last updated 5/24/2011.

History of GPUs – Slide 25