Programming Massively Parallel Processors
Chapter 2: GPU Computing History
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE408 / ECE 498AL, University of Illinois, Urbana-Champaign

A Fixed Function GPU Pipeline
[Figure: a fixed-function GPU pipeline. The CPU host drives the GPU through the Host Interface; Vertex Control (backed by a Vertex Cache) feeds the VS/T&L stage (vertex shading, transform, and lighting), followed by Triangle Setup, Raster, Shader (backed by a Texture Cache), ROP (raster operation), and the FBI (frame buffer interface), which reads and writes Frame Buffer Memory.]

Texture Mapping Example
[Figure: texture mapping example: painting a world map texture image onto a globe object.]

Programmable Vertex and Pixel Processors
[Figure: an example of separate vertex processor and fragment processor in a programmable graphics pipeline. A 3D application or game issues 3D API commands (OpenGL or Direct3D) on the CPU; across the CPU-GPU boundary, the GPU front end consumes the GPU command and data stream. Primitive Assembly turns the vertex index stream into assembled polygons, lines, and points; Rasterization & Interpolation produces a pixel location stream; Raster Operations apply pixel updates to the framebuffer. Pre-transformed vertices pass through the Programmable Vertex Processor to become transformed vertices, and rasterized pre-transformed fragments pass through the Programmable Fragment Processor to become transformed fragments.]

Unified Graphics Pipeline
[Figure: a unified graphics pipeline. The host feeds a Data Assembler and a Setup/Rstr/ZCull stage; vertex, geometry, and pixel thread issue units dispatch work onto a single array of streaming processors (SP) grouped with texture fetch (TF) units and L1 caches, backed by L2 caches and frame buffer (FB) partitions and managed by a thread processor.]

G80 CUDA mode – A Device Example
• Processors execute computing threads
• New operating mode/HW interface for computing
[Figure: the G80 in CUDA mode. The host feeds an Input Assembler and a Thread Execution Manager; the processor array is organized into clusters, each with a Parallel Data Cache and texture units, with load/store paths to a shared Global Memory.]

[Figure: how a graphics API manages data across the CPU "host" and the GPU. CPU memory holds vertex data (points x, y, z and colours r, g, b), the CPU code (main()), and the vertex and pixel shader source; a table of identifiers (Vsource, Vobj, Vposition, Vcolour, Psource, Pobj) records the location of each object, whether in main() on the CPU or at an address in GPU memory.]

CUDA
General Purpose Computation using GPU

What is (Historical) GPGPU?
• General Purpose computation using GPU and graphics API in applications other than 3D graphics
  – GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes (a kernel sketch follows this list)
  – Large data arrays, streaming throughput
  – Fine-grain SIMD parallelism
  – Low-latency floating point (FP) computation
• Applications – see //GPGPU.org
  – Game effects (FX) physics, image processing
  – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting
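The common thread in these applications is fine-grain data parallelism: one independent computation per output element. As a preview of the CUDA model introduced at the end of this chapter, the kernel below is a minimal sketch of that mapping, with one thread assigned to each array element; the saxpy name and its signature are illustrative, not taken from the slides. Under the historical GPGPU approach, the same computation had to be phrased as a fragment program, subject to the constraints on the next slide.

    // Minimal device-code sketch of fine-grain data parallelism: one thread per element.
    // The saxpy example and all names here are illustrative, not from the slides.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
        if (i < n)                                      // guard the partial last block
            y[i] = a * x[i] + y[i];                     // independent work on one element
    }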
[Figure: the restricted input and output capabilities of a shader programming model. A fragment program reads per-thread Input Registers, per-shader Texture data, and per-context Constants, computes in Temp Registers, and can write only to Output Registers bound to FB (frame buffer) Memory.]

Previous GPGPU Constraints
• Dealing with graphics API
  – Working with the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of integer & bit ops
• Communication limited
  – Between pixels
  – Scatter: a[i] = p

CUDA
• "Compute Unified Device Architecture"
• General purpose programming model
  – User kicks off batches of threads on the GPU (a host-side sketch follows this list)
  – GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  – Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
  – Standalone driver, optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
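To make the last bullets concrete, the sketch below (a hypothetical example, not code from the slides) shows the host-side flow that the compute-oriented stack exposes: the host explicitly allocates and copies GPU memory, then kicks off a batch of threads as a kernel launch.

    // Minimal host-side sketch (illustrative names, not from the slides): explicit GPU
    // memory management plus launching a batch of threads, with no graphics API involved.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;                       // each thread updates one element
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);      // host (CPU) array
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float *d_data;
        cudaMalloc((void **)&d_data, bytes);                        // explicit GPU memory management
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // download to the device

        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);        // kick off a batch of threads

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);  // readback of results
        printf("h_data[0] = %f\n", h_data[0]);                      // expect 2.0

        cudaFree(d_data);
        free(h_data);
        return 0;
    }

Compared with the historical GPGPU path, nothing here goes through a graphics API: memory is managed directly with cudaMalloc/cudaMemcpy, and the batch of threads is expressed as a grid of thread blocks rather than as pixels of a rendered quad.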