History of GPUs

Prepared 5/24/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
• Make great images
  • Intricate shapes
  • Complex optical effects
  • Seamless motion
• Make them fast
  • Invent clever techniques
  • Use every trick imaginable
  • Build monster hardware
Image credit: Eugene d’Eon, David Luebke, Eric Enderton, in Proc. EGSR 2007 and GPU Gems 3
History of GPUs – Slide 2
Vertex Transform & Lighting
Triangle Setup & Rasterization
Texturing & Pixel Shading
Depth Test & Blending
Framebuffer
History of GPUs – Slide 3
[Pipeline diagram, repeated from the previous slide]
History of GPUs – Slide 4
[Pipeline diagram, highlighting Vertex Transform & Lighting]
• Transform from “world space” to “image space”
• Compute per-vertex lighting
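For concreteness, here is a minimal CUDA-style sketch (an addition, not from the original slides) of the per-vertex lighting term, assuming a simple Lambertian diffuse model; the function name and parameters are illustrative.

__host__ __device__ float diffuse(float3 n, float3 l)
{
    // n: unit surface normal, l: unit direction toward the light
    float d = n.x * l.x + n.y * l.y + n.z * l.z;  // dot(n, l)
    return d > 0.0f ? d : 0.0f;                   // surfaces facing away get 0
}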
History of GPUs – Slide 5
[Pipeline diagram, highlighting Triangle Setup & Rasterization]
• Convert geometric representation (vertex) to image representation (fragment)
• Interpolate per-vertex quantities across pixels
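A minimal sketch of that interpolation (an addition, not from the slides): a per-vertex value is blended with barycentric weights that sum to 1 inside the triangle. Real rasterizers also apply perspective correction; all names here are illustrative.

__host__ __device__ float interpolate(float v0, float v1, float v2,
                                      float b0, float b1, float b2)
{
    // v0..v2: the quantity at the three vertices; b0..b2: barycentric weights
    return b0 * v0 + b1 * v1 + b2 * v2;
}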
History of GPUs – Slide 6
[Pipeline diagram, repeated]
History of GPUs – Slide 7
[Diagram: simplified pipeline – Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]
• Key abstraction of real-time graphics
• Hardware used to look like this
  • One chip/board per stage
  • Fixed data flow through pipeline
History of GPUs – Slide 8
[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]
• Everything fixed function, with a certain number of modes
  • Number of modes for each stage grew over time
• Hard to optimize hardware
• Developers always wanted more flexibility
History of GPUs – Slide 9
[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]
• Remains a key abstraction
• Hardware used to look like this
• Vertex and pixel processing became programmable, new stages added
• GPU architecture increasingly centers around shader execution
History of GPUs – Slide 10
[Diagram: Vertex → Rasterize → Pixel → Test & Blend → Framebuffer]
• Exposing an (at first limited) instruction set for some stages
• Limited instructions and instruction types, and no control flow at first
• Expanded to a full ISA over time
History of GPUs – Slide 11
• Workload and programming model provide lots of parallelism
  • Applications provide large groups of vertices at once
    • Vertices can be processed in parallel
    • Apply the same transform to all vertices (see the sketch after this list)
  • Triangles contain many pixels
    • Pixels from a triangle can be processed in parallel
    • Apply the same shader to all pixels
• Very efficient hardware to hide serialization bottlenecks
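As a concrete illustration (an addition, not from the original slides), a minimal CUDA sketch of “apply the same transform to all vertices”: one thread per vertex, all threads running the same code. The kernel name, the row-major 4×4 matrix m, and the use of float4 positions are assumptions for the example.

__global__ void transformVertices(const float4 *in, float4 *out,
                                  const float *m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per vertex
    if (i >= n) return;
    float4 v = in[i];
    out[i] = make_float4(m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
                         m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
                         m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
                         m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w);
}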
History of GPUs – Slide 12
[Diagram: task and data parallelism in the pipeline – separate Vertex, Raster, Pixel, and Blend units, each with several elements (Vrtx 0–2, Pixel 0–3) in flight at once]
History of GPUs – Slide 13
• Note that we do the same thing for lots of pixels/vertices
[Diagram: SIMD execution – rather than pairing every ALU with its own control unit, one control unit drives many ALUs]
• A warp = 32 threads launched together
  • Usually execute together as well
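A minimal CUDA sketch (an addition, not from the slides) of how a thread can locate itself within this scheme; warpSize is the CUDA built-in, while the kernel and array names are illustrative.

__global__ void warpInfo(int *warpIdOut, int *laneOut)
{
    int t = threadIdx.x;
    warpIdOut[t] = t / warpSize;   // which warp this thread belongs to
    laneOut[t]   = t % warpSize;   // lane (0..31) within that warp
}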
History of GPUs – Slide 14
• All this performance attracted developers
• To use GPUs, they re-expressed their algorithms as general-purpose computations, using the graphics API in applications other than 3-D graphics
  • Pretend to be graphics: disguise data as textures or geometry, disguise the algorithm as render passes
  • Fool the graphics pipeline into doing computation, to take advantage of the massive parallelism of the GPU
• GPU accelerates the critical path of the application
History of GPUs – Slide 15
• Data parallel algorithms leverage GPU attributes
  • Large data arrays, streaming throughput
  • Fine-grain SIMD parallelism
  • Low-latency floating point (FP) computation
• Applications – see http://GPGPU.org
  • Game effects (FX) physics, image processing
  • Physical modeling, computational engineering, matrix algebra, convolution (sketched below), correlation, sorting
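As an example of the data-parallel pattern, a minimal CUDA sketch (an addition, not from the slides) of 1-D convolution with one output element per thread; MASK_WIDTH and all names are assumptions for illustration.

#define MASK_WIDTH 5

__global__ void conv1d(const float *in, const float *mask, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one output per thread
    if (i >= n) return;
    float sum = 0.0f;
    int start = i - MASK_WIDTH / 2;                 // center mask on element i
    for (int j = 0; j < MASK_WIDTH; ++j) {
        int k = start + j;
        if (k >= 0 && k < n) sum += in[k] * mask[j]; // out-of-range reads as 0
    }
    out[i] = sum;
}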
History of GPUs – Slide 16
• Dealing with the graphics API
  • Working with the corner cases of the graphics API
• Addressing modes
  • Limited texture size/dimension
• Shader capabilities
  • Limited outputs
• Instruction sets
  • Lack of integer & bit ops
• Communication limited
  • Between pixels
  • Scatter, i.e. a[i] = p (see the sketch after this list)
[Diagram: fragment-program model – Input Registers feed a Fragment Program, which uses Texture (per thread), Constants and Temp Registers (per shader, per context), and writes Output Registers to FB Memory]
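For contrast, a minimal CUDA sketch (an addition, not from the slides): scatter is just an indexed store in CUDA, which fragment shaders of that era could not express. idx is assumed to hold distinct destinations; duplicates would need atomics.

__global__ void scatter(const float *p, const int *idx, float *a, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) a[idx[t]] = p[t];   // a[i] = p, with i chosen per thread
}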
History of GPUs – Slide 17
• To use GPUs, developers re-expressed algorithms as graphics computations
  • Very tedious, limited usability
  • Still had some very nice results
• This was the lead-up to CUDA
History of GPUs – Slide 18
• General purpose programming model
  • User kicks off batches of threads on the GPU (see the launch sketch after this list)
  • GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
  • Compute-oriented drivers, language, and tools
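A minimal CUDA sketch of this model (an addition, not from the slides; kernel and variable names are illustrative): the host launches a batch of threads, and every thread runs the same kernel on its own element.

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
    if (i < n) data[i] *= s;                        // each thread: one element
}

// Host side: launch enough 256-thread blocks to cover all n elements, e.g.
//   scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);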
History of GPUs – Slide 19
• Driver for loading computation programs into the GPU
  • Standalone driver, optimized for computation
  • Interface designed for compute – graphics-free API
  • Data sharing with OpenGL buffer objects
  • Guaranteed maximum download & readback speeds
  • Explicit GPU memory management (sketched below)
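A minimal sketch of what explicit memory management looks like in the CUDA runtime API (the function and variable names are illustrative additions, not from the slides):

#include <cuda_runtime.h>

// Allocate device memory, copy data down, (run kernels), copy results back.
void roundTrip(float *host, int n)
{
    float *dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev, bytes);                              // allocate on device
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); // download
    // ... launch kernels that read/write dev here ...
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // readback
    cudaFree(dev);                                        // explicit release
}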
History of GPUs – Slide 20
[Diagram: CPU (host) connected to GPU with local DRAM (device)]
History of GPUs – Slide 21
• 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
• Available in laptops, desktops, and clusters
• GPU parallelism is doubling every year
  • Programming model scales transparently
[Images: GeForce 8800, Tesla D870]
History of GPUs – Slide 22
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Image: Tesla S870]
History of GPUs – Slide 23
• GPUs evolve as hardware and software evolve
• The five-stage graphics pipeline
• An example of GPGPU
• Intro to CUDA
History of GPUs – Slide 24
• Reading: Chapter 2, “Programming Massively Parallel Processors” by Kirk and Hwu
• Based on original material from
  • The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
  • The University of Minnesota: Weijun Xiao
  • Stanford University: Jared Hoberock, David Tarjan
• Revision history: last updated 5/24/2011
History of GPUs – Slide 25